Here is a detailed design that solves the problem of storing files permanently
in a backing store. This design generalizes the current classes used to access
files, which are located in package com.continuent.tungsten.common.file and
contain implementations of the FileIo interface, a set of general operations on
files and directories. Clients principally use class JavaFileIo, which implements
operations on local files.
A new FileIo implementation will be provided that builds on the current JavaFileIo
class. In the new implementation local files are considered to be a cache for
files stored in a backing store, which contains permanent copies. (See
http://en.wikipedia.org/wiki/Cache_(computing) for general information on
cache/backing store design.) Clients of the FileIo class continue to operate
on local files as before, with the following variations:
1.) Files are tied to a backing store, which links backing storage to the local
file system at a mount point. Any local file below the mount point is
considered to have a corresponding copy in the backing store. The local file,
if present, is a cached copy of that file.
2.) The new FileIo implementation will mirror the contents of the backing store
to local storage as needed using a set of primitive operations on the backing
storage that amount to get, put, delete, and list commands.
* A request to read a file or its properties (e.g. whether it is readable) will
cause a fetch from the backing store to local storage if needed.
* A request to list a path location will cause a fetch of the corresponding
backing store file or directory.
* A request to read directory properties (readable, exists) will cause the
directory to be created locally.
* A request to write a file will write the file to the backing store.
* A request to remove a file will cause the file to be removed on the backing
store.
* A request to remove a directory will remove the directory and any children in
the backing store.
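The mirroring behavior above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Tungsten code: BackingStore, CachingFileIo, and InMemoryStore are made-up names standing in for the FileIo interface, its new implementation, and a test double; the real interface has many more operations.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The four primitives the backing store must support (illustrative names).
interface BackingStore {
    byte[] get(String path) throws IOException;
    void put(String path, byte[] data) throws IOException;
    void delete(String path) throws IOException;
    List<String> list(String prefix) throws IOException;
}

// Sketch of the cache-on-read / write-through behavior described above.
class CachingFileIo {
    private final Path mountPoint;      // local cache root (the mount point)
    private final BackingStore store;   // permanent copies live here

    CachingFileIo(Path mountPoint, BackingStore store) {
        this.mountPoint = mountPoint;
        this.store = store;
    }

    // A read fetches from the backing store into the local cache first
    // if no local copy exists.
    byte[] read(String relativePath) throws IOException {
        Path local = mountPoint.resolve(relativePath);
        if (!Files.exists(local)) {
            byte[] data = store.get(relativePath);  // fetch on cache miss
            Files.createDirectories(local.getParent());
            Files.write(local, data);               // populate the cache
        }
        return Files.readAllBytes(local);
    }

    // A write goes through to the backing store as well as the cache.
    void write(String relativePath, byte[] data) throws IOException {
        Path local = mountPoint.resolve(relativePath);
        Files.createDirectories(local.getParent());
        Files.write(local, data);
        store.put(relativePath, data);              // synchronous write-through
    }
}

// Trivial in-memory store: the overall cache logic can be unit tested
// against this without any real S3/HDFS access.
class InMemoryStore implements BackingStore {
    private final Map<String, byte[]> files = new HashMap<>();
    public byte[] get(String path) throws IOException {
        byte[] d = files.get(path);
        if (d == null) throw new IOException("no such file: " + path);
        return d;
    }
    public void put(String path, byte[] data) { files.put(path, data); }
    public void delete(String path) { files.remove(path); }
    public List<String> list(String prefix) {
        List<String> out = new ArrayList<>();
        for (String p : files.keySet())
            if (p.startsWith(prefix)) out.add(p);
        return out;
    }
}
```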
3.) The new FileIo implementation will make the simplifying assumption that
there is only a single writer to the backing store. This allows optimizations
such as not fetching files multiple times if there have been no intervening
writes.
4.) Due to failures in distributed systems it will be possible to end up with
two writers on a particular backing store location. This can lead to
corruption of file system contents. This will be solved as follows using a
leasing approach that establishes a temporary lock on the file system.
a.) The head directory on the backing store will have a unique lease file that
contains the following information:
* client identifier -- A unique identifier such as host name + pid.
* lease duration -- e.g. 5 minutes.
* lease timestamp -- This may also be inferred from the time the file is
written to the backing store.
* crc -- Crc of preceding information to detect corruption.
b.) Each time the FileIo implementation reads or writes to the backing store it
will check the lease file. If the file is present but belongs to another
client *and* the lease has not expired, an error will be thrown. Due to the
leasing semantics the problem of client failures will be cured automatically as
soon as the lease expires.
c.) Otherwise it will update the lease file *and* read it back confirming the
crc to mitigate race conditions from two distributed clients updating the lease
simultaneously.
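A minimal sketch of the lease record and the expiry/ownership check from steps a.) and b.), assuming a simple comma-separated file layout with a trailing CRC (the actual on-disk format is unspecified here; all names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical lease record matching the fields listed in step a.).
class Lease {
    final String clientId;      // unique identifier, e.g. host name + pid
    final long durationMillis;  // lease duration, e.g. 5 minutes
    final long timestamp;       // when the lease file was written

    Lease(String clientId, long durationMillis, long timestamp) {
        this.clientId = clientId;
        this.durationMillis = durationMillis;
        this.timestamp = timestamp;
    }

    // Serialize with a trailing CRC so corruption can be detected when
    // the file is read back (step c.).
    String serialize() {
        String body = clientId + "," + durationMillis + "," + timestamp;
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.UTF_8));
        return body + "," + crc.getValue();
    }

    static Lease parse(String text) {
        int cut = text.lastIndexOf(',');
        String body = text.substring(0, cut);
        long expected = Long.parseLong(text.substring(cut + 1));
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.UTF_8));
        if (crc.getValue() != expected)
            throw new IllegalStateException("lease file corrupted");
        String[] f = body.split(",");
        return new Lease(f[0], Long.parseLong(f[1]), Long.parseLong(f[2]));
    }

    // Step b.): an error is warranted only if another client holds the
    // lease *and* it has not yet expired.
    boolean blocks(String ourClientId, long now) {
        return !clientId.equals(ourClientId)
                && now < timestamp + durationMillis;
    }
}
```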
5.) Failures in accessing the backing store can cause loss of synchronization
in the following ways:
* The local store can be ahead of the backing store if a write fails to
propagate up to the backing store. In this case we should discard the local
store.
* The local store can be behind the backing store. In this case it is enough
to refetch data.
In practice it may not be easy to distinguish these cases, so the first
implementation will simply clear the local store whenever it starts to avoid
ambiguities.
6.) This design allows one to implement backing store logic using simple
command line tools like hadoop and s3cmd. All you need is a utility that can
put, fetch, and delete files, as well as list "directory" contents on the
backing store.
7.) The user interface additions visible to replicator end users will be very
simple. Users will supply an extra URL for the backing store location in
replicator configuration files.
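As noted above, the backing store primitives can be built on simple command line tools. A hypothetical sketch of how the four primitives might map onto s3cmd invocations (the class and method names are illustrative; an HDFS variant would build "hadoop fs" commands instead, and the resulting command lists would be run via ProcessBuilder):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative mapping of the put/fetch/delete/list primitives onto the
// standard s3cmd verbs (get, put, del, ls). Only command construction is
// shown here; executing and checking exit codes is omitted.
class S3cmdCommands {
    static List<String> get(String bucketUrl, String path, String localFile) {
        return Arrays.asList("s3cmd", "get", bucketUrl + "/" + path, localFile);
    }
    static List<String> put(String localFile, String bucketUrl, String path) {
        return Arrays.asList("s3cmd", "put", localFile, bucketUrl + "/" + path);
    }
    static List<String> delete(String bucketUrl, String path) {
        return Arrays.asList("s3cmd", "del", bucketUrl + "/" + path);
    }
    static List<String> list(String bucketUrl, String path) {
        return Arrays.asList("s3cmd", "ls", bucketUrl + "/" + path + "/");
    }
}
```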
Finally, it is worth adding a word about why this design approach seems best.
For example, couldn't we just read and write directly to a backing store
itself? This would be problematic for the following reasons:
1.) Local processes still need to operate on local files. This design allows us
to reuse the robust read/write operations that have already been implemented.
2.) Copying to and from local files leads to problems with local files getting
out of sync with the backing store in ways that are difficult to anticipate.
By formalizing the cache/backing store semantics and in particular failure
handling we can eliminate such problems.
3.) Reads and metadata operations like testing existence on the backing store
are potentially extremely slow compared to operations on local storage (e.g.,
by a factor of 1000x). The cache/backing store design allows optimization of
read-only operations which is beneficial for high performance.
4.) Each backing store is somewhat different and writing directly to the store
means there would be substantial code particular to each store type, such as S3
or HDFS accessed via hadoop commands. This would complicate testing. The
proposed design allows us to factor the specific backing store operations into
a set of simple primitives that are easy to test. The overall backing store
logic can be tested using local file systems. This is vastly more
testable--correctness can be largely ensured through unit tests.
Q.E.D.
Original comment by robert.h...@continuent.com
on 12 Jul 2014 at 5:39
Please clarify a few points on how the synchronization of local files is done
to the backing store:
RH> * A request to read a file or its properties (e.g. whether it is readable)
will cause a fetch from the backing store to local storage if needed.
How is "if needed" determined exactly?
RH> * A request to write a file will write the file to the backing store.
Is the write asynchronous or will the client need to wait until it has been
written to the backing store, which can take 1000x more time?
Original comment by linas.vi...@continuent.com
on 14 Jul 2014 at 3:01
Good questions.
1. Fetch from backing storage on read. My plan is to fetch only if we have not
tried to fetch previously. This is why the single-writer assumption is
important--if we fetched the file and it is not there *and* we did not write it
ourselves, there is no file, hence no need to look on the backing store. The
list of paths that have already been fetched will be stored in an
IndexedLRUCache instance.
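The fetch-once bookkeeping might look roughly like this. IndexedLRUCache's actual API is not shown in this thread, so this stand-in uses a bounded, access-ordered LinkedHashMap to get equivalent LRU behavior; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "fetch only if we have not tried to fetch previously".
class FetchTracker {
    private final Map<String, Boolean> fetched;

    FetchTracker(final int capacity) {
        // Access-ordered map that evicts the least recently used path
        // once capacity is exceeded, mimicking an LRU cache.
        fetched = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a path is seen; later calls return
    // false, so the caller can skip the (slow) backing-store fetch.
    boolean shouldFetch(String path) {
        return fetched.put(path, Boolean.TRUE) == null;
    }

    // A local write also counts as "known". Under the single-writer
    // assumption we wrote the file ourselves, so no fetch is needed.
    void markWritten(String path) {
        fetched.put(path, Boolean.TRUE);
    }
}
```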
2. Writing to backing store. This write is synchronous for now to ensure the
cache and store do not get out of sync. It's expensive--writing a file to HDFS
using hadoop can take well over a second due to process start up, network
connectivity, etc. If it's asynchronous we have to beware of problems with
ordering and failures that could lead to hard-to-understand inconsistencies.
For now we will assume that the cost of writes is not a big issue because they
do not occur very often. That meets requirements of the current use case for
persistently storing restart points for Hadoop loading.
Original comment by robert.h...@continuent.com
on 14 Jul 2014 at 6:05
In effect, each transaction will have a delay as it updates the current seqno?
Original comment by linas.vi...@continuent.com
on 15 Jul 2014 at 7:58
That is correct. The file-based seqno is only used currently in batch loading
cases, where the extra write overhead is a small cost on top of writing to HDFS
or S3.
For instance, suppose we are replicating into Hadoop using the hadoop.js merge
script. Let's say that the batch commit interval is 5 minutes and we load
transactions for 100 tables. There will be one write to the backing store to
update the commit seqno. This adds 1 second or so of processing time on top of
the 5 minutes + load time, hence around 0.3% overhead. (Reads on the other
hand are off the local file system after the first fetch operation.)
This could obviously be abused if callers to the FileIo implementation assume
writes are cheap. That would never be the case, however, if you were storing
data on S3 or HDFS accessed via the hadoop command. Obviously this should be
documented properly in Javadoc.
Original comment by robert.h...@continuent.com
on 15 Jul 2014 at 12:09
Original comment by linas.vi...@continuent.com
on 11 Aug 2014 at 1:21
Original comment by linas.vi...@continuent.com
on 19 Dec 2014 at 7:03
Original comment by linas.vi...@continuent.com
on 19 Jan 2015 at 2:18
Original issue reported on code.google.com by
robert.h...@continuent.com
on 16 May 2014 at 4:58