Here is a detailed design that solves the problem of storing files permanently
in a backing store. This design generalizes the current classes used to access
files, which are located in package com.continuent.tungsten.common.file and
contain implementations of the FileIo interface, a set of general operations on
files and directories. Clients principally use class JavaFileIo, which implements
operations on local files.
A new FileIo implementation will be provided that builds on the current JavaFileIo
class. In the new implementation local files are considered to be a cache for
files stored in a backing store, which contains permanent copies. (See
http://en.wikipedia.org/wiki/Cache_(computing) for general information on
cache/backing store design.) Clients of the FileIo class continue to operate
on local files as before, with the following variations:
1.) Files are tied to a backing store, which links backing storage to the local
file system at a mount point. Any local file below the mount point is
considered to have a corresponding copy in the backing store. The local file,
if present, is a cached copy of that file.
2.) The new FileIo implementation will mirror the contents of the backing store
to local storage as needed using a set of primitive operations on the backing
storage that amount to get, put, delete, and list commands.
* A request to read a file or its properties (e.g. whether it is readable) will
cause a fetch from the backing store to local storage if needed.
* A request to list a path location will cause a fetch of the corresponding
backing store file or directory.
* A request to read directory properties (readable, exists) will cause the
directory to be created locally.
* A request to write a file will write the file to the backing store.
* A request to remove a file will cause the file to be removed on the backing
store.
* A request to remove a directory will remove the directory and any children in
the backing store.
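The mirroring behavior above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Tungsten code: BackingStore, CachingFileIo, and InMemoryStore are made-up names standing in for the FileIo interface, its new implementation, and a test double; the real interface has many more operations.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The four primitives the backing store must support (illustrative names).
interface BackingStore {
    byte[] get(String path) throws IOException;
    void put(String path, byte[] data) throws IOException;
    void delete(String path) throws IOException;
    List<String> list(String prefix) throws IOException;
}

// Sketch of the cache-on-read / write-through behavior described above.
class CachingFileIo {
    private final Path mountPoint;      // local cache root (the mount point)
    private final BackingStore store;   // permanent copies live here

    CachingFileIo(Path mountPoint, BackingStore store) {
        this.mountPoint = mountPoint;
        this.store = store;
    }

    // A read fetches from the backing store into the local cache first
    // if no local copy exists.
    byte[] read(String relativePath) throws IOException {
        Path local = mountPoint.resolve(relativePath);
        if (!Files.exists(local)) {
            byte[] data = store.get(relativePath);  // fetch on cache miss
            Files.createDirectories(local.getParent());
            Files.write(local, data);               // populate the cache
        }
        return Files.readAllBytes(local);
    }

    // A write goes through to the backing store as well as the cache.
    void write(String relativePath, byte[] data) throws IOException {
        Path local = mountPoint.resolve(relativePath);
        Files.createDirectories(local.getParent());
        Files.write(local, data);
        store.put(relativePath, data);              // synchronous write-through
    }
}

// Trivial in-memory store: the overall cache logic can be unit tested
// against this without any real S3/HDFS access.
class InMemoryStore implements BackingStore {
    private final Map<String, byte[]> files = new HashMap<>();
    public byte[] get(String path) throws IOException {
        byte[] d = files.get(path);
        if (d == null) throw new IOException("no such file: " + path);
        return d;
    }
    public void put(String path, byte[] data) { files.put(path, data); }
    public void delete(String path) { files.remove(path); }
    public List<String> list(String prefix) {
        List<String> out = new ArrayList<>();
        for (String p : files.keySet())
            if (p.startsWith(prefix)) out.add(p);
        return out;
    }
}
```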
3.) The new FileIo implementation will make the simplifying assumption that
there is only a single writer to the backing store. This allows optimizations
such as not fetching files multiple times if there have been no intervening
writes.
4.) Due to failures in distributed systems it will be possible to end up with
two writers on a particular backing store location. This can lead to
corruption of file system contents. This will be solved as follows using a
leasing approach that establishes a temporary lock on the file system.
a.) The head directory on the backing store will have a unique lease file that
contains the following information:
* client identifier -- A unique identifier such as host name + pid.
* lease duration -- e.g. 5 minutes.
* lease timestamp -- This may also be inferred from the time the file is
written to the backing store.
* crc -- Crc of preceding information to detect corruption.
b.) Each time the FileIo implementation reads or writes to the backing store it
will check the lease file. If the file is present but belongs to another
client *and* the lease has not expired, an error will be thrown. Due to the
leasing semantics the problem of client failures will be cured automatically as
soon as the lease expires.
c.) Otherwise it will update the lease file *and* read it back confirming the
crc to mitigate race conditions from two distributed clients updating the lease
simultaneously.
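A minimal sketch of the lease record and the expiry/ownership check from steps a.) and b.), assuming a simple comma-separated file layout with a trailing CRC (the actual on-disk format is unspecified here; all names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical lease record matching the fields listed in step a.).
class Lease {
    final String clientId;      // unique identifier, e.g. host name + pid
    final long durationMillis;  // lease duration, e.g. 5 minutes
    final long timestamp;       // when the lease file was written

    Lease(String clientId, long durationMillis, long timestamp) {
        this.clientId = clientId;
        this.durationMillis = durationMillis;
        this.timestamp = timestamp;
    }

    // Serialize with a trailing CRC so corruption can be detected when
    // the file is read back (step c.).
    String serialize() {
        String body = clientId + "," + durationMillis + "," + timestamp;
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.UTF_8));
        return body + "," + crc.getValue();
    }

    static Lease parse(String text) {
        int cut = text.lastIndexOf(',');
        String body = text.substring(0, cut);
        long expected = Long.parseLong(text.substring(cut + 1));
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.UTF_8));
        if (crc.getValue() != expected)
            throw new IllegalStateException("lease file corrupted");
        String[] f = body.split(",");
        return new Lease(f[0], Long.parseLong(f[1]), Long.parseLong(f[2]));
    }

    // Step b.): an error is warranted only if another client holds the
    // lease *and* it has not yet expired.
    boolean blocks(String ourClientId, long now) {
        return !clientId.equals(ourClientId)
                && now < timestamp + durationMillis;
    }
}
```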
5.) Failures in accessing the backing store can cause loss of synchronization
in the following ways:
* The local store can be ahead of the backing store if a write fails to
propagate up to the backing store. In this case we should discard the local
store.
* The local store can be behind the backing store. In this case it is enough
to refetch data.
In practice it may not be easy to distinguish these cases, so the first
implementation will simply clear the local store whenever it starts to avoid
ambiguities.
6.) This design allows one to implement backing store logic using simple
command line tools like hadoop and s3cmd. All you need is a utility that can
put, fetch, and delete files, as well as list "directory" contents on the
backing store.
7.) The user interface additions visible to replicator end users will be very
simple. Users will supply an extra URL for the backing store location in
replicator configuration files.
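As noted above, the backing store primitives can be built on simple command line tools. A hypothetical sketch of how the four primitives might map onto s3cmd invocations (the class and method names are illustrative; an HDFS variant would build "hadoop fs" commands instead, and the resulting command lists would be run via ProcessBuilder):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative mapping of the put/fetch/delete/list primitives onto the
// standard s3cmd verbs (get, put, del, ls). Only command construction is
// shown here; executing and checking exit codes is omitted.
class S3cmdCommands {
    static List<String> get(String bucketUrl, String path, String localFile) {
        return Arrays.asList("s3cmd", "get", bucketUrl + "/" + path, localFile);
    }
    static List<String> put(String localFile, String bucketUrl, String path) {
        return Arrays.asList("s3cmd", "put", localFile, bucketUrl + "/" + path);
    }
    static List<String> delete(String bucketUrl, String path) {
        return Arrays.asList("s3cmd", "del", bucketUrl + "/" + path);
    }
    static List<String> list(String bucketUrl, String path) {
        return Arrays.asList("s3cmd", "ls", bucketUrl + "/" + path + "/");
    }
}
```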
Finally, it is worth adding a word about why this design approach seems best.
For example, couldn't we just read and write directly to a backing store
itself? This would be problematic for the following reasons:
1.) Local processes still need to operate on local files. This design allows us
to reuse the robust read/write operations that have already been implemented.
2.) Copying to and from local files leads to problems with local files getting
out of sync with the backing store in ways that are difficult to anticipate.
By formalizing the cache/backing store semantics and in particular failure
handling we can eliminate such problems.
3.) Reads and metadata operations like testing existence on the backing store
are potentially extremely slow compared to operations on local storage (e.g.,
by a factor of 1000x). The cache/backing store design allows optimization of
read-only operations which is beneficial for high performance.
4.) Each backing store is somewhat different and writing directly to the store
means there would be substantial code particular to each store type, such as S3
or HDFS accessed via hadoop commands. This would complicate testing. The
proposed design allows us to factor the specific backing store operations into
a set of simple primitives that are easy to test. The overall backing store
logic can be tested using local file systems. This is vastly more
testable--correctness can be largely ensured through unit tests.
Q.E.D.
Original comment by robert.h...@continuent.com
on 12 Jul 2014 at 5:39
Please clarify a few points on how the synchronization of local files is done
to the backing store:
RH> * A request to read a file or its properties (e.g. whether it is readable)
will cause a fetch from the backing store to local storage if needed.
How is "if needed" determined exactly?
RH> * A request to write a file will write the file to the backing store.
Is the write asynchronous or will the client need to wait until it has been
written to the backing store, which can take 1000x more time?
Original comment by linas.vi...@continuent.com
on 14 Jul 2014 at 3:01
Good questions.
1. Fetch from backing storage on read. My plan is to fetch only if we have not
tried to fetch previously. This is why the single-writer assumption is
important--if we fetched the file and it is not there *and* we did not write it
ourselves, there is no file, hence no need to look on the backing store. The
list of paths that have already been fetched will be stored in an
IndexedLRUCache instance.
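The fetch-once bookkeeping might look roughly like this. IndexedLRUCache's actual API is not shown in this thread, so this stand-in uses a bounded, access-ordered LinkedHashMap to get equivalent LRU behavior; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "fetch only if we have not tried to fetch previously".
class FetchTracker {
    private final Map<String, Boolean> fetched;

    FetchTracker(final int capacity) {
        // Access-ordered map that evicts the least recently used path
        // once capacity is exceeded, mimicking an LRU cache.
        fetched = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a path is seen; later calls return
    // false, so the caller can skip the (slow) backing-store fetch.
    boolean shouldFetch(String path) {
        return fetched.put(path, Boolean.TRUE) == null;
    }

    // A local write also counts as "known". Under the single-writer
    // assumption we wrote the file ourselves, so no fetch is needed.
    void markWritten(String path) {
        fetched.put(path, Boolean.TRUE);
    }
}
```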
2. Writing to backing store. This write is synchronous for now to ensure the
cache and store do not get out of sync. It's expensive--writing a file to HDFS
using hadoop can take well over a second due to process start up, network
connectivity, etc. If it's asynchronous we have to beware of problems with
ordering and failures that could lead to hard-to-understand inconsistencies.
For now we will assume that the cost of writes is not a big issue because they
do not occur very often. That meets requirements of the current use case for
persistently storing restart points for Hadoop loading.
Original comment by robert.h...@continuent.com
on 14 Jul 2014 at 6:05
In effect, each transaction will have a delay as it updates the current seqno?
Original comment by linas.vi...@continuent.com
on 15 Jul 2014 at 7:58
That is correct. The file-based seqno is only used currently in batch loading
cases, where the extra write overhead is a small cost on top of writing to HDFS
or S3.
For instance, suppose we are replicating into Hadoop using the hadoop.js merge
script. Let's say that the batch commit interval is 5 minutes and we load
transactions for 100 tables. There will be one write to the backing store to
update the commit seqno. This adds 1 second or so of processing time on top of
the 5 minutes + load time, hence around 0.3% overhead. (Reads on the other
hand are off the local file system after the first fetch operation.)
This could obviously be abused if callers to the FileIo implementation assume
writes are cheap. That would never be the case, however, if you were storing
data on S3 or HDFS accessed via the hadoop command. Obviously this should be
documented properly in Javadoc.
Original comment by robert.h...@continuent.com
on 15 Jul 2014 at 12:09
Original comment by linas.vi...@continuent.com
on 11 Aug 2014 at 1:21
Original comment by linas.vi...@continuent.com
on 19 Dec 2014 at 7:03
Original comment by linas.vi...@continuent.com
on 19 Jan 2015 at 2:18
Original issue reported on code.google.com by
robert.h...@continuent.com
on 16 May 2014 at 4:58