keroscarel / s3backer

Automatically exported from code.google.com/p/s3backer
GNU General Public License v2.0
0 stars 0 forks source link

s3backer - FUSE-based single file backing store via Amazon S3

Overview s3backer is a filesystem that contains a single file backed by the Amazon Simple Storage Service (Amazon S3). As a filesystem, it is very simple: it provides a single normal file having a fixed size. Underneath, the file is divided up into blocks, and the content of each block is stored in a unique Amazon S3 object. In other words, what s3backer provides is really more like an S3-backed virtual hard disk device, rather than a filesystem.

In typical usage, a `normal' filesystem is mounted on top of the file exported by the s3backer filesystem using a loopback mount (or disk image mount on Mac OS X).

This arrangement has several benefits compared to more complete S3 filesystem implementations:

o By not attempting to implement a complete filesystem, which is a com- plex undertaking and difficult to get right, s3backer can stay very lightweight and simple. Only three HTTP operations are used: GET, PUT, and DELETE. All of the experience and knowledge about how to properly implement filesystems that already exists can be reused.

o By utilizing existing filesystems, you get full UNIX filesystem semantics. Subtle bugs or missing functionality relating to hard links, extended attributes, POSIX locking, etc. are avoided.

o The gap between normal filesystem semantics and Amazon S3 ``eventual consistency'' is more easily and simply solved when one can interpret S3 objects as simple device blocks rather than filesystem objects (see below).

o When storing your data on Amazon S3 servers, which are not under your control, the ability to encrypt and authenticate data becomes a crit- ical issue. s3backer supports secure encryption and authentication. Alternately, the encryption capability built into the Linux loopback device can be used.

o Since S3 data is accessed over the network, local caching is also very important for performance reasons. Since s3backer presents the equivalent of a virtual hard disk to the kernel, most of the filesys- tem caching can be done where it should be: in the kernel, via the kernel's page cache. However s3backer also includes its own internal block cache for increased performance, using asynchronous worker threads to take advantage of the parallelism inherent in the network.

Consistency Guarantees Amazon S3 makes relatively weak guarantees relating to the timing and consistency of reads vs. writes (collectively known as ``eventual consis- tency''). s3backer includes logic and configuration parameters to work around these limitations, allowing the user to guarantee consistency to whatever level desired, up to and including 100% detection and avoidance of incorrect data. These are:

  1. s3backer enforces a minimum delay between consecutive PUT or DELETE operations on the same block. This ensures that Amazon S3 doesn't receive these operations out of order.

  2. s3backer maintains an internal block MD5 checksum cache, which enables automatic detection and rejection of `stale' blocks returned by GET operations.

    This logic is configured by the following command line options: --md5CacheSize, --md5CacheTime, and --minWriteDelay.

Zeroed Block Optimization As a simple optimization, s3backer does not store blocks containing all zeroes; instead, they are simply deleted. Conversely, reads of non-exis- tent blocks will contain all zeroes. In other words, the backed file is always maximally sparse.

As a result, blocks do not need to be created before being used and no special initialization is necessary when creating a new filesystem.

When the --listBlocks flag is given, s3backer will list all existing blocks at startup so it knows ahead of time exactly which blocks are empty.

File and Block Size Auto-Detection As a convenience, whenever the first block of the backed file is written, s3backer includes as meta-data (in the ``x-amz-meta-s3backer-filesize'' header) the total size of the file. Along with the size of the block itself, this value can be checked and/or auto-detected later when the filesystem is remounted, eliminating the need for the --blockSize or --size flags to be explicitly provided and avoiding accidental mis-inter- pretation of an existing filesystem.

Block Cache s3backer includes support for an internal block cache to increase perfor- mance. The block cache cache is completely separate from the MD5 cache which only stores MD5 checksums transiently and whose sole purpose is to mitigate ``eventual consistency''. The block cache is a traditional cache containing cached data blocks. When full, clean blocks are evicted as necessary in LRU order.

Reads of cached blocks will return immediately with no network traffic. Writes to the cache also return immediately and trigger an asynchronous write operation to the network via a separate worker thread. Because the kernel typically writes blocks through FUSE filesystems one at a time, performing writes asynchronously allows s3backer to take advantage of the parallelism inherent in the network, vastly improving write performance.

The block cache can be configured to store the cached data in a local file instead of in memory. This permits larger cache sizes and allows s3backer to reload cached data after a restart. Reloaded data is veri- fied before reuse.

The block cache is configured by the following command line options: --blockCacheFile, --blockCacheNoVerify, --blockCacheSize, --blockCacheSync, --blockCacheThreads, --blockCacheTimeout, and --blockCacheWriteDelay.

Read Ahead s3backer implements a simple read-ahead algorithm in the block cache. When a configurable number of blocks are read in order, block cache worker threads are awoken to begin reading subsequent blocks into the block cache. Read ahead continues as long as the kernel continues read- ing blocks sequentially. The kernel typically requests blocks one at a time, so having multiple worker threads already reading the next few blocks improves read performance by taking advantage of the parallelism inherent in the network.

Note that the kernel implements a read ahead algorithm as well; its behavior should be taken into consideration. By default, s3backer passes the -o max_readahead=0 option to FUSE.

Read ahead is configured by the --readAhead and --readAheadTrigger com- mand line options.

Encryption and Authentication s3backer supports encryption via the --encrypt, --password, and --passwordFile flags. When encryption is enabled, SHA1 HMAC authentica- tion is also automatically enabled, and s3backer rejects any blocks that are not properly encrypted and signed.

Encrypting at the s3backer layer is preferable to encrypting at an upper layer (e.g., at the loopback device layer), because if the data s3backer sees is already encrypted it can't optimize away zeroed blocks or do meaningful compression.

Read-Only Access An Amazon S3 account is not required in order to use s3backer. The filesystem must already exist and have S3 objects with ACL's configured for public read access (see --accessType below); users should perform the looback mount with the read-only flag (see mount(8)) and provide the --readOnly flag to s3backer. This mode of operation facilitates the cre- ation of public, read-only filesystems.

Simultaneous Mounts Although it functions over the network, the s3backer filesystem is not a distributed filesystem and does not support simultaneous read/write mounts. (This is not something you would normally do with a hard-disk partition either.) s3backer does not detect this situation; it is up to the user to ensure that it doesn't happen.

Statistics File s3backer populates the filesystem with a human-readable statistics file. See --statsFilename below.

Logging In normal operation s3backer will log via syslog(3). When run with the -d or -f flags, s3backer will log to standard error.


Home page: http://s3backer.googlecode.com/


See INSTALL for installation instructions. After installing, see the s3backer(1) man page for how to run it.

See COPYING for license.

See CHANGES for change history.

Enjoy!

$Id$