liuis / leveldb

Automatically exported from code.google.com/p/leveldb
BSD 3-Clause "New" or "Revised" License

Feature request: stream into the database #89

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
For IndexedDB in Chrome, we need to be able to store large binary blobs in our 
LevelDB instance.  However, we don't want to have to hold each entire blob in 
memory at once; these could be multiple-gigabyte objects.  We're fine with 
chopping them into chunks manually and adding them under separate keys, but we 
still want all chunks to be committed in a single atomic action.

Currently WriteBatch is an in-memory construct, so even if we break up the Blob 
into chunks, it's all going to be kept in memory anyway.  I'd like to request 
that you add either a WriteBatch that spills to disk as you build it, or some 
other way to stream large data into the db.
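
For concreteness, here's roughly the shape of what we'd do today (the per-chunk 
key scheme below is just illustrative): every Put copies its chunk into the 
batch's in-memory buffer, so the entire blob is resident before the single 
atomic Write.

```cpp
// Chunked blob write with the current API: every chunk is buffered inside
// the WriteBatch, so the whole blob is in memory before the atomic commit.
#include <sstream>
#include <string>
#include <vector>

#include "leveldb/db.h"
#include "leveldb/write_batch.h"

leveldb::Status WriteBlobChunks(leveldb::DB* db, const std::string& key,
                                const std::vector<std::string>& chunks) {
  leveldb::WriteBatch batch;
  for (size_t i = 0; i < chunks.size(); i++) {
    std::ostringstream chunk_key;
    chunk_key << key << ":" << i;           // illustrative per-chunk key
    batch.Put(chunk_key.str(), chunks[i]);  // copied into the batch's buffer
  }
  return db->Write(leveldb::WriteOptions(), &batch);  // one atomic commit
}
```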

When reading it back out, we'll have the same constraints--we can't afford to 
have the whole thing in memory, and will just want to grab a limited byte range 
at a time.

Original issue reported on code.google.com by ericu@chromium.org on 8 May 2012 at 10:31

GoogleCodeExporter commented 9 years ago
Hmm... That would seem to violate one of LevelDB's design contracts: all writes 
go to Level 0, which is in memory.

Could you instead snapshot the database, do the writes individually, then toss 
away the snapshot?

Original comment by cbsm...@gmail.com on 8 May 2012 at 10:38

GoogleCodeExporter commented 9 years ago
Then we'd have to prevent anyone else from modifying any part of the database 
while we were streaming in a potentially-huge blob, which would be awkward.  
And if we crashed while doing the streaming, we'd have to manually validate 
our state and clean up the next time we started up, which is exactly what 
transactions are supposed to prevent you from having to do.

But about that contract: where is that written?  I just looked around and found 
http://leveldb.googlecode.com/svn/trunk/doc/impl.html, which doesn't say 
anything about level 0 having to be in memory.  We'll quickly blow past the 1MB 
limit for level 0 anyway.  And is that level-0 table for things that are 
already committed, or things being prepped for commit [in a WriteBatch like 
what I'm talking about]?

Original comment by ericu@chromium.org on 8 May 2012 at 10:54

GoogleCodeExporter commented 9 years ago
For Chromium's IndexedDB implementation we could construct multiple leveldb 
WriteBatches per IDB transaction; we just don't currently do so.  I don't 
believe leveldb would be limiting us here.

Original comment by jsb...@google.com on 9 May 2012 at 12:29

GoogleCodeExporter commented 9 years ago
My bad. It's not Level-0, but the memtable. If your records are that big they'd 
effectively cause churning on the memtables. One could conceivably have an API 
that bypasses the memtables, but you'd still have pretty massive churn as it 
seems likely you'd trigger compactions of Level-0 and Level-1, if not Level-2. 
The net effect of all this might not be that different from locking down 
updates to the database while you stuff in this giant record.

Would you then pass in these bits of memory as mmap'd memory regions to avoid 
them necessarily being pulled into RAM?

It almost seems like after a certain size, you would be much better off having 
a blob store which is outside of LevelDB: write to the filesystem and merely 
store the path to the file in LevelDB. This is something that could be 
constructed as a layer on top of LevelDB. Just a thought.

Original comment by cbsm...@gmail.com on 9 May 2012 at 1:28

GoogleCodeExporter commented 9 years ago
Re: mmap--It's probably not necessary to make the API that complicated.  We'd 
just keep the chunks reasonably small.

We'd like to avoid having a separate blob store.  The goal is to have 
everything live in a single database for atomicity of transactions.

Original comment by ericu@google.com on 9 May 2012 at 2:51

GoogleCodeExporter commented 9 years ago
> Currently WriteBatch is an in-memory construct, so even if we break up the
> Blob into chunks, it's all going to be kept in memory anyway. I'd like to
> request that you add either a WriteBatch that spills to disk as you build
> it, or some other way to stream large data into the db.

This will add a lot of complication to the leveldb implementation.

> Re: mmap--It's probably not necessary to make the API that complicated.
>  We'd just keep the chunks reasonably small.

That may be true for your project, but perhaps other users of leveldb
will want the decision to be made differently.  I am disinclined to add
complications and options to leveldb if there is a reasonable way for
the application to achieve what it wants at a higher level.

> We'd like to avoid having a separate blob store.  The goal is to have
> everything live in a single database for atomicity of transactions.

You can do something fairly straightforward without any leveldb
changes.  Here is a sketch of one such implementation.  Suppose you
want to write blob B under key K.  Also suppose that B is split up
into n reasonably sized chunks B1..Bn.

  (1) First store blob chunks directly in the filesystem, one chunk per
  file, with a unique id used to determine the file name Fi for each chunk.

  (2) Use a normal leveldb op to store the filenames F1..Fn under key
  K.  You can atomically commit multiple blobs here if you need to.
  Also, to speed up step (3), in a small reserved section of the
  keyspace, record a mapping like this
    file:K,F => S
  where K is the key, F is the file name, and S is the file size.

  (3) Do a periodic garbage collection pass:
  Delete all files F in the file system unless you can find
  an entry of the form "file:K,F=>S" in leveldb and the corresponding
  leveldb entry for K contains F.

  (4) Augment your calls to DB::GetApproximateSizes() so that when
  computing the size for range [A..B), you scan over the range
  [file:A, .. file:B,) and add up the sizes you find.
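
Here is a rough, untested C++ sketch of steps (1)-(3).  The key layout, file 
naming, and helper names are only for illustration, and error handling is 
abbreviated.

```cpp
// Sketch of the blob-store layering described in steps (1)-(3).  Assumes a
// leveldb::DB* is already open; the key layout and helpers are illustrative.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// (1) Write one chunk to its own file; returns the file name Fi.
std::string WriteChunkFile(const std::string& dir, uint64_t unique_id,
                           const std::string& chunk) {
  std::ostringstream name;
  name << dir << "/chunk-" << unique_id;
  std::ofstream out(name.str().c_str(), std::ios::binary);
  out.write(chunk.data(), chunk.size());
  return name.str();
}

// (2) Atomically commit the blob metadata: the chunk file list under K plus
// one "file:K,F => S" entry per chunk, used for GC and size estimation.
leveldb::Status CommitBlob(leveldb::DB* db, const std::string& key,
                           const std::vector<std::string>& files,
                           const std::vector<uint64_t>& sizes) {
  leveldb::WriteBatch batch;
  std::string file_list;
  for (size_t i = 0; i < files.size(); i++) {
    file_list.append(files[i]);
    file_list.push_back('\n');
    std::ostringstream meta_key, meta_val;
    meta_key << "file:" << key << "," << files[i];
    meta_val << sizes[i];
    batch.Put(meta_key.str(), meta_val.str());
  }
  batch.Put(key, file_list);  // the value under K is just the file name list
  return db->Write(leveldb::WriteOptions(), &batch);
}

// (3) Periodic GC: keep a chunk file only if some "file:K,F" entry exists
// and the entry for K still lists F.  all_files is the directory listing of
// the chunk directory, obtained elsewhere.
void GarbageCollect(leveldb::DB* db,
                    const std::vector<std::string>& all_files) {
  std::set<std::string> live;
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek("file:"); it->Valid() && it->key().starts_with("file:");
       it->Next()) {
    std::string k = it->key().ToString().substr(5);  // "K,F"
    size_t comma = k.rfind(',');
    if (comma == std::string::npos) continue;
    std::string user_key = k.substr(0, comma);
    std::string file = k.substr(comma + 1);
    std::string file_list;
    // A real implementation would use an exact, delimiter-safe match here.
    if (db->Get(leveldb::ReadOptions(), user_key, &file_list).ok() &&
        file_list.find(file) != std::string::npos) {
      live.insert(file);
    }
  }
  delete it;
  for (size_t i = 0; i < all_files.size(); i++) {
    if (live.count(all_files[i]) == 0) std::remove(all_files[i].c_str());
  }
}
```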

By storing the large values directly in the file system, you avoid
paying any compaction overhead where multi-GB values get rewritten a
few times and/or cause lots of background compaction work.

Some possible improvements that may or may not be worthwhile:

(a) Use the SHA1 hash of the chunk contents for the filename so
that if the same value is written multiple times, you just need
to store one file.  GC will become marginally more complicated.

(b) You can keep a copy of the filenames and associated keys in memory
so that you can delete a file immediately after overwriting or
deleting a key.  Though you will still need a GC phase in case your
process crashes just after the key overwrite/deletion completes.

(c) You can wrap this whole logic in a brand new class that implements
the leveldb::DB interface so that the code that calls leveldb does not
need to change.

(d) If you really wish, you could store the blob chunks in a reserved
key range in leveldb instead of in the file system.  Just be careful
that when you garbage collect, you can do so without having to pull
the blob chunks into memory.
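
For (d), one possible shape (untested; the reserved prefix, metadata format, 
and helper names are made up): each chunk goes in under the reserved prefix in 
its own small write, and the metadata record under K is the commit point, so 
no single WriteBatch ever has to hold the whole blob.  Orphaned chunks left by 
a crash can be garbage collected by key alone, without reading their values.

```cpp
// Sketch for (d): blob chunks live in leveldb under a reserved key prefix.
// Chunks are written individually; the final Put under K makes the blob live.
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

#include "leveldb/db.h"

// Assumes the application reserves this prefix for chunk keys.
static const char kChunkPrefix[] = "blobchunk:";

leveldb::Status WriteChunkedBlob(leveldb::DB* db, const std::string& key,
                                 uint64_t blob_id,
                                 const std::vector<std::string>& chunks) {
  for (size_t i = 0; i < chunks.size(); i++) {
    std::ostringstream chunk_key;
    chunk_key << kChunkPrefix << blob_id << ":" << i;
    leveldb::Status s =
        db->Put(leveldb::WriteOptions(), chunk_key.str(), chunks[i]);
    if (!s.ok()) return s;  // orphaned chunks get cleaned up by GC later
  }
  // Commit point: record the blob id and chunk count under K.
  std::ostringstream meta;
  meta << blob_id << ":" << chunks.size();
  return db->Put(leveldb::WriteOptions(), key, meta.str());
}

// Read back one chunk at a time; the whole blob never has to be in memory.
leveldb::Status ReadChunk(leveldb::DB* db, uint64_t blob_id, size_t index,
                          std::string* chunk) {
  std::ostringstream chunk_key;
  chunk_key << kChunkPrefix << blob_id << ":" << index;
  return db->Get(leveldb::ReadOptions(), chunk_key.str(), chunk);
}
```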

Original comment by san...@google.com on 9 May 2012 at 5:00

GoogleCodeExporter commented 9 years ago
Glad to see you say this Sanjay!

LevelDB's persistence model is really in conflict with such large data values 
anyway. IndexedDB, being an API rather than an implementation, requires a more 
complex interface: it is more of a document store (something more like 
CouchBase) than a record store (something more like LevelDB), but could serve 
as either. Looking at the IndexedDB API, you would be much better served by 
storing values above a certain size in a blob store (a cutoff in the 1K-4K 
range, I should think, though a not unreasonable approach would be to make the 
cutoff effectively 0 bytes and store only indexes/keys inside LevelDB). Using 
a separate streaming API where you break things up into chunks is going to add 
far more complexity to your code anyway.

I'd actually be tempted to suggest you don't even split the data up: just store 
the data as a single blob. Reserve a lead byte (or bit) in your value records 
to indicate whether you have an external blob (in which case the rest of the 
value is the path) or an internally stored record (in which case the rest of 
the value is the actual data). You can potentially avoid the need for Sanjay's 
garbage collection if you write initially to a tempfile, rename/hard link after 
inserting a LevelDB record, and treat records pointing to non-existent paths as 
"never completed commit".

The slice interface LevelDB provides should allow internally stored records to 
still be accessed quite efficiently, and would avoid significant performance 
complications stemming from the large records. I think you could even store the 
blobs in an appropriately named subdirectory inside the LevelDB database 
directory.

You're going to need something like this indirection mechanism for your indexed 
fields anyway (perhaps for efficiency with these you'd store both the path and 
an offset into the file for the attribute in question).

Original comment by cbsm...@gmail.com on 9 May 2012 at 8:50