junneyang / zumastor

Automatically exported from code.google.com/p/zumastor
0 stars 1 forks source link

Allow using more compact btree nodes for < 16TB snapstores #140

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Right now we use 64 bit fields in the btree
so we can handle very large snapshot stores
This is a waste for small (<16 TB) snapshot 
stores.  Let's support a compact format
for these smaller snapstores.

This would reduce the impact of bug 5 somewhat.

Original issue reported on code.google.com by daniel.r...@gmail.com on 15 May 2008 at 11:14

GoogleCodeExporter commented 9 years ago
I would appreciate feedback on this approach to resolving issue 140.

Instead of using an additional format that uses 32bit values (which could
only be used on small volumes), I suggest an alternate format that will
support both 32bit and 64bit fields in a space efficient manner.

B-tree internal nodes must encode an array of pairs (keys and disk
addresses of other nodes).  Either may need to be >32bits for sufficiently
large backing devices or origin volumes.  If we steal 2 bits from each
32bit word, we can encode a pair in 8 to 16 bytes, depending on the values
encoded.  For each 32bit word, call the MSB bit the start of record bit,
and the next MSB bit the long field bit.  Using these 2 bits, we can encode
any pair of 30bit and 60bit values as such:

          0        32       64       96
u32 u32   10...    00...
u32 u64   10...    01...    00...
u64 u32   11...    00...    00...
u64 u64   11...    00...    01...    00...

The values stored in these fields are sector and chunk offsets, but as far
as I can tell, ddsnapd has to represent these values as byte offsets at
some point. So at most 55bits are useful.  This encoding provides 60bits
of storage, so does not impose additional limits.

Of course there are a couple drawbacks to this approach, but I believe they
are outweighed by the space savings. It is an on-disk format after all.
Specifically, values can no longer simply copied from disk and used, they
must be decoded.  Also, values are not necessarily aligned to their size,
but that will be corrected in the decode phase.  Note however, that this
format still allows for a binary search over the data - instead of dividing
the records in half, divide the memory in half and search for the record
(word with MSB set) closest to the center.

With this in mind I propose the following structure for b-tree nodes:

struct enode {
    u64 first_sector;
    u32 free_bytes;
    u32 child_entries[];
};

Where child_entries are encoded the above scheme.

The array of struct exception's in the leaf chunk can also be updated to
use the encoding.

--
Steve

Original comment by vand...@gmail.com on 2 Jun 2008 at 10:23