mediachain / concat

Mediachain daemons

Compound Statements for Large Datasets #62

Closed · vyzo closed 7 years ago

vyzo commented 7 years ago

For v1.0 we only support publishing simple statements.

The capacity measurements in #48 indicate that each statement costs around 1K, which makes the statement db almost 3 times larger than the datastore. This is not unexpected; we anticipated the need for compressing statements (and amortizing signature costs) in the form of compound statements.

Compound statements allow multiple simple statement bodies to be grouped together in a single statement. This reduces the space consumption of the statement db at the cost of object retrieval granularity: we can no longer retrieve a single statement body in a query by wki, and merging a compound statement fetches the metadata for all objects in the group.
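To make the grouping concrete, here's a minimal sketch of the idea in Go -- the types are illustrative stand-ins, not concat's actual statement definitions (the real ones live in the protobuf schema):

package main

import "fmt"

// SimpleStatement carries the metadata for a single object
// (illustrative fields only).
type SimpleStatement struct {
	Object string   // object id
	Refs   []string // wkis referring to the object
}

// CompoundStatement groups several simple bodies under one signed
// envelope, amortizing the fixed per-statement cost (id, publisher,
// namespace, timestamp, signature) across the group.
type CompoundStatement struct {
	Body []SimpleStatement
}

// group packs bodies into compound statements of size c.
func group(bodies []SimpleStatement, c int) []CompoundStatement {
	var out []CompoundStatement
	for i := 0; i < len(bodies); i += c {
		end := i + c
		if end > len(bodies) {
			end = len(bodies)
		}
		out = append(out, CompoundStatement{Body: bodies[i:end]})
	}
	return out
}

func main() {
	bodies := make([]SimpleStatement, 25)
	fmt.Println(len(group(bodies, 10)), "compound statements") // 3
}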

Let's roughly analyze the space cost of the statement db to see the potential capacity gain from a modest grouping of 10 statements, one that doesn't overwhelm the client.

The statement db has 3 tables:

By using compound statements of 10, the cost per object is reduced to:

So the amortized cost is about 370 bytes/object, almost a 3-fold decrease in our space requirements. This should bring the statement db storage down to about the same level as the datastore, and increase our capacity accordingly (for dual SSD setups).

Note that there is little point in increasing the compound statement density further -- a 100-statement grouping saves maybe another 60 bytes/object at the cost of 10x the client overhead. Beyond that, the gains from further compression are negligible.
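For intuition, a rough cost model consistent with these numbers (F and b are back-of-envelope fits, not measured constants): if each statement carries a fixed overhead F for the envelope and signature plus a per-body cost b, then the amortized cost at group size c is

cost(c) ≈ F/c + b bytes/object

With F ≈ 600 and b ≈ 310 this gives cost(1) ≈ 910 (the ~1K above), cost(10) ≈ 370, and cost(100) ≈ 316 -- so going from c=10 to c=100 only shaves off about 0.09*F ≈ 55 bytes/object.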

parkan commented 7 years ago

Hmm what do you think is the added complexity of going up to a 100+ statement group size, versus 10? I find it a bit awkward to arbitrarily pick a size like 10 for something that's actually committed to the store.

Also, let's keep in mind the signing overheads, not just space savings.

vyzo commented 7 years ago

Sorry, I meant client overhead (edited). The group size issue is that you will have to fetch (and merge) 100 objects when you may only be interested in 1.

And yes, signing overheads are important -- plus there is the verification cost, which is reduced by a factor of the group size when merging large sets.
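To see why verification amortizes, a minimal sketch assuming one signature over the serialized group (illustrative Go using crypto/ed25519; concat's actual signing uses its own key and envelope types):

package main

import (
	"crypto/ed25519"
	"encoding/json"
	"fmt"
)

// signedCompound is a hypothetical envelope: one signature covers every
// body in the group, so verification cost per object shrinks by the
// group size c.
type signedCompound struct {
	Bodies    []string // simplified stand-ins for statement bodies
	Signature []byte
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(nil) // nil reader = crypto/rand

	bodies := []string{"objA", "objB", "objC"}
	payload, _ := json.Marshal(bodies)

	sc := signedCompound{Bodies: bodies, Signature: ed25519.Sign(priv, payload)}

	// one Verify call regardless of how many bodies the group holds
	ok := ed25519.Verify(pub, payload, sc.Signature)
	fmt.Println("verified", len(sc.Bodies), "bodies with one check:", ok)
}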

vyzo commented 7 years ago

https://github.com/mediachain/aleph/pull/48 implements the requisite support in aleph.

vyzo commented 7 years ago

There is a small ingestion performance improvement with compound statements. Ingestion time for a 4mm batch from flickr:

real    12m30.530s
user    82m32.192s
sys     0m51.676s

which is down about a minute.

vyzo commented 7 years ago

Ingesting the 60mm flickr data set with c=10 results in the following space usage:

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
25G     /mnt/ssd1/stmt
22G     /mnt/ssd2/data

$ mcclient query "SELECT COUNT(*) FROM *"
6000000

So we need about 4K per compound statement, with an amortized cost of about 417 bytes/object.
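For reference, the arithmetic (du --si counts powers of 1000, and 60mm objects at c=10 make the 6,000,000 statements above):

25e9 bytes / 6,000,000 statements ≈ 4170 bytes/statement
4170 bytes / 10 objects ≈ 417 bytes/object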

vyzo commented 7 years ago

Ingesting with c=100:

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
19G     /mnt/ssd1/stmt
22G     /mnt/ssd2/data

$ mcclient query "SELECT COUNT(*) FROM *"
600000

So with compound statements of 100 objects it comes to about 317 bytes/object.
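Same arithmetic as above: 19e9 bytes / 600,000 statements ≈ 31.7K per compound statement, and 31.7K / 100 bodies ≈ 317 bytes/object.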

vyzo commented 7 years ago

We can expect another 50 bytes/object from the schema object dependency.

parkan commented 7 years ago

c=100 seems like a nice balance, feel good moving forward w/that

vyzo commented 7 years ago

Ingestion with schema/c=100:

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
22G     /mnt/ssd1/stmt
23G     /mnt/ssd2/data

So it's indeed an extra 50 bytes/object, up to 367 bytes/object (22e9 bytes / 60mm objects ≈ 367).