Closed vyzo closed 7 years ago
Hmm what do you think is the added complexity of going up to a 100+ statement group size, versus 10? I find it a bit awkward to arbitrarily pick a size like 10 for something that's actually committed to the store.
Also, let's keep in mind the signing overheads, not just space savings.
Sorry, I meant client overhead (edited). The group size issue is that you will have to fetch (and merge) 100 objects when you may only be interested in 1.
And yes, signing overheads are important -- plus there is the verification cost, which is reduced by a factor of the group size when merging large sets.
https://github.com/mediachain/aleph/pull/48 implements the requisite support in aleph.
There is a small ingestion performance improvement with compound statements. Ingestion time for 4mm batch from flickr:
real 12m30.530s
user 82m32.192s
sys 0m51.676s
which is down about a minute.
Ingesting the 60mm flickr data set with c=10 results in the following space usage:
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
25G /mnt/ssd1/stmt
22G /mnt/ssd2/data
$ mcclient query "SELECT COUNT(*) FROM *"
6000000
So we need about 4K per compound statement, with an amortized cost of about 417 bytes/object.
Ingesting with c=100:
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
19G /mnt/ssd1/stmt
22G /mnt/ssd2/data
$ mcclient query "SELECT COUNT(*) FROM *"
600000
So with compound statements of 100 objects it comes to about 317 bytes/object.
We can expect another 50 bytes/object from the schema object dependency.
c=100 seems like a nice balance, feel good moving forward w/that
Ingestion with schema/c=100:
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
22G /mnt/ssd1/stmt
23G /mnt/ssd2/data
So it's indeed 50 bytes/object, up to 367 bytes/object.
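For anyone who wants to re-check the arithmetic behind the per-object figures in these comments, here is a minimal sketch. It assumes `du --si` reports SI units (1G = 10^9 bytes) and that the data set holds exactly 60 million objects; the names `runs` and `bytes_per_object` are made up for illustration and are not part of mcclient or aleph.

```python
# Sizes as reported by `du --si`, which uses SI units (assumed: 1G = 10**9 bytes).
GB = 10**9
OBJECTS = 60_000_000  # the 60mm flickr data set (assumed to be exactly 60 million)

# (label, statement db size in bytes, statement count or None when not reported)
runs = [
    ("c=10", 25 * GB, 6_000_000),
    ("c=100", 19 * GB, 600_000),
    ("schema/c=100", 22 * GB, None),
]


def bytes_per_object(stmt_db_bytes: int, objects: int = OBJECTS) -> float:
    """Amortized statement db cost per ingested object."""
    return stmt_db_bytes / objects


for label, size, count in runs:
    line = f"{label}: {bytes_per_object(size):.0f} bytes/object"
    if count is not None:
        # Average on-disk size of one compound statement.
        line += f", {size / count:.0f} bytes/compound statement"
    print(line)
```

This reproduces the roughly 417, 317 and 367 bytes/object quoted above, and the roughly 4K per compound statement at c=10.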
For v1.0 we only support publishing simple statements.
The capacity measurements in #48 indicate that the cost of each statement is around 1K, which results in the statement db being almost 3 times larger than the datastore. This is not unexpected; we anticipated the need for compressing statements (and amortizing signature costs) in the form of compound statements.
Compound statements allow multiple simple statement bodies to be grouped together in a single statement. This allows us to reduce the space consumption of the statement db, at the cost of object retrieval granularity: we can no longer retrieve a single statement body in a query by wki, while merging a compound statement will fetch the metadata for all objects in the group.
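The granularity trade-off can be illustrated with a small sketch. Everything here is hypothetical: the `SimpleBody` and `CompoundStatement` types, their fields, and the `stmt:N` ids are invented for illustration and do not reflect the actual statement schema. The point is only that a lookup by wki resolves to a whole group, so merging pulls the metadata for every object in it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SimpleBody:
    wki: str          # well-known identifier of the object
    object_ref: str   # hypothetical reference to the metadata object


@dataclass(frozen=True)
class CompoundStatement:
    statement_id: str                # hypothetical id, one per group
    bodies: Tuple[SimpleBody, ...]   # up to c simple statement bodies


def group(bodies: List[SimpleBody], c: int) -> List[CompoundStatement]:
    """Pack simple statement bodies into compound statements of at most c bodies."""
    if c < 1:
        raise ValueError("group size must be at least 1")
    return [
        CompoundStatement(f"stmt:{i // c}", tuple(bodies[i:i + c]))
        for i in range(0, len(bodies), c)
    ]


def index_by_wki(statements: List[CompoundStatement]) -> Dict[str, CompoundStatement]:
    """Map each wki to the compound statement that contains it."""
    return {b.wki: s for s in statements for b in s.bodies}


def refs_fetched_for(wki: str, index: Dict[str, CompoundStatement]) -> List[str]:
    """Object references a client must fetch when merging the statement for one wki."""
    return [b.object_ref for b in index[wki].bodies]


bodies = [SimpleBody(f"wki:{n}", f"obj:{n}") for n in range(25)]
statements = group(bodies, c=10)
index = index_by_wki(statements)
print(len(statements))                       # 25 bodies in groups of 10
print(len(refs_fetched_for("wki:7", index))) # objects fetched for a single wki
```

With c=10 the client fetches 10 objects to get at one; with c=100 it would fetch 100, which is the client overhead discussed in the comments.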
Let's roughly analyze the space cost of the statement db in order to appreciate the potential capacity increase by using a modest grouping of 10 statements that doesn't overwhelm the client.
The statement db has 3 tables:
By using compound statements of 10, the cost per object is reduced to:
So the amortized cost is about 370 bytes/object, which is almost a 3-fold decrease in our space requirements. This should bring the statement db storage down to about the same level as the datastore, and increase our capacity accordingly (for dual SSD setups).
Note that there is little point in further increasing the compound statement density -- a 100-statement grouping will save maybe another 60 bytes/object at the cost of 10x the client overhead. The gains from any further compression are negligible.
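The diminishing returns can be sketched with a simple model. This is an assumption, not the analysis used in this issue: it supposes the per-object cost is a fixed per-statement overhead divided by the group size c, plus a constant per-body cost, and fits those two numbers to the measured figures reported in the comments (about 417 bytes/object at c=10 and 317 at c=100) rather than to the estimates above. The function name `fit_amortized_model` is made up for illustration.

```python
def fit_amortized_model(c1: int, cost1: float, c2: int, cost2: float):
    """Solve cost(c) = overhead / c + body for (overhead, body) from two data points."""
    overhead = (cost1 - cost2) / (1 / c1 - 1 / c2)
    body = cost1 - overhead / c1
    return overhead, body


def predicted_cost(c: int, overhead: float, body: float) -> float:
    """Per-object cost in bytes for group size c under the assumed model."""
    return overhead / c + body


# Measured bytes/object from the ingestion runs in the comments.
overhead, body = fit_amortized_model(10, 417, 100, 317)

for c in (10, 100, 1000):
    print(c, round(predicted_cost(c, overhead, body)))
```

Under this model the fixed overhead comes out at roughly 1.1K per statement, and going from c=100 to c=1000 saves only about 10 bytes/object, which fits the point that further compression gains little.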