aditsachde closed this issue 2 years ago
Rewriting it in Rust sounds interesting. Do you plan to use existing libraries as a basis or do it from scratch?
I'd like to use existing libraries as a base as much as possible, but there are some missing pieces such as a spec-compliant HAMT implementation that I intend to write from scratch.
This microgrant has been approved! Look out for an email to submit payment details. Let's use your existing Community Discussion thread for updates.
This project ended up taking longer than expected, due to three rewrites needed to get performance to acceptable levels. This turned out to be much more challenging than anticipated, as detailed below. I may come back to this project as the IPLD ecosystem matures, but for now, it is mostly complete and hopefully useful for others to learn from.
Building a HAMT is done in three steps: load the data, build the tree, and serialize it. First, clone the repo and run cargo build --release
Load the Data
Start by modifying src/bin/load_data_ser_blocks.rs to match the format of your dataset.
Run the following command. The last argument is optional; it is useful when building multiple HAMTs with different keys from the same dataset.
target/release/load_data_ser_blocks <block_db> <data file> <number of records> <block car>
Build the Tree
Run the following command. Note down the Root CID output at the end of this step, as the CID printed when importing the generated .car files is not the same as this one.
target/release/build_tree_par <block_db> <tree_db> <width> <bucket size>
For small HAMTs (less than a thousand or so entries), this may result in an error. Instead, run the single threaded version.
target/release/build_tree <block_db> <tree_db> <width> <bucket size>
Serialize the Tree
target/release/serialize_tree_car <tree_db> <tree car>
Run a Query
This can be done using any standard HAMT library in any language. However, a small demo is provided. It requires a local IPFS daemon.
Import the block and tree cars generated in the previous steps into the local node with ipfs dag import <block car> and ipfs dag import <tree car>. Make sure that the local IPFS daemon is using the Badger block store (ipfs init --profile=badgerds), as the import will be extremely slow otherwise.
target/release/query_key <root_cid> <key>
If the record exists in the HAMT, a CID will be returned. ipfs dag get <cid> can be used to retrieve the corresponding JSON record.
Datasets
There are two pregenerated datasets available. They can be queried directly without importing the car files from the provided locations, as the blocks are present on the IPFS network. However, this will likely be unusably slow, depending on the number of copies of the tree present in the network.
About two million records, available on the releases page.
bafyreiatlxnr6ilfyxuphywglyxsnxzobfpxk5fbjk4lef7bawirbsp2x4
All records (over 68 million), available on IPFS as QmXiGQJn1JHNp339WUvPAtqfcRh9u1vdUp8L5s2HtDVqQE
bafyreiblsloewhro5civi25yptqb2mosxkpc6py6be6o77rnv4krsblymq
The goal of this project was to demonstrate and provide tools for querying a dataset on IPFS by some key. Generally with IPFS, a client would download a dataset, such as a sqlite file, and then perform a lookup. This works fine for small datasets, but what about very large ones, such as the complete Open Library data dump?
IPFS provides a solution to this in the form of InterPlanetary Linked Data (IPLD), and one of its specs, a hash array mapped trie (HAMT) adapted for IPFS. The concept is quite interesting, but actually using it can be difficult, especially for such a large dataset.
IPLD's HAMT can be thought of as a Merkle tree where operations are performed on the hash of the key to provide even distribution. This project ended up being implemented three different times to achieve acceptable performance for the demonstration dataset, the Open Library dump, which is 50GB uncompressed with over 68 million records.
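As a concrete illustration (my own sketch, not the project's code): with a width of 256, each level of the tree consumes one byte of the key's hash to pick a child slot.

```rust
// Sketch of how a HAMT derives a child slot from the key's hash.
// Assumes a width of 256, so each level consumes 8 bits (one byte);
// the IPLD spec allows other widths and hash functions.
fn slot_at_depth(hash: &[u8], depth: usize) -> usize {
    hash[depth] as usize
}

fn main() {
    let hash = [0x3a, 0x91, 0x07]; // stand-in for a real key hash
    // The root uses the first byte, its child the second, and so on.
    assert_eq!(slot_at_depth(&hash, 0), 0x3a);
    assert_eq!(slot_at_depth(&hash, 1), 0x91);
}
```

Because the hash is effectively random, these slot indices are too, which is exactly what gives the even distribution described above.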
Follow the spec exactly. This works, but it's not very useful for such a large HAMT. For every insert, the node is serialized, its parent is updated with the new hash, the parent is serialized, and so on up to the root. The very next insert does the same thing and changes the hash again. Unless every intermediate state is needed, a lot of work is done that is immediately thrown away.
Model the tree in memory. Implement it as a traditional HAMT, using pointers instead of the hashes IPLD requires. Once all entries are inserted, convert it to the IPLD form in a single step.
use bitvec::prelude::*;
use serde::Serialize;

pub struct Node<V: Serialize> {
    pub map: BitVec<Lsb0, u8>,
    pub data: Vec<Element<V>>,
}

pub enum Element<V: Serialize> {
    HashMapNode(HashMapNode<V>),
    Bucket(Vec<BucketEntry<V>>),
}

struct BucketEntry<V: Serialize> {
    key: Vec<u8>,
    value: V,
}
This is better, but it is still slow. The obvious issues are that it's single threaded and that memory usage is very high, since everything gets loaded into memory. The biggest issue, though, is pointer chasing. To traverse a node, the map has to be fetched, then the data vector, and finally the next node. Three pointers have to be dereferenced per node, and every insertion traverses multiple nodes.
Why does this end up being such an issue, though? How come languages like Java can be fast while storing everything on the heap behind pointers? It turns out that fetching from main memory is very slow compared to the CPU caches, and HAMTs destroy cache hit ratios. Even if keys are inserted sequentially, they are placed in the tree based on their hash, which is essentially random. Every insert therefore needs a random set of nodes, and since caches are quite small, nodes tend to be evicted before they're used again.
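The pointer chasing can be made concrete with a simplified lookup over this representation (a sketch with pared-down types, not the project's code): each level touches the node, its presence map, and its data vector before the next child can even be reached.

```rust
// Simplified pointer-based HAMT node. Real keys would be stored
// alongside their hashes; here the hash bytes double as the key.
struct Node {
    map: Vec<bool>,              // one presence flag per slot
    data: Vec<Element>,          // only the occupied slots
}

enum Element {
    Child(Box<Node>),            // heap pointer to the next level
    Bucket(Vec<(Vec<u8>, u64)>), // key/value pairs
}

fn get(mut node: &Node, hash: &[u8]) -> Option<u64> {
    for &byte in hash {
        let slot = byte as usize;
        if !node.map[slot] {
            return None; // slot empty: key absent
        }
        // Index into `data` = number of occupied slots before `slot`.
        let idx = node.map[..slot].iter().filter(|b| **b).count();
        match &node.data[idx] {
            Element::Child(child) => node = child, // yet another dereference
            Element::Bucket(entries) => {
                return entries
                    .iter()
                    .find(|(k, _)| k.as_slice() == hash)
                    .map(|(_, v)| *v)
            }
        }
    }
    None
}

fn main() {
    // One root node with a single bucket at slot 0x3a.
    let mut map = vec![false; 256];
    map[0x3a] = true;
    let root = Node {
        map,
        data: vec![Element::Bucket(vec![(vec![0x3a], 7)])],
    };
    assert_eq!(get(&root, &[0x3a]), Some(7));
    assert_eq!(get(&root, &[0x00]), None);
}
```

With random hashes, each of those dereferences is likely a cache miss, which is what makes this version slow.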
Model the tree with const generics. Rust recently gained support for const generics, allowing structs to hold arrays with sizes known at compile time, so they can be inlined instead of requiring a pointer.
pub struct Node<V, const N: usize> {
    pub map: [bool; N],
    pub data: [Element<V, N>; N],
}

pub enum Element<V, const N: usize> {
    HashMapNode(HashMapNode<V, N>),
    Bucket(Vec<BucketEntry<V>>),
}

struct BucketEntry<V> {
    key: Vec<u8>,
    value: V,
}
This solves the pointer chasing issue entirely. However, each node ends up being massive (tens of KB), despite most nodes being almost empty. With the previous representation, an on-disk store for nodes would have been nice to have, but a standard computer could still hold a large dataset in memory. With this one, an on-disk store for nodes is required to be useful (or a VPS with hundreds of gigabytes of RAM). Additionally, as mentioned in the previous section, the randomness of the key hashes means some blocks inevitably end up having to be fetched from disk.
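The blowup is easy to check with a rough size calculation (my own sketch with a simplified element type; the project's actual nodes carried more per slot and were larger still):

```rust
use std::mem::size_of;

// A simplified stand-in for the element type: a child pointer or a
// bucket of entries (a Vec alone is three words on 64-bit targets).
enum Element {
    Child(Box<[u8]>),
    Bucket(Vec<(Vec<u8>, u64)>),
}

// With const generics, the presence map and element array live inline
// in the node itself instead of behind pointers.
struct Node<const N: usize> {
    map: [bool; N],
    data: [Element; N],
}

fn main() {
    // Even this stripped-down node is several KB at width 256, and the
    // full array is allocated no matter how few slots are occupied.
    println!("element: {} bytes", size_of::<Element>());
    println!("width-256 node: {} bytes", size_of::<Node<256>>());
}
```

Since most nodes in a large HAMT are nearly empty, almost all of that inline space is wasted.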
All of these approaches have some sort of issue that causes slowdowns or adds a lot of complexity. At this point, it was time to take a step back and reevaluate the project. The original goal was a complete IPLD HAMT library that could efficiently support building large trees, performing partial updates to an existing HAMT, and a variety of other supporting functions. More details are in the lessons learned section, but I decided to focus on doing a small portion well.
Like many others, the Open Library dataset is a collection of JSON records. Instead of storing each record directly in the HAMT, it is better to first serialize the record into its own document and store its CID in the HAMT. This way, it's possible to have multiple HAMTs with different lookup keys backed by the same deduplicated dataset in IPFS.
The steps for building a HAMT are as follows: hash a key, traverse the tree to the right node, and insert the entry, repeating for each key in turn.
Let's reorder the steps: (1) hash every key up front, (2) sort the entries by their hash, and (3) insert them into the tree in hash order.
This only does a small portion of what the previous approaches did, but it fixes all of their issues for that portion. Steps 1 and 2 can be trivially parallelized. Thanks to step 3, we can go back to modeling nodes with vectors and pointers: since inserts happen in hash order, the right nodes end up in cache and pointers are fast again.
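The reordered pipeline can be sketched as follows (std's DefaultHasher stands in for whatever hash function the HAMT actually uses; the tree insert itself is elided):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Step 1: hash every key.
fn hash_key(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    let keys = ["apple", "banana", "cherry", "durian"];
    let mut entries: Vec<(u64, &str)> =
        keys.iter().map(|&k| (hash_key(k), k)).collect();

    // Step 2: sort by hash. This is what sled's key ordering provides
    // for free during preprocessing.
    entries.sort_by_key(|&(h, _)| h);

    // Step 3: insert in hash order. Consecutive inserts now touch
    // neighboring parts of the tree, so the hot nodes stay in cache.
    assert!(entries.windows(2).all(|w| w[0].0 <= w[1].0));
}
```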
A HAMT node also has multiple children, and iterating in hash order makes it obvious upfront which child of the root a given entry belongs to. Each child of the root can therefore be treated as its own tree and built on its own core, allowing for parallelism.
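One way to sketch that partitioning (assuming a width of 256, so the top byte of the hash picks the root's child; the per-chunk "build" is reduced to a count here):

```rust
use std::thread;

fn main() {
    // Pretend these are key hashes; sort them as the preprocessing
    // step would.
    let mut hashes: Vec<u64> = (0u64..1024)
        .map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15))
        .collect();
    hashes.sort();

    // In a sorted run, entries sharing a top byte are contiguous, so
    // the run splits into one chunk per occupied root child.
    let mut chunks: Vec<&[u64]> = Vec::new();
    let mut start = 0;
    for i in 1..=hashes.len() {
        if i == hashes.len() || hashes[i] >> 56 != hashes[start] >> 56 {
            chunks.push(&hashes[start..i]);
            start = i;
        }
    }

    // Build each root child on its own thread; scoped threads let the
    // workers borrow the sorted slice directly.
    let total: usize = thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&chunk| s.spawn(move || chunk.len()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });
    assert_eq!(total, hashes.len());
}
```

Because no two chunks ever touch the same subtree, the workers need no locking at all.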
For a dataset of 8 million keys, the preprocessing step took about 1 minute 30 seconds. Building the tree itself took under 25 seconds.
Unfortunately, the preprocessing step does not scale linearly. On my machine, due to some combination of sled (the underlying KV store), the way B-trees work, and a relatively slow SSD, the preprocessing step took around 35 minutes for a dataset of 68 million keys. Building the tree itself took a bit under 5 minutes.
Note that the preprocessing step is required. Sled stores KV pairs sorted lexicographically, and since hashes are random, maintaining this order requires a lot of additional work. However, the lexicographic ordering is what makes everything else possible in a reasonable amount of time. Additionally, the preprocessing step serializes the dataset to a car file; this part can be omitted on subsequent runs, speeding up the step.
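The role sled's ordering plays can be illustrated with std's BTreeMap, which likewise keeps byte keys in lexicographic order (a sketch only; the project itself uses sled):

```rust
use std::collections::BTreeMap;

// BTreeMap stands in for sled here: both keep byte keys sorted
// lexicographically, so the hash -> record-CID pairs written during
// preprocessing come back out in exactly the order the build needs.
fn main() {
    let mut store: BTreeMap<Vec<u8>, &str> = BTreeMap::new();

    // Inserts arrive in arbitrary (hash) order...
    store.insert(vec![0x9b, 0x01], "cid-c");
    store.insert(vec![0x12, 0xff], "cid-a");
    store.insert(vec![0x55, 0x00], "cid-b");

    // ...but iteration yields them sorted by hash, ready for the
    // in-order insertion pass.
    let order: Vec<&str> = store.values().copied().collect();
    assert_eq!(order, vec!["cid-a", "cid-b", "cid-c"]);
}
```

The cost of keeping that order (B-tree rebalancing on random keys) is exactly where the nonlinear preprocessing time goes.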
Overall, I'm happy with this. It is still much, much faster than the other implementations. There are various places where it could be algorithmically improved, but I chose not to implement these, as the bottleneck was IO.
1. What is your project? (max 100 words)
IPLD Distributed Datasets provides tooling for storing datasets using IPLD advanced data types.
IPLD can be used to store datasets in a way that lets clients perform lookups in a decentralized manner. Clients only fetch the specific amount of data required to find and validate a key/value lookup. However, libraries and tooling around these concepts do not currently exist, and this project aims to change that.
https://github.com/aditsachde/ipld-distributed-datasets
The linked repository has a demo implementation.
2. How are you planning to improve this project? (max 200 words)
3. Will the work be Open Source?
Yes, the project is MIT licensed
4. If selected, do you agree to complete weekly updates and a grant report upon conclusion?
Yes
5. Does your proposal comply with our Community Code of Conduct?
Yes
6. Links and submissions
Have you submitted to the IPFS Community Showcase (instructions)?
Have you added your project to the IPFS Ecosystem Directory? Yes
If your project began at a hackathon, have you submitted it for the relevant Protocol Labs prizes? Include links here if available: N/A, did not start at hackathon
Have you filled out the Interplanetary Builder Feedback Survey? N/A, did not start at hackathon
Additional questions:
For each team member(s), please list name, email, Github account, and role in the project. Adit Sachde contact [at] aditsachde.com @aditsachde Developer
If your project was created as part of an event or hackathon: N/A