ronomon / hash-table

Fast, reliable cuckoo hash table for Node.js.
MIT License

Question: Is there a way to enumerate the keys of a given hashmap? #2

Closed jtenner closed 5 years ago

jtenner commented 5 years ago

Just wondering if this is possible. I'm not familiar with the specification for this data structure, and it's not required for my use-case. Just curious.

jorangreef commented 5 years ago

Yes, it is possible but not implemented at present. @ronomon/hash-table is designed for billions of keys, so iterating keys would need a careful design. I also want to avoid increasing the surface area of the interface.

jtenner commented 5 years ago

Totally understandable.

Just one more question. Is there a way to serialize the state of the entire hashtable, and store it to disk?

EDIT: I'm looking to use this data structure as a really performant key-value system for a MUD engine I'm developing (using WebAssembly, so a Buffer is already the key type I would need to read from within wasm memory anyway).

I need to be able to save/restore the data to disk every now and then.

If the method is grokkable and you would rather not add it to the API, I would not mind writing a custom method and maintaining it myself to obtain such a feature. I just need some help.

jtenner commented 5 years ago

I mean, my guess is this is not encouraged. :) Any advice?

jorangreef commented 5 years ago

Is there a way to serialize the state of the entire hashtable, and store it to disk?

Sure, you just need to use deterministic entropy when generating the TABLE used by the Tabulation Hash: https://github.com/ronomon/hash-table/blob/master/index.js#L74

If TABLE was exposed on the module's exports, you could change the entropy to be deterministic, i.e. replace the signed ints with deterministic signed ints. That way you could use a static method to patch the module externally and dynamically without having to maintain your own fork. The reason we use random entropy is to keep the underlying entropy secret in order to prevent hash flooding attacks. If you make your entropy deterministic, you should probably also still keep it secret.

And then you would need to iterate the HashTable's Tables to write each Table's buffer to disk, and then replace the HashTable's Table buffers when loading the buffers back from disk:

https://github.com/ronomon/hash-table/blob/master/index.js#L126

Perhaps give this a shot first by modifying a fork yourself and let me know how it goes.

jorangreef commented 5 years ago

By the way, saving whole buffers would also be much faster than iterating keys.

Something else you could look into is a Bitcask design, i.e. only use the hash table as a temporary record of where things are on disk. You scan your files at startup and populate the hash table with the offsets of everything. When you update records, you first write them to disk, then you update your hash table with the new offset. Occasionally, you compact your files on disk. That way, you don't have to persist the hash table, and you can crash at any time.

A Bitcask design would work really well for up to 100 million items, at which point your startup parsing time would be around a minute. Thereafter, you could look into an LSM design so that you don't have to parse offsets at startup.

jtenner commented 5 years ago

Okay. First of all, the performance on this HashMap implementation is totally bonkers.

Second, the fork suggestion makes sense. I gathered that the TABLE would need to be deterministic. Good news is that can be configured. I plan on making each hash map have its own table and get stored along with the data.

Questions:

Is the number of Buffers that need to be copied the same for every HashMap?

Will I have to store the number of Buffers along with the data (and as a consequence construct additional buffers manually on load?)

What other data needs to be stored if any to make sure the HashMap initializes correctly?

Finally, thank you for this amazing software.

jorangreef commented 5 years ago

Is the number of Buffers that need to be copied the same for every HashMap?

Yes, provided all four arguments passed here are exactly the same, and provided you are using the same major version of this module: https://github.com/ronomon/hash-table/blob/master/index.js#L84

You can understand how elementsMin and elementsMax feed into the number of buffers here: https://github.com/ronomon/hash-table/blob/master/index.js#L100-L102

You should probably store the number of buffers along with the data, if only so that you can cross-check it for extra safety after instantiating the HashTable and right before replacing its buffers. The number of buffers should never change, but at least you would detect it if it does.

Bear in mind that the buffers for each HashTable's Table shard can be of different sizes, since they are resized independently. e.g. A HashTable could have between HashTable.BUFFERS_MIN (1) and HashTable.BUFFERS_MAX (8192) Table shards, and each of these could have a different sized buffer.

Another safety check you could do is to use a random sentinel key that you store separately, and assert that this exists after loading from disk. It's a fast run-time check. Using random sentinels for every HashMap would give you slightly more coverage than a static sentinel.

Finally, thank you for this amazing software.

It's a pleasure. Let me know what you build with it and keep me posted how it goes.

jtenner commented 5 years ago

Plan

So my plan is to make an entire MUD engine with AssemblyScript-compiled plugins that are installed via npm. These plugins are essentially written in TypeScript and then compiled (and sandboxed) inside of a WebAssembly module. So all the examples I will show you are written in TypeScript that executes like linear bytecode. The goal is to get something working so that developers can run a few npm commands and get a working MUD server with sandboxed plugins very quickly.

I plan on shipping the AssemblyScript compiler with the software as a CLI.

The goal of using this hash-table algorithm is to take advantage of the fact that WebAssembly has a linear memory model backed by a single ArrayBuffer. What's more, when you return a reference from within WebAssembly, it's already guaranteed to be an i32 value which is treated as a pointer into that ArrayBuffer.

This is what a plugin developer will call from within their plugin to make sure that their table exists.

class T {}
class U {}

// returns an object that points to a JavaScript hosted hashtable
let statsTable = plugin.ensureTable<T, U>("stats");

The AssemblyScript compiler can actually calculate the size of T and U at compile time. We can assert that offsetof<T>() < KEY_SIZE_MAX and offsetof<U>() < VALUE_SIZE_MAX when the module itself is compiled at runtime to WebAssembly. This will help us emit useful compile time errors for developers who want to test their plugin on a live server.

One last point about T, if the size of T is not a multiple of 4, we can simply store a single temp buffer with a proper size key to get the key data out of WebAssembly every time we need to do any kind of operation on the data.
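That temp-buffer approach can be sketched like this. `keySize` here is a hypothetical stand-in for the table's fixed key width; the point is to reuse one zero-padded scratch buffer rather than allocate per operation.

```javascript
// Pad a raw key Buffer (whose size may not be a multiple of 4, or may be
// smaller than the table's fixed key width) into a reusable zero-padded
// scratch buffer before each table operation.
function makeKeyPadder(keySize) {
  const scratch = Buffer.alloc(keySize); // reused across calls, not retained
  return function pad(rawKey) {
    if (rawKey.length > keySize) throw new Error('key too large');
    scratch.fill(0);
    rawKey.copy(scratch, 0);
    return scratch;
  };
}
```

Because the scratch buffer is reused, callers must consume the returned key before the next `pad` call.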

At this point, this software should check the filesystem for a database that is located at ./__data/{plugin-name}/{table}/. Each shard itself can have it's own file located at ./__data/{plugin-name}/{table}/{shard-index}.bin and each table will have a meta.json file associated with it in the same folder. The JSON file will store all the metadata about each shard, including shard size and a hash to verify shard integrity at load time. This file should get updated every time a save occurs, calculating a new hash for each shard that changed.

Questions

  1. Is there a good way to keep track of each shard that has been modified so I only have to open file descriptors for ones that were modified?

  2. For storage performance, how should I instruct users to use the cache() vs the set() function? I know they should be used in a mutually exclusive manner, but it would be nice to have documentation about it so people don't get confused. My target developer base is a lowest common denominator JavaScript developer (not that this is a bad thing of course.)

If I have any more questions I'm sure I'll post again. Thank you for your help.

jtenner commented 5 years ago

Initial Progress

Here's what I have so far: https://github.com/jtenner/morp/tree/master/packages/hash-table

I've nested the implementation in a lerna repo along with the mud engine for later. I've also ported the source code to TypeScript as a learning experience, and found out a lot about how it works. Despite my silly choices, it's still very fast and passes the tests locally on my laptop.

Still todo:

Also, I noticed a variable that is unused in one of the functions:

  cache(
    h1: number,
    h2: number,
    key: Buffer,
    keyOffset: number,
    value: Buffer,
    valueOffset: number,
  ): 0 | 1 | 2 {
    // See comments in set():
    var tag = (h1 >> 16) & 255;
    var b1 = (h1 & this.mask) * this.bucket;
    var b2 = (h2 & this.mask) * this.bucket;

My ide is reporting that b2 is unused here. Is that correct?

jtenner commented 5 years ago

Progress

I was successful in developing an example of how I'm going to reload the HashTable from disk. Would you mind taking a look at the example I have here?

https://github.com/jtenner/morp/blob/master/packages/hash-table/test/reload.ts

This file shows how it would be done. I just want to make sure I haven't forgotten anything!

Edit:


Here is an example of the file structure I developed. Please see HashTable.fromFolder(folder: string): HashTable for the function that loads the data from disk.

jorangreef commented 5 years ago

So my plan is to make an entire MUD engine with AssemblyScript-compiled plugins that are installed via npm. These plugins are essentially written in TypeScript and then compiled (and sandboxed) inside of a WebAssembly module. So all the examples I will show you are written in TypeScript that executes like linear bytecode. The goal is to get something working so that developers can run a few npm commands and get a working MUD server with sandboxed plugins very quickly.

That's very cool.

The goal of using this hash-table algorithm is to take advantage of the fact that WebAssembly has a linear memory model backed by a single ArrayBuffer.

Yes, @ronomon/hash-table was designed to eliminate pointers and make use of large slabs of memory, so it's a good fit here.

Is there a good way to keep track of each shard that has been modified so I only have to open file descriptors for ones that were modified?

You could patch the non-read methods to mark slabs as dirty, using a dirty bit.
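A minimal sketch of that dirty-bit bookkeeping, separate from the table itself: the write-path patch would call `mark()` with the shard that owns the key, and the save routine would `drain()` to learn which shard files need rewriting. How a key maps to a shard index is an internal detail, so `shardIndex` here is a hypothetical input.

```javascript
// One dirty bit per Table shard: mutations mark a shard, and a save drains
// the set of dirty shards so only their files need to be rewritten.
class DirtyTracker {
  constructor(shardCount) {
    this.dirty = new Uint8Array(shardCount); // one bit (byte) per shard
  }
  mark(shardIndex) {
    this.dirty[shardIndex] = 1; // called from patched set()/unset()
  }
  // Return the indices of shards changed since the last save, clearing the
  // bits so the next save only sees new mutations.
  drain() {
    const changed = [];
    for (let i = 0; i < this.dirty.length; i++) {
      if (this.dirty[i]) {
        changed.push(i);
        this.dirty[i] = 0;
      }
    }
    return changed;
  }
}
```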

For storage performance, how should I instruct users to use the cache() vs the set() function? I know they should be used in a mutually exclusive manner, but it would be nice to have documentation about it so people don't get confused. My target developer base is a lowest common denominator JavaScript developer (not that this is a bad thing of course.)

Just don't mention the existence of cache().

Also, I noticed a variable that is unused in one of the functions:

Thanks, I will issue a patch.

Would you mind taking a look at the example I have here?

Looks good, I would just make elementsMin and elementsMax explicit here (these are optional defaults but like we said before they change the number of buffers): https://github.com/jtenner/morp/blob/master/packages/hash-table/test/reload.ts#L12

You could also extend the test with a loop to test through multiple rounds of saves/restores, checking at each restore, and right at the end.
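The shape of that multi-round test, independent of the actual serialization code: `save` and `restore` here are hypothetical stand-ins (a JSON round-trip of a plain Map), but the loop structure, with a full verification after every restore, is the part that carries over.

```javascript
// Hypothetical save/restore pair: round-trip a Map through JSON. In the
// real test these would be the disk save/load functions under test.
const save = (table) => JSON.stringify([...table.entries()]);
const restore = (data) => new Map(JSON.parse(data));

// Run several rounds of mutate -> save -> restore, verifying every key
// inserted so far after each restore, so a bug that only corrupts state
// on a later round is still caught.
function testRounds(rounds) {
  let table = new Map();
  for (let round = 0; round < rounds; round++) {
    table.set('key-' + round, 'value-' + round);
    table = restore(save(table));
    for (let i = 0; i <= round; i++) {
      if (table.get('key-' + i) !== 'value-' + i) return false;
    }
  }
  return true;
}
```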

jtenner commented 5 years ago

The reason why I left the min and max parameters alone is because I'm not sure what reasonable initialization options might be. Thanks for taking a look! It was a huge help.

jorangreef commented 5 years ago

The reason why I left the min and max parameters alone is because I'm not sure what reasonable initialization options might be.

You could set elementsMin to 1024 and elementsMax to 4194304.

Thanks for taking a look! It was a huge help.

Sure, it's a pleasure!