massnetorg / MassNet-miner

MassNet-miner is a Golang implementation of MassNet full-node miner.
MIT License

Hybrid hash maps with bit-field based sequential + indexed lookup #2

Open attilaolah opened 3 years ago

attilaolah commented 3 years ago

Type B plot files are currently very sparse: they seem to contain about 67% zero values (this varies from file to file, but the bigger the plot file, the closer it gets to this average). This appears to be a consequence of SHA256 being a uniform hash function and of the way the hashing algorithm is designed: there end up being about twice as many zero values as non-zero values.

That means plot files could be compressed by almost 2/3 by introducing an intermediary bit field in which each bit corresponds to a value in the hash table. A zero bit means the value is zero and is therefore missing from the final hash table, which would contain only the non-zero values.

As an example, with a bit length of 24, the size of an entry (i.e. the combined width of x and x') is 6 bytes, and there are 2^24 entries, for a total of 96 MiB. A bit field covering all these entries would use 2^24 bits, which is 2^21 bytes, or 2 MiB. With roughly 2/3 of the entries zero, the plot file would grow by the 2 MiB bit field but shrink by about 64 MiB of dropped entries, for a total file size of roughly 34 MiB.
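The size arithmetic can be checked mechanically. A minimal sketch in Go (the project's language), assuming each entry is a pair of bit_len-bit values and that about 2/3 of entries are zero:

```go
package main

import "fmt"

func main() {
	const MiB = 1 << 20
	for _, bitLen := range []uint64{24, 32, 40} {
		entries := uint64(1) << bitLen
		entrySize := 2 * bitLen / 8 // x and x' are bitLen bits each
		full := entries * entrySize // dense (Type B) table size in bytes
		bitField := entries / 8     // one bit per entry
		saved := full * 2 / 3       // ~2/3 of the entries are zero
		compacted := full - saved + bitField
		fmt.Printf("bit_len=%d full=%d MiB bitfield=%d MiB compacted≈%d MiB\n",
			bitLen, full/MiB, bitField/MiB, compacted/MiB)
	}
}
```

For bit_len 24 this prints full=96 MiB, bitfield=2 MiB, compacted≈34 MiB, matching the worked example.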

This applies similarly to plot files with other bit lengths, although the savings are greatest when the bit length is a multiple of 8 (so 24, 32 and 40 are the best choices). The result is roughly a 65% reduction in file size, and thus in required capacity; the only trade-off is slightly more memory usage and an extra read from disk to consult the bit field before accessing the hash map.

To keep things simple and backwards compatible, I'd suggest introducing a third type of table: Type C. This way existing Type B tables could also be upgraded to Type C, without having to re-compute all the hashes. (And the actual compaction of Type B to Type C tables would be fairly quick anyway.)
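The Type B to Type C compaction described above could be sketched roughly like this (the in-memory layout and names here are illustrative only, not the actual plot-file format):

```go
package main

import "fmt"

// compact turns a dense table of fixed-size entries (the Type B layout)
// into a bit field plus the non-zero entries only (the proposed Type C
// layout). No hashes are recomputed; this is a single pass over the table.
func compact(table [][]byte) (bitField []byte, values [][]byte) {
	bitField = make([]byte, (len(table)+7)/8)
	for i, entry := range table {
		if !isZero(entry) {
			bitField[i/8] |= 1 << uint(i%8)
			values = append(values, entry)
		}
	}
	return bitField, values
}

func isZero(entry []byte) bool {
	for _, b := range entry {
		if b != 0 {
			return false
		}
	}
	return true
}

func main() {
	// Four 2-byte entries, two of them zero.
	table := [][]byte{{0, 0}, {1, 2}, {0, 0}, {3, 4}}
	bitField, values := compact(table)
	fmt.Printf("%08b %v\n", bitField[0], values)
	// → 00001010 [[1 2] [3 4]]
}
```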

The rest of the code, including the protocol itself, would not have to be changed.

massnetorg commented 3 years ago

Hi @attilaolah , nice idea to transform HashMapB (Hash Query) to HashMapC (Sequential Query).

It would be very useful if the extra query workload is acceptable for most PC users. (It should be acceptable for professional users with extremely fast IO.)

Implementation will be added after the evaluation.

attilaolah commented 3 years ago

I was hesitant to call it "sequential lookup", since much of the lookup itself would still be index-based. But you're right: for a sizeable plot it would still require quite a few reads for every challenge.

The sequential part of the operation here is the sequential read of the bit-field header. This can be done on every read to stay low on memory usage, or the bit field could be kept in memory to trade memory for storage, although I don't think that trade-off is really worth it, since the memory requirement would be a little over 2% of the original (pre-compaction) capacity.
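For completeness, the lookup side is a rank query over the bit field: entry c's position in the compacted table is the number of set bits before position c. A minimal sketch, assuming one bit per entry, least-significant bit first (names are illustrative):

```go
package main

import (
	"fmt"
	"math/bits"
)

// rank returns the index of entry c in the compacted value table, or -1
// if the bit field marks entry c as zero (i.e. it was dropped). This is
// the naive variant that scans the bit field from the start; the field
// can be re-read per query or kept in memory.
func rank(bitField []byte, c int) int {
	if bitField[c/8]&(1<<uint(c%8)) == 0 {
		return -1
	}
	n := 0
	for _, b := range bitField[:c/8] {
		n += bits.OnesCount8(b)
	}
	// Count set bits in the same byte, below position c.
	n += bits.OnesCount8(bitField[c/8] & (1<<uint(c%8) - 1))
	return n
}

func main() {
	bitField := []byte{0b00001010} // entries 1 and 3 are non-zero
	fmt.Println(rank(bitField, 0), rank(bitField, 1), rank(bitField, 3))
	// → -1 0 1
}
```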

The extra memory usage would be based on bit length of course:

| bit_len | 24 | 32 | 40 |
| --- | --- | --- | --- |
| Bit Field Size | 2 MiB | 512 MiB | 128 GiB |
| Storage Savings | 64 MiB | 21.3 GiB | 6.7 TiB |

So I believe the final workload would be reasonably low, as you said, for users with very fast IO. That might not be the case for everyday users, although folks who just want to harvest a few GiB of space on their everyday PC might find it acceptable to keep the bit fields in memory. Hard to know.

In any case, I'm not asking here to implement this for everyone, I just wanted to have this discussion out there, in case someone is interested in this tradeoff. Thanks for the response!

attilaolah commented 3 years ago

Alternatively, you could put the bit field outside of the plot file, in a separate index file. This index file could be matched to its plot file by the hash in the filename, but it could live somewhere else, e.g. on a faster SSD, while the plot files stay on slower HDDs or even network-attached storage.

(Just thinking out loud here.)

attilaolah commented 3 years ago

On second thought, it should also be possible to index the bit-field file itself, reducing the amount of data to be read to log2() of the original.

One way to do that would be to build a binary tree that pre-computes the numbers of zeros. For example: if challenge c lands in the second half of the index file, we could have the number of zeros in the first half pre-computed; that way we only need to start reading from the middle of the index file. This can be applied recursively, halving the maximum amount of data to be read at each step (that's where the log2() comes from).
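A sketch of that pre-computed-halves idea (the bit field is modelled as a bool slice for clarity, a power-of-two length is assumed, and all names are illustrative):

```go
package main

import "fmt"

// buildZeroTree precomputes, for every power-of-two segment of the bit
// field, the number of zeros in the segment's left half. Level 0 splits
// the whole field in two; each deeper level halves the segments again.
func buildZeroTree(field []bool) [][]int {
	var tree [][]int
	for half := len(field) / 2; half >= 1; half /= 2 {
		level := make([]int, len(field)/(2*half))
		for i := range level {
			for j := 2 * i * half; j < (2*i+1)*half; j++ {
				if !field[j] {
					level[i]++
				}
			}
		}
		tree = append(tree, level)
	}
	return tree
}

// zerosBefore counts zeros in field[:c] by descending the tree: whenever
// c falls in the right half of the current segment, the pre-computed zero
// count of the left half is added. Only O(log2 n) values are touched.
func zerosBefore(tree [][]int, n, c int) int {
	zeros, lo, d := 0, 0, 0
	for half := n / 2; half >= 1; half /= 2 {
		if c >= lo+half {
			zeros += tree[d][lo/(2*half)]
			lo += half
		}
		d++
	}
	return zeros
}

func main() {
	field := []bool{true, false, true, true, false, false, true, false}
	tree := buildZeroTree(field)
	fmt.Println(zerosBefore(tree, len(field), 5)) // zeros among the first 5 entries
	// → 2
}
```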


An alternative would be to define a limit on how much data we are willing to read from each index file. Say it is 32 KiB (or whatever fits nicely in the block size of the underlying file system). We then create a small hashmap that contains the number of zeros in the first n 32 KiB chunks. When we get a challenge c, we first look up the relevant chunk's entry in the small hashmap, then read the 32 KiB chunk of the bit field, count the number of zeros in that chunk up until c, subtract that from c, also subtract the small-hashmap result from c, and use the final value to do the lookup in the plot file.

This approach would limit the extra data read from the index to 32 KiB per challenge. At that point, it probably no longer makes sense to split the index out into a separate file. It does add a little to the data to be stored, in exchange for the savings on reads.

Assuming we use uint32 (so 4 bytes) to store the number of zeros in the index up until the chunk to be read:

| bit_len | 24 | 32 | 40 |
| --- | --- | --- | --- |
| Index Size | 2 MiB + 256 B | 512 MiB + 64 KiB | 128 GiB + 16 MiB |
| Bytes Read | 4 B + 32 KiB + 6 B | 4 B + 32 KiB + 8 B | 4 B + 32 KiB + 10 B |
| Storage Savings | 64 MiB | 21.3 GiB | 6.7 TiB |
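The chunked-index variant could be sketched like this. Names are illustrative; it counts set bits before c (equivalent to subtracting the zero count from c), and it uses uint64 counters rather than uint32, since for bit_len 40 the cumulative counts can exceed the uint32 range:

```go
package main

import (
	"fmt"
	"math/bits"
)

// buildChunkIndex stores, for each chunkSize-byte chunk of the bit field,
// the number of set bits (non-zero entries) in all preceding chunks.
func buildChunkIndex(bitField []byte, chunkSize int) []uint64 {
	index := make([]uint64, (len(bitField)+chunkSize-1)/chunkSize)
	var total uint64
	for i := range index {
		index[i] = total
		end := (i + 1) * chunkSize
		if end > len(bitField) {
			end = len(bitField)
		}
		for _, b := range bitField[i*chunkSize : end] {
			total += uint64(bits.OnesCount8(b))
		}
	}
	return index
}

// chunkedRank returns the position of entry c in the compacted table by
// reading one index entry plus at most one chunk of the bit field, or -1
// if entry c is zero.
func chunkedRank(bitField []byte, index []uint64, chunkSize, c int) int {
	if bitField[c/8]&(1<<uint(c%8)) == 0 {
		return -1
	}
	chunk := c / 8 / chunkSize
	n := index[chunk]
	for _, b := range bitField[chunk*chunkSize : c/8] {
		n += uint64(bits.OnesCount8(b))
	}
	n += uint64(bits.OnesCount8(bitField[c/8] & (1<<uint(c%8) - 1)))
	return int(n)
}

func main() {
	bitField := []byte{0b00001010, 0b11111111, 0b00000001}
	index := buildChunkIndex(bitField, 1) // 1-byte chunks for the demo
	fmt.Println(index, chunkedRank(bitField, index, 1, 16))
	// → [0 2 10] 10
}
```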