spotify / sparkey

Simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.
Apache License 2.0

Converting multiple log-files into a hash-file in bulk #45

Closed oersted closed 2 years ago

oersted commented 2 years ago

I'm looking to do parallel bulk writes.

Since log-files can only be written sequentially, a solution might be to write one log-file per concurrent process and then merge them. Merging the log-files into one can be done quite efficiently, but it might make more sense to allow sparkey_hash_write to convert multiple log-files into a hash-file directly.

It would save the extra merging work, which can only be done with limited parallelism, and it would also allow, for instance, spreading the log-files across multiple disks to improve write speeds (and read speeds when creating the hash table).

If more than one log-file were passed to sparkey_hash_write, might that also allow part of the process to be done in parallel? At least reading the log-files and (re-)hashing the keys.

I'd like to hear the maintainer's opinion on it.

EDIT: I see that the hash-table doesn't contain the key-values themselves but only addresses into the log-file. That means that multiple log-files (on multiple disks) might also improve general read performance.
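
For reference, the per-process write step could reuse the existing single-log API more or less as-is; a minimal sketch, adapted from the README example (the include path, filenames and compression settings are placeholders):

/* Sketch: each worker process writes its own log-file with the existing API.
   Adapted from the README example; error handling omitted. */
#include <sparkey/sparkey.h>   /* or "sparkey.h", depending on installation */
#include <string.h>

void write_worker_log(const char *log_filename) {
  sparkey_logwriter *writer;
  sparkey_logwriter_create(&writer, log_filename, SPARKEY_COMPRESSION_NONE, 0);

  const char *key = "mykey";
  const char *value = "myvalue";
  sparkey_logwriter_put(writer, strlen(key), (uint8_t *)key,
                        strlen(value), (uint8_t *)value);

  sparkey_logwriter_close(&writer);
}

The open question is only the last step: today each such log-file needs its own sparkey_hash_write call (one hash-file per log-file), rather than a single hash-file covering all of them.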

spkrka commented 2 years ago

It sounds like a good idea, but it would require changes to the hash file format so it is not a trivial change to make. I haven't thought this fully through yet, so this may not be 100% correct.

Currently the hash header contains:

uint32_t file_identifier;
uint64_t data_end;

That would need to be extended to support multiple identifiers and data_end values - one pair for each log file. That would also mean a dynamic header size, which would likely have other side effects on the code base.
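
Just to illustrate the shape of that change, a hypothetical in-memory layout might look like the following (purely illustrative; this is not a proposal for the actual on-disk encoding, which has more fields):

/* Hypothetical multi-log hash header - illustration only, not the real format. */
#include <stdint.h>

typedef struct {
  uint32_t num_logs;           /* number of log-files covered by this hash-file */
  uint32_t *file_identifiers;  /* one identifier per log-file */
  uint64_t *data_ends;         /* one data_end per log-file */
} multi_log_hash_header;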

Furthermore, each hash entry would need to be extended to also reference which log file the entry is found in. This would have an impact on the maximum log file size. Fortunately the current limit is fairly large - 64 bits of address space, minus some bits if block compression is enabled.

As an example, if the block with the most entries has 1000 entries, that would imply 10 bits for that. If we have 1000 log files, that would also imply 10 bits. That would leave us with 44 bits for addressing per log file, which is about 17 terabytes, so it's probably ok.
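
To make the arithmetic concrete, here is a rough sketch of that bit budget (the exact split and packing order are assumptions, not the actual entry encoding):

/* Illustration of the 64-bit address budget for a hash entry.
   The real sparkey entry encoding differs; this only demonstrates the arithmetic. */
#include <stdint.h>

enum {
  ENTRY_BITS   = 10,                              /* up to 1024 entries per block */
  LOGFILE_BITS = 10,                              /* up to 1024 log-files */
  OFFSET_BITS  = 64 - ENTRY_BITS - LOGFILE_BITS   /* 44 bits, about 17 TB per log-file */
};

static uint64_t pack_address(uint64_t offset, uint64_t logfile_idx, uint64_t entry_idx) {
  return (offset << (LOGFILE_BITS + ENTRY_BITS))
       | (logfile_idx << ENTRY_BITS)
       | entry_idx;
}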

Since you mentioned having log files on separate disks, you would also need some way for the hash file to find the log files. Currently you either specify both files explicitly, or implicitly assume the same directory and a specific naming convention.

All in all, I don't think this is a trivial change to make, but it could work and it could be useful. I don't have time myself to work on it, but I could likely find time to review PRs unless they grow too complicated.

oersted commented 2 years ago

Thank you for considering the proposal seriously. I might look into doing a PR, but I did manage to implement an alternate solution that is good enough for now (using lmdb), so it's no longer a priority.

spkrka commented 2 years ago

An alternative solution that would not require any code changes would be to manually shard the data into multiple sparkey files. Use a custom hash to map each key-value pair to the correct shard, and use the same logic for both writing and reading.
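
A minimal sketch of that shard-selection logic (the hash function and shard count are arbitrary choices here; what matters is that writers and readers use the exact same function):

/* Sketch: pick a shard for a key. The same function must be used both when
   writing the shard files and when deciding which shard to read from.
   FNV-1a is an arbitrary choice; any stable hash works. */
#include <stdint.h>
#include <stddef.h>

static uint32_t shard_for_key(const uint8_t *key, size_t keylen, uint32_t num_shards) {
  uint64_t h = 1469598103934665603ULL;   /* FNV-1a 64-bit offset basis */
  for (size_t i = 0; i < keylen; i++) {
    h ^= key[i];
    h *= 1099511628211ULL;               /* FNV-1a 64-bit prime */
  }
  return (uint32_t)(h % num_shards);
}

Each shard is then an ordinary log/hash file pair that can be written and read independently.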

We actually do exactly this in some of our data processing workflows: https://github.com/spotify/scio/blob/main/scio-extra/src/main/scala/com/spotify/scio/extra/sparkey/instances/ShardedSparkeyReader.scala

oersted commented 2 years ago

It's a good point, thanks for the contribution. I also built such an architecture with RocksDB a while back.

The sharded version you are describing works well as long as you pre-partition the data by hash-mod, so that each process writes to a single shard. This can be quite costly for some workloads and may be imbalanced if the distribution of keys is not uniform. You could, I suppose, use locking to write to all shards from all processes without pre-partitioning; with enough shards and a well-balanced key-space it could be performant.

We also had a similar alternative where each key could be in any of the shards. Each process writes all its key-values to a single shard, and then reads need to query every shard for each get. This clearly sacrifices read-speed, but it's quite simple and has very good write-speed and flexibility.
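
A rough sketch of that read path, assuming one log/hash pair per shard and using the hash_open/hash_get calls from the README (filenames and shard count are placeholders; in practice you would keep the readers open rather than re-opening them per lookup):

/* Sketch: look a key up in every shard until one has it.
   Error handling omitted; assumes shard files named shard-N.spl / shard-N.spi. */
#include <sparkey/sparkey.h>   /* or "sparkey.h", depending on installation */
#include <string.h>
#include <stdio.h>

int lookup_in_all_shards(const char *key, int num_shards) {
  for (int i = 0; i < num_shards; i++) {
    char hash_file[64], log_file[64];
    snprintf(hash_file, sizeof(hash_file), "shard-%d.spi", i);
    snprintf(log_file, sizeof(log_file), "shard-%d.spl", i);

    sparkey_hashreader *reader;
    sparkey_logiter *iter;
    sparkey_hash_open(&reader, hash_file, log_file);
    sparkey_logiter_create(&iter, sparkey_hash_getreader(reader));

    sparkey_hash_get(reader, (uint8_t *)key, strlen(key), iter);
    int found = (sparkey_logiter_state(iter) == SPARKEY_ITER_ACTIVE);

    sparkey_logiter_close(&iter);
    sparkey_hash_close(&reader);
    if (found) return i;   /* key found in shard i */
  }
  return -1;               /* key not present in any shard */
}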

In the case of sparkey, I think that creating one log-file per process and then concatenating them before creating the hash-file is the best solution. The concatenation can be done in a single process or, if necessary, parallelized as in a parallel merge-sort or an LSM-tree compaction. The proposed change of supporting multiple files wouldn't require this concatenation work, but I understand it's not trivial.
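
Since each sparkey log-file starts with its own header, the merge is presumably a replay of entries rather than a byte-level concatenation. A rough sketch of that replay using the existing reader/writer API (the append/iterator calls below should be checked against sparkey.h, and this assumes put-only logs with no deletes):

/* Sketch: replay all entries from src_log onto dst_log, then build one hash-file
   over the combined log. Error handling omitted; assumes put-only entries. */
#include <sparkey/sparkey.h>
#include <stdlib.h>

void merge_log_into(const char *dst_log, const char *src_log, const char *dst_hash) {
  sparkey_logwriter *writer;
  sparkey_logwriter_append(&writer, dst_log);   /* reopen dst_log for appending */

  sparkey_logreader *reader;
  sparkey_logiter *iter;
  sparkey_logreader_open(&reader, src_log);
  sparkey_logiter_create(&iter, reader);

  for (;;) {
    sparkey_logiter_next(iter, reader);
    if (sparkey_logiter_state(iter) != SPARKEY_ITER_ACTIVE) break;

    uint64_t keylen = sparkey_logiter_keylen(iter);
    uint64_t valuelen = sparkey_logiter_valuelen(iter);
    uint8_t *key = malloc(keylen);
    uint8_t *value = malloc(valuelen);
    uint64_t actual_keylen, actual_valuelen;
    sparkey_logiter_fill_key(iter, reader, keylen, key, &actual_keylen);
    sparkey_logiter_fill_value(iter, reader, valuelen, value, &actual_valuelen);

    sparkey_logwriter_put(writer, actual_keylen, key, actual_valuelen, value);
    free(key);
    free(value);
  }

  sparkey_logiter_close(&iter);
  sparkey_logreader_close(&reader);
  sparkey_logwriter_close(&writer);

  /* Build the hash-file over the merged log (hash size 0 = automatic). */
  sparkey_hash_write(dst_hash, dst_log, 0);
}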