marsupialtail / quokka

Making data lake work for time series
https://marsupialtail.github.io/quokka/
Apache License 2.0

Disk based hash joins meta thread #39

Open marsupialtail opened 1 year ago

marsupialtail commented 1 year ago

@savebuffer

Steps:

  1. Add C++ build infrastructure for PyArrow plugins.
  2. Support disk spilling via refactoring: https://github.com/apache/arrow/pull/13669/files#diff-8099df49024baabc838e5615bbf8403232678172e089828efe631b99f8adba54
  3. Modify the above so that the hash table stays in memory but holds only disk offsets, not the rows themselves.
  4. Write a C++ plugin for random row lookups in the streaming disk-based hash join.
  5. Test and brag.
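
A rough Python sketch of what steps 3 and 4 are aiming at, for illustration only: the build side is spilled to disk, the in-memory hash table maps each join key to byte offsets, and the probe side streams through doing random row lookups. Everything here is an assumption made for the sketch (the function names, rows as plain dicts, `pickle` as the on-disk format); the real version would work on Arrow batches in a C++ plugin.

```python
import pickle
import tempfile
from collections import defaultdict


def build_offset_index(build_rows, key, spill_file):
    """Spill build-side rows to disk; keep only key -> [byte offset] in memory."""
    index = defaultdict(list)
    for row in build_rows:
        offset = spill_file.tell()      # where this row starts on disk
        pickle.dump(row, spill_file)    # row payload lives on disk only
        index[row[key]].append(offset)
    spill_file.flush()
    return index


def probe(probe_rows, key, index, spill_file):
    """Stream probe rows; fetch matching build rows by random disk lookup."""
    for probe_row in probe_rows:
        for offset in index.get(probe_row[key], ()):
            spill_file.seek(offset)
            build_row = pickle.load(spill_file)
            yield {**build_row, **probe_row}


def disk_hash_join(build_rows, probe_rows, key):
    """Inner join on `key`, with the build side held on disk."""
    with tempfile.TemporaryFile() as spill_file:
        index = build_offset_index(build_rows, key, spill_file)
        # Materialize the result before the spill file is closed.
        return list(probe(probe_rows, key, index, spill_file))
```

Memory use scales with the number of build-side keys and offsets rather than with row width, which is the point of step 3. The cost is one seek per matching row on the probe side, which is what the random-lookup plugin in step 4 would need to make fast.
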

marsupialtail commented 1 year ago

Step 1 is done.

One more step is needed first: add Bloom filters for the probe side.
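
A minimal sketch of the Bloom filter idea, assuming the filter is built from the build-side keys and used to drop probe rows that cannot match before they reach the hash table or the disk. The class, its parameters (`num_bits`, `num_hashes`) and `prefilter_probe` are hypothetical names for illustration, not Quokka or Arrow APIs.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a BLAKE2b digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.blake2b(repr(key).encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # keep the stride odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key)
        )


def prefilter_probe(probe_rows, key, bloom):
    """Yield only the probe rows whose key might exist on the build side."""
    for row in probe_rows:
        if bloom.might_contain(row[key]):
            yield row
```

A Bloom filter never gives false negatives, so no matching probe row is lost. False positives only cost an extra lookup that finds nothing, which matters more once each lookup can mean a disk seek.
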