marsupialtail / quokka

Making data lake work for time series
https://marsupialtail.github.io/quokka/
Apache License 2.0
1.1k stars, 60 forks

reduce initialization overhead - meta thread #12

Closed (marsupialtail closed 1 year ago)

marsupialtail commented 1 year ago

Quokka currently has very high initialization overhead for launching input reader actors.

This should go away when we migrate to an architecture with only one actor per machine, but until then it would be good to have a way to reduce this initialization overhead.

This is similar to Spark's early optimization of moving input partitioning off the coordinator so that it runs in parallel across the workers.

The problem is mainly twofold:

Our architecture change should take care of the second one but won't solve the first. We should probably parallelize this process and have the worker nodes, which should already be alive while all this is happening, do most of the work and send the results back to the master through the Ray object store or something similar.
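A minimal sketch of that idea, not Quokka's implementation: the coordinator hands each input file to a worker, the workers compute partition metadata in parallel, and only the small results travel back. `concurrent.futures` is swapped in for Ray tasks and the object store so the sketch runs without a cluster. `plan_partitions`, `read_file_metadata`, the file names and the sizes are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-file metadata a reader would fetch (e.g. file sizes).
FAKE_FILE_SIZES = {"part-0.parquet": 250, "part-1.parquet": 100, "part-2.parquet": 0}

def read_file_metadata(path, chunk_size=100):
    """Worker-side step: split one file into (path, offset, length) partitions."""
    size = FAKE_FILE_SIZES[path]
    return [(path, off, min(chunk_size, size - off)) for off in range(0, size, chunk_size)]

def plan_partitions(paths, num_workers=4):
    """Coordinator-side step: fan the per-file work out, then gather the small results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_file = list(pool.map(read_file_metadata, paths))  # order matches `paths`
    # Flatten the per-file lists into one partition list for the scheduler.
    return [part for parts in per_file for part in parts]

partitions = plan_partitions(sorted(FAKE_FILE_SIZES))
```

With Ray, `read_file_metadata` would presumably become a remote task and the coordinator would collect the results with `ray.get`, but that wiring is an assumption rather than something this thread shows.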

marsupialtail commented 1 year ago

That said, Quokka is still useful for very long jobs, where this init overhead (< 1 min) becomes insignificant.

marsupialtail commented 1 year ago

E.g. this is executing locally on a Parquet dataset of ~500MB:

Parquet dataset at /home/ziheng/tpc-h/lineitem.parquet has total 6001215 rows
actor spin up took 4.991477727890015
init time 7.50970721244812
run time 1.1990654468536377
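To put those numbers in proportion, here is a small calculation. It assumes the three timings are in seconds and cover disjoint phases, which the log does not state, and the one-hour job is a hypothetical comparison point.

```python
# Timings copied from the log output above, assumed to be in seconds.
actor_spin_up = 4.991477727890015
init_time = 7.50970721244812
run_time = 1.1990654468536377

def overhead_fraction(run_seconds, overhead_seconds=actor_spin_up + init_time):
    """Share of total wall time spent before the query starts running."""
    return overhead_seconds / (overhead_seconds + run_seconds)

short_job = overhead_fraction(run_time)  # the local run above
long_job = overhead_fraction(3600.0)     # hypothetical one-hour job, same overhead
```

Under that assumption, startup accounts for roughly nine tenths of the short run and well under 1% of the hour-long one.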

marsupialtail commented 1 year ago

Initialization time has drastically improved. However, we still need to make sure we don't relaunch TaskManagers every time a new TaskGraph is instantiated -- TaskManagers should be tied to a QuokkaContext, not a TaskGraph.
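The ownership change could look roughly like the sketch below. The classes are hypothetical stand-ins named after the roles in this thread, not Quokka's actual `QuokkaContext`, `TaskGraph` or `TaskManager`; the point is only that the context launches the managers once and every graph borrows them.

```python
class TaskManager:
    """Hypothetical stand-in for an expensive-to-launch worker-side actor."""
    launches = 0  # counts how many times the launch cost is paid

    def __init__(self, node):
        TaskManager.launches += 1
        self.node = node

class Context:
    """Owns the TaskManagers so that they outlive any single graph."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._managers = None

    def task_managers(self):
        # Launch lazily on first use, then reuse for every later graph.
        if self._managers is None:
            self._managers = [TaskManager(n) for n in self.nodes]
        return self._managers

class Graph:
    """Borrows managers from its context instead of launching its own."""

    def __init__(self, context):
        self.managers = context.task_managers()

ctx = Context(["node-a", "node-b"])
g1, g2 = Graph(ctx), Graph(ctx)  # the second graph reuses the first graph's managers
```

Creating a second graph from the same context adds no launches, which is the property that matters once one query instantiates several graphs.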

marsupialtail commented 1 year ago

This just got much more important due to the way Quokka executes multi-stage queries.

marsupialtail commented 1 year ago

https://github.com/marsupialtail/quokka/commit/fa335cf6c755c705e968cfed309ab1e2c92c8c53