man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Enhancement 1895: Fully parallelise processing in read_batch #1950

Closed by alexowens90 2 weeks ago

alexowens90 commented 3 weeks ago

Reference Issues/PRs

Closes #1895
Fixes https://github.com/man-group/arcticdb-man/issues/171
Fixes #1936
Fixes #1939
Fixes #1940

What does this implement or fix?

Schedules all work asynchronously in batch reads when processing is involved, as well as when all symbols are read directly. Previously, symbols were processed sequentially, which left CPUs idle when processing many small symbols.

This works by making read_frame_for_version schedule work and return futures, rather than performing the processing itself. The same implementation can then be used for all four combinations of batch/non-batch and direct/with-processing reads. This significantly simplifies the code and removes the now-redundant async_read_direct. Having two different implementations that achieved effectively the same thing is what led to two of the bugs in the first place.
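For readers unfamiliar with the pattern, here is a minimal Python sketch of the idea of "schedule work and return futures". It is not ArcticDB's code: the in-memory `STORAGE` dict and the names `schedule_read`, `read`, `read_batch` and `double` are hypothetical stand-ins, and the standard library's `concurrent.futures` is assumed as the scheduler. It only illustrates how one scheduling function can serve all four combinations of batch/non-batch and direct/with-processing reads.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory "storage": symbol name -> stored values.
STORAGE: Dict[str, List[int]] = {
    "a": [1, 2, 3],
    "b": [10, 20],
    "c": [5],
}

Processing = Optional[Callable[[List[int]], List[int]]]


def schedule_read(pool: ThreadPoolExecutor, symbol: str,
                  processing: Processing = None) -> Future:
    """Schedule a read (plus optional processing) and return a future at once."""
    def task() -> List[int]:
        data = list(STORAGE[symbol])  # stands in for fetching from storage
        return processing(data) if processing is not None else data
    return pool.submit(task)


def read(pool: ThreadPoolExecutor, symbol: str,
         processing: Processing = None) -> List[int]:
    """Non-batch read: schedule one symbol, then wait for it."""
    return schedule_read(pool, symbol, processing).result()


def read_batch(pool: ThreadPoolExecutor, symbols: List[str],
               processing: Processing = None) -> List[List[int]]:
    """Batch read: schedule every symbol first, then collect results in order."""
    futures = [schedule_read(pool, s, processing) for s in symbols]
    return [f.result() for f in futures]


def double(values: List[int]) -> List[int]:
    """Hypothetical processing step."""
    return [2 * v for v in values]


with ThreadPoolExecutor(max_workers=4) as pool:
    direct_single = read(pool, "a")
    processed_single = read(pool, "a", double)
    direct_batch = read_batch(pool, ["a", "b", "c"])
    processed_batch = read_batch(pool, ["a", "b", "c"], double)
```

Because `read_batch` submits every task before waiting on any of them, no symbol's processing has to wait for the previous symbol to finish, and the single and batch paths share one code path instead of two.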

Several bugs that were discovered during the implementation (flagged above) have also been fixed.

Further work in this area is covered in #1968.