viking-sudo-rm / rusty-dawg

Rust library for indexing and quickly searching large pretraining corpora
https://arxiv.org/abs/2406.13069
MIT License

Optimize traversal for computing node arities #114

Closed viking-sudo-rm closed 4 months ago

viking-sudo-rm commented 4 months ago

This feature is required to collect data for the design of #99.

The original Python implementation was very slow (it would have taken ~80 days).

This is way too slow, especially because cdawg.node_count(), which implements very similar logic, takes just a couple of minutes. To make it faster, we could make the following improvements:

viking-sudo-rm commented 4 months ago

Implemented in 3214527.

This sped up the traversal of the Pile CDAWG compared to the Python implementation, but it is still slow (tqdm estimates 400 hours). Random disk access seems to be the real bottleneck.
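
If random disk access really is the bottleneck, one possible direction is to compute arities with a single sequential scan instead of a graph traversal. Below is a minimal sketch of that idea. It assumes the edges can be read in storage order as a flat list of (source, target) node indices, which may not match how the CDAWG is actually laid out on disk. `node_arities` and `arity_histogram` are hypothetical names for illustration, not part of the rusty-dawg API.

```rust
use std::collections::BTreeMap;

/// Count outgoing edges per node in one linear pass over a flat edge list.
/// ASSUMPTION: `edges` is a hypothetical list of (source, target) node
/// indices read sequentially; this is not the rusty-dawg storage format.
/// Panics if a source index is >= `num_nodes`.
fn node_arities(num_nodes: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    let mut arities = vec![0usize; num_nodes];
    for &(source, _target) in edges {
        // Only the source matters for out-degree, so no node is ever
        // looked up by following a pointer.
        arities[source] += 1;
    }
    arities
}

/// Build a histogram mapping each arity to the number of nodes with it.
fn arity_histogram(arities: &[usize]) -> BTreeMap<usize, usize> {
    let mut hist = BTreeMap::new();
    for &arity in arities {
        *hist.entry(arity).or_insert(0) += 1;
    }
    hist
}
```

For example, calling `node_arities(4, &[(0, 1), (0, 2), (1, 3), (2, 3)])` and passing the result to `arity_histogram` counts one node with two outgoing edges, two nodes with one, and one node with none. The scan reads the edge list front to back, so disk reads stay sequential, but the counter array itself is indexed by node and has to fit in memory.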