sambanova / generative_data_prep

Apache License 2.0

Progress Bar #36

Closed · snova-zoltanc closed this issue 8 months ago

snova-zoltanc commented 1 year ago

Tokenizing huge datasets can take a long time, and it is very frustrating that we cannot see progress or tell whether anything is even happening.

This is not trivial to implement because tokenization runs in multiple worker processes, but it can be done with shared memory.
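A minimal sketch of one shared-memory approach (not necessarily what the eventual PR implements): each worker process increments a `multiprocessing.Value` counter as it tokenizes items, and the main process can poll that counter to render a progress bar. The `worker`, `run_with_progress`, and stand-in tokenization below are all hypothetical names for illustration.

```python
import multiprocessing as mp

def worker(items, counter):
    """Tokenize a chunk of items, bumping the shared counter per item."""
    for item in items:
        _ = str(item).split()  # stand-in for real tokenization work
        with counter.get_lock():  # lock guards the cross-process increment
            counter.value += 1

def run_with_progress(data, num_workers=4):
    counter = mp.Value("i", 0)  # shared-memory progress counter
    chunks = [data[i::num_workers] for i in range(num_workers)]
    procs = [mp.Process(target=worker, args=(c, counter)) for c in chunks]
    for p in procs:
        p.start()
    # While workers run, the main process can poll counter.value to
    # draw a progress bar, e.g.: print(f"{counter.value}/{len(data)}")
    for p in procs:
        p.join()
    return counter.value

if __name__ == "__main__":
    total = run_with_progress(list(range(100)))
    print(total)
```

Because the counter lives in shared memory, the parent sees updates immediately without any per-item inter-process messaging.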

snova-zoltanc commented 8 months ago

Completed with https://github.com/sambanova/generative_data_prep/pull/52