Closed — snova-zoltanc closed this 8 months ago
Tokenizing huge datasets can take a long time, and it is very frustrating that we cannot see any progress or even confirm that anything is happening.
This is not trivial to implement because tokenization runs in multiple worker processes, but it can be done by reporting progress through shared memory (see the sketch below).
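A minimal sketch of the idea, not the actual generative_data_prep implementation: each worker increments a shared counter as it tokenizes, and the main process polls that counter to drive a progress bar. The helper names (`tokenize_chunk`, the placeholder chunks) are hypothetical.

```python
import multiprocessing as mp
import time

from tqdm import tqdm


def tokenize_chunk(chunk, counter):
    """Tokenize one chunk of records, bumping the shared counter per item."""
    for _item in chunk:
        # ... real tokenization would happen here ...
        with counter.get_lock():
            counter.value += 1


def main():
    chunks = [list(range(1000)) for _ in range(8)]  # placeholder data
    total = sum(len(c) for c in chunks)

    # 'i' = signed int stored in shared memory, visible to all worker processes.
    counter = mp.Value("i", 0)

    workers = [mp.Process(target=tokenize_chunk, args=(c, counter)) for c in chunks]
    for w in workers:
        w.start()

    # The main process polls the shared counter and renders the progress bar,
    # so the workers never have to coordinate around a single tqdm instance.
    with tqdm(total=total, desc="Tokenizing") as bar:
        while any(w.is_alive() for w in workers):
            bar.n = counter.value
            bar.refresh()
            time.sleep(0.5)
        bar.n = counter.value
        bar.refresh()

    for w in workers:
        w.join()


if __name__ == "__main__":
    main()
```

Polling from the parent keeps the workers free of any display logic; the only cross-process state is the single shared integer.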
Completed with https://github.com/sambanova/generative_data_prep/pull/52