Open ryan-minato opened 4 months ago
I have extracted the WARC-reading part of Hugging Face's newly released Datatrove framework and created a Datasource as a reference.
Unfortunately, I don't have the time to handle the integration and CI pipeline.
Description
Add a Datasource for reading data from WARC/ARC files.
Use case
When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution (Dask appears to be less suitable for this task). However, I noticed that Ray Data doesn't have a convenient way to access data from Common Crawl. Adding a Datasource for reading WARC/ARC data would be helpful.
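To make the request concrete, here is a minimal sketch of the record-parsing step such a Datasource would need. It is not the Datatrove code or an existing Ray Data API: the function name `iter_warc_records` and the output dict layout are hypothetical, it uses only the standard library, and it handles WARC only (not the older ARC format). A real implementation would more likely rely on a dedicated WARC parsing library.

```python
import gzip
from typing import BinaryIO, Dict, Iterator


def open_warc(path: str) -> BinaryIO:
    """Open a WARC file, transparently handling .gz compression (hypothetical helper)."""
    if path.endswith(".gz"):
        # Python's gzip module reads concatenated gzip members as one stream,
        # which is how per-record-compressed WARC files are laid out.
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_warc_records(stream: BinaryIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per WARC record from an uncompressed byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        # Header block: "Name: value" lines terminated by an empty line.
        headers: Dict[str, str] = {}
        while True:
            header_line = stream.readline()
            if not header_line.strip():
                break
            name, _, value = header_line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", "0"))
        payload = stream.read(length)
        yield {"version": version, "headers": headers, "payload": payload}
```

Each yielded dict is one row candidate (for example the `WARC-Target-URI` header plus the payload bytes). A Datasource could run this per file so that files are read in parallel across workers.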