ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.24k stars 5.62k forks

[Data] Add WarcDatasource for reading WARC/ARC files #45535

Open ryan-minato opened 4 months ago

ryan-minato commented 4 months ago

Description

Add a Datasource for reading data from WARC/ARC files.

Use case

When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution (Dask appears to be less suitable for this task). However, I noticed that Ray Data doesn't have a convenient way to read data from Common Crawl. Adding a Datasource for reading WARC/ARC data would be helpful.
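
For illustration, here is a minimal sketch of the record-level parsing such a Datasource would need. It is not the code from the linked PR and it does not use any Ray API: it assumes an uncompressed WARC stream, skips ARC entirely, and the function name `iter_warc_records` and the output keys (`warc_type`, `target_uri`, `content`) are hypothetical choices made for this example.

```python
import io
from typing import BinaryIO, Dict, Iterator


def iter_warc_records(stream: BinaryIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per record from an uncompressed WARC stream.

    Hypothetical helper for illustration only; ARC files are not handled.
    """
    while True:
        # Skip the blank lines that separate records; stop at end of stream.
        line = stream.readline()
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # Named header fields run until the first empty line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        length = int(headers["Content-Length"])
        yield {
            "warc_type": headers.get("WARC-Type"),
            "target_uri": headers.get("WARC-Target-URI"),
            "content": stream.read(length),
        }


def _make_record(warc_type: str, uri: str, payload: bytes) -> bytes:
    """Build a tiny synthetic WARC record so the sketch can be run offline."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


sample = _make_record("response", "https://example.com/", b"hello") + _make_record(
    "request", "https://example.com/a", b""
)
rows = list(iter_warc_records(io.BytesIO(sample)))
```

Common Crawl distributes `.warc.gz` files, so in practice the stream would first be wrapped with something like the standard library's `gzip.open(path, "rb")`. Each yielded dict is shaped like a row, which is roughly what a Datasource would need to emit per record; how the files are split into parallel read tasks is left open here.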

ryan-minato commented 4 months ago

I have extracted the WARC-reading code from Hugging Face's newly released DataTrove framework and created a Datasource as a reference.

Unfortunately, I don't have the time to handle the integration and CI pipeline.

#45536