pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

unify data loading from HF and from disk #287

Closed tianyu-l closed 2 months ago

tianyu-l commented 2 months ago


As titled. We can just use the load_dataset HF API to unify different use cases.

  1. load_dataset is flexible in that it can take either a HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same. It supports common data formats such as .txt, .json, .json.gz, .csv, and .parquet.
  2. According to this post,

    load_dataset works in three steps: download the dataset, then prepare it as an arrow dataset, and finally return a memory mapped arrow dataset. In particular it creates a cache directory to store the arrow data and the subsequent cache files for map.

  3. The previously used load_from_disk can only load a dataset saved by save_to_disk (in arrow format), so it is effectively a way to load a "preprocessed" dataset:

    load_from_disk directly returns a memory mapped dataset from the arrow file (similar to Dataset.from_file). It doesn't create a cache directory, instead all the subsequent map calls write in the same directory as the original data.

  4. For a large dataset (one that cannot fit in memory), we need to set streaming=True for load_dataset, even if it is stored in a local directory. One might think load_from_disk is better because of point 3 above; however, to preprocess the huge dataset and call save_to_disk, one needs to load it in memory in the first place.
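
To make points 3 and 4 concrete, here is a minimal sketch of the save_to_disk / load_from_disk round trip. The helper name `arrow_round_trip` and the sample records are hypothetical and are not part of this PR; the sketch only illustrates that a Dataset object has to exist before the arrow directory can be written.

```python
import tempfile


def arrow_round_trip(records, out_dir):
    """Materialize `records` as a Dataset, save it in arrow format, and reload it."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import Dataset, load_from_disk

    # save_to_disk is a method on a Dataset, so the data must be materialized first.
    ds = Dataset.from_dict(records)
    ds.save_to_disk(out_dir)

    # load_from_disk only understands directories written by save_to_disk.
    return load_from_disk(out_dir)


if __name__ == "__main__":
    sample = {"text": ["hello", "world"]}  # hypothetical toy data
    with tempfile.TemporaryDirectory() as tmp:
        reloaded = arrow_round_trip(sample, tmp)
        print(reloaded.num_rows)
```

Raw .json or .csv files never pass through save_to_disk, so load_from_disk cannot read them directly.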

For all the reasons listed above, let's not use load_from_disk, which assumes preprocessed data in arrow format.

Let's use load_dataset, which supports common data formats, and set streaming=True for large datasets, regardless of whether the data comes from HF or from local disk.
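
A minimal sketch of the unified call, under stated assumptions: `write_sample_json` and `load_unified` are hypothetical helpers written for illustration (they are not the code in this PR), the toy records stand in for real data, and the split is assumed to be named "train".

```python
import json
import os
import tempfile


def write_sample_json(directory, num_records=3):
    """Write a tiny JSON Lines file (one object per line) as stand-in data."""
    path = os.path.join(directory, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        for i in range(num_records):
            f.write(json.dumps({"text": f"document {i}"}) + "\n")
    return path


def load_unified(path_or_repo, streaming=True, **kwargs):
    """Load from a HF hub repository or a local directory with the same call."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # With streaming=True, examples are read lazily instead of being
    # downloaded and prepared as a full arrow dataset up front.
    return load_dataset(path_or_repo, split="train", streaming=streaming, **kwargs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_sample_json(tmp)
        for example in load_unified(tmp):  # local directory
            print(example["text"])
```

The same helper accepts a hub repository name in place of the local directory. Hub datasets with multiple configurations typically need extra arguments (for example a `name`), which can be passed through `**kwargs`.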

P.S.:

  1. This PR updates the data file from arrow to json, while keeping the same data (first 45,000 entries of c4).
  2. c4 is now available for running large-scale experiments. Performance verified.