huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Adding memmap input data pipelines #102

Closed TJ-Solergibert closed 6 months ago

TJ-Solergibert commented 8 months ago

What does this PR do?

Solves #45

The current version of Nanotron's input pipelines is based on Hugging Face Datasets and relies on clm_preprocess, which tokenizes and preprocesses the entire dataset at the beginning of training (and, since the result is tied to the sequence_length, it is hard to reuse across different experiments).
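
For context, the old preprocessing essentially tokenizes everything up front and regroups the tokens into fixed-size blocks. A minimal sketch of that idea (illustrative names, not the actual clm_preprocess code):

# Illustrative CLM-style preprocessing: tokenize the whole dataset, then
# regroup into blocks of sequence_length. Because the block size is baked
# in here, the cached result cannot be reused for another sequence_length.
def clm_style_preprocess(texts, tokenizer, sequence_length):
    all_tokens = []
    for text in texts:
        all_tokens.extend(tokenizer(text)["input_ids"])
    # Drop the remainder so every sample has exactly sequence_length tokens
    total = (len(all_tokens) // sequence_length) * sequence_length
    return [all_tokens[i : i + sequence_length] for i in range(0, total, sequence_length)]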

I have developed new data input pipelines based on those included in Megatron. Since I didn't want Nanotron to lose its essence, I removed many functionalities that we don't need (such as those related to BERT model pretraining). What I mainly modified is the torch.utils.data.Dataset; we continue to work with the same Sampler, Collator and DataLoader (with slight modifications), so the behavior of other modules like the PipelineEngine is not altered at all. It also allows us to keep using the previous pipeline based on Hugging Face Datasets, since I added a separate script, run_train_nanoset.py, to launch training with the new pipeline.

Relevant details:

I think we should centralize the input data pipelines and perhaps move the dataloader.py file to another location. I have also marked several functions in this file with # Question: comments, proposing to move them elsewhere.

To use the Nanoset datasets, you need to specify the data_path and split fields in config.data.dataset in the .yaml file and use the script run_train_nanoset.py in the same way as run_train.py.

data:
  dataset:
    data_path: /mloscratch/homes/solergib/s-ai/nanotron/datasets/llama2/europarl-gpt-llama2_text_document
    split: 949,50,1
  num_loading_workers: 0
  seed: 1234
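
With that in place, launching mirrors run_train.py (the config path here is illustrative):

torchrun --nproc-per-node 4 run_train_nanoset.py --config examples/config_nanoset.yaml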

I've published the wandb logs of the different tests I have carried out, comparing the HF Datasets pipeline and the new Nanosets with 1 and 4 GPUs, as well as resuming training from a checkpoint.

This is a first version; I am open to all suggestions you can think of! I have named it Nanoset, but as I said, feel free to change it!

Toni

NouamaneTazi commented 8 months ago

Amazing job 🚀 Feel free to request reviews once you feel the PR is ready!

TJ-Solergibert commented 8 months ago

Hello! Everything is ready now. I've added the following:

In the docs, I've added instructions for using the new datasets. As you can see, it maintains compatibility with the original input data pipelines and with the rest of the project. Basically, the only thing I've modified is the torch.utils.data.Dataset, and we continue to work with practically the same collators and samplers and the same dataloader.

Some small comments and doubts that have arisen:

In short: NanosetBuilder is responsible for building the train, valid, and test datasets. Depending on whether we specify a blend of multiple datasets or a single one, it will create either a BlendedNanoset or a Nanoset for each split. Each Nanoset contains an MMapIndexedDataset, which reads the token bytes from the files generated by preprocess_data.py, while the Nanoset itself holds the positions to read from the MMapIndexedDataset for each sample. The BlendedNanoset contains one Nanoset per specified path and draws the samples to extract from each one according to the specified weights.

Looking forward to hearing your feedback!

Toni

cc @NouamaneTazi

TJ-Solergibert commented 8 months ago

Hello!

I have simplified the dataset builder, the __getitem__ method from Nanoset, and fixed a minor bug with the parser that couldn't properly identify the NanosetDatasetsArgs.

Toni

xrsrke commented 8 months ago

@TJ-Solergibert Hi. Thanks for the fantastic PR. Would be cool if we can add some unit tests for build_nanoset_dataloader, NanosetBuilder(...).build() and BlendedNanoset!

TJ-Solergibert commented 7 months ago

Hello! I've just added the tests.

I tried to stick to the design of the other tests in the repository, but it was also the first time I was developing one 😅. In the end, I've designed a script that tests everything I've included: it starts by creating a .json file like the ones the preprocess_data.py script expects and processes it to generate the .idx and .bin files containing the tokens. Then, we verify that we can create each type of Nanoset (Nanoset and BlendedNanoset) and create the Dataloader. Finally, we check that the content of the batches in every process is correct.

In this verification (assert_batch_dataloader), we ensure that the content of each element of the batch (input_ids, input_mask, label_ids & label_mask) is exactly the same across processes within the same tensor parallel group, and also that the class of each element (distinguishing between tensors with ids, tensors with masks, and TensorPointer) is exactly the same across processes within the same data parallel group.

I have run the pytest locally on a cluster with up to 8 GPUs, and the results were satisfactory (it hurt a little to waste the GPUs, but during development I applied a small patch to be able to start the dist group with "gloo").

As always, I expect your comments!

Toni

TJ-Solergibert commented 7 months ago

Hello!

I think it's a good time for a review. I've further cleaned up the project and made sure to thoroughly document the operation of the Nanosets. Below, I provide all the organized information.

Nanoset

Nanosets are a new type of dataset for Nanotron inspired by those of Megatron. They retain Megatron's performance by dispensing with unnecessary features, while (hopefully) preserving the essence of Nanotron. In essence, a Nanoset is just a torch.utils.data.Dataset, so we maintain compatibility with the rest of the project by slightly modifying some aspects related to data loading, such as the collator or the DistributedSampler. Inside, each Nanoset has an MMapIndexedDataset from which we extract the tokens to build the samples. The main task of the Nanoset is to control the logic for constructing the samples from the tokens of the MMapIndexedDataset.
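
As a minimal sketch of that idea (illustrative names and a simplified indexing scheme, assuming a memory-mapped token array and a precomputed shuffle index; not the actual implementation):

import numpy as np

class NanosetSketch:
    def __init__(self, tokens, shuffle_idx, sequence_length):
        self.tokens = tokens                # e.g. a np.memmap over the .bin file
        self.shuffle_idx = shuffle_idx      # permutation of sample positions
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.shuffle_idx)

    def __getitem__(self, idx):
        # Read sequence_length + 1 tokens so inputs and shifted labels
        # can be built from a single window.
        start = self.shuffle_idx[idx] * self.sequence_length
        window = np.asarray(self.tokens[start : start + self.sequence_length + 1])
        return {"input_ids": window[:-1], "label_ids": window[1:]}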

BlendedNanoset

The BlendedNanoset is used to create a mixture of Nanosets by specifying a weight for each of them. Essentially, the BlendedNanoset is only responsible for ensuring that the dataset indices comply with the specified blend, as the samples themselves are extracted from the Nanosets.
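
The blending idea can be sketched like this (a simplification of build_blending_indices, not the actual code: for each output sample, pick the dataset currently most under-represented with respect to its weight):

import numpy as np

def blend_indices(weights, size):
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    counts = np.zeros(len(weights), dtype=np.int64)
    dataset_index = np.empty(size, dtype=np.int64)         # which Nanoset to read from
    dataset_sample_index = np.empty(size, dtype=np.int64)  # which sample within it
    for i in range(size):
        d = int(np.argmax(weights * (i + 1) - counts))
        dataset_index[i] = d
        dataset_sample_index[i] = counts[d]
        counts[d] += 1
    return dataset_index, dataset_sample_index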

NanosetDatasetsArgs

To use the Nanosets, I have added a new configuration to the config .yaml file (NanosetDatasetsArgs). The user only needs to enter the dataset(s) they want to use, how they want to divide the dataset into train, valid, and test partitions, and optionally a directory to store Nanoset metadata for reusing the same configuration across different runs.
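
For a blend of datasets, the configuration could look something like the sketch below; the exact field names for the weights and the metadata directory are hypothetical here, so check the docs in the PR for the real schema:

data:
  dataset:
    data_path:                             # hypothetical blend syntax
      - path: datasets/yelp_review_full_text_document
        weight: 0.7
      - path: datasets/testing_alpaca_small_text_document
        weight: 0.3
    split: 949,50,1
    path_to_cache: .nanoset_cache          # hypothetical name for the metadata directory
  num_loading_workers: 0
  seed: 1234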

Tools

Added tools for preprocessing the datasets into the .idx and .bin files that the Nanosets consume (tools/preprocess_data.py).

Docs

In the docs, I've added instructions for preprocessing the data, using the Nanosets, and how they work under the hood. I strongly recommend taking a look at the examples of how the samples are constructed to understand the operation of the Nanosets.

Short summary

In short: NanosetBuilder is responsible for building the train, valid, and test datasets. Depending on whether we specify a blend of multiple datasets or a single one, it will create either a BlendedNanoset or a Nanoset for each split. Each Nanoset contains an MMapIndexedDataset, which reads the token bytes from the files generated by preprocess_data.py, while the Nanoset itself holds the positions to read from the MMapIndexedDataset for each sample. The BlendedNanoset contains one Nanoset per specified path and draws the samples to extract from each one according to the specified weights.
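
Schematically, the builder's dispatch looks like this (a sketch assuming the Nanoset, BlendedNanoset, and MMapIndexedDataset classes from this PR; the real NanosetBuilder also handles index caching):

def build_datasets(data_paths, weights, splits, sequence_length):
    # One (Blended)Nanoset per split (train, valid, test)
    datasets = []
    for split in splits:
        if len(data_paths) == 1:
            datasets.append(Nanoset(MMapIndexedDataset(data_paths[0]), split, sequence_length))
        else:
            nanosets = [Nanoset(MMapIndexedDataset(p), split, sequence_length) for p in data_paths]
            datasets.append(BlendedNanoset(nanosets, weights))
    return datasets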

Tests

I tried to stick to the design of the other tests in the repository, but it was also the first time I was developing one 😅. In the end, I've designed a script that tests everything I've included: it starts by creating a .json file like the ones the preprocess_data.py script expects and processes it to generate the .idx and .bin files containing the tokens. Then, we verify that we can create each type of Nanoset (Nanoset and BlendedNanoset) and create the Dataloader. Finally, we check that the content of the batches in every process is correct.

In this verification (assert_batch_dataloader), we ensure that the content of each element of the batch (input_ids, input_mask, label_ids & label_mask) is exactly the same across processes within the same tensor parallel group, and also that the class of each element (distinguishing between tensors with ids, tensors with masks, and TensorPointer) is exactly the same across processes within the same data parallel group. Also, in the case of BlendedNanoset, we verify that it is composed of more than one Nanoset and that each Nanoset has enough samples to satisfy the BlendedNanoset.
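
The core of that cross-rank check can be sketched like this (a simplification of assert_batch_dataloader; tp_group is illustrative):

import torch
import torch.distributed as dist

def assert_same_across_group(tensor, tp_group):
    # Gather the batch tensor from every rank in the tensor parallel group
    # and assert that all ranks hold exactly the same values.
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size(tp_group))]
    dist.all_gather(gathered, tensor, group=tp_group)
    for other in gathered:
        assert torch.equal(tensor, other), "Batch differs across TP ranks"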

Python vs C++

In the last commit, I replaced the C++ functions we were using with Python implementations. As expected, and as seen in this notebook, the C++ functions are dramatically faster than the Python ones, which can even block the training start for minutes. For the function used to create the indices of the BlendedNanoset (build_blending_indices), Python spends 44 seconds to create a dataset of 1e7 samples and over 7 minutes for 1e8 samples, while C++ only takes 0.1 and 1 second, respectively. For the function used in each Nanoset to build sequence-length samples from the documents (build_sample_idx), Python spends 35 seconds on 1e7 documents (each entry in the json file is considered a document), while the same function in C++ takes 0.5 seconds. I believe it is well worth keeping the C++ implementations.
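
To make the gap concrete, the Python version of build_sample_idx is essentially a per-sample loop like the following sketch (simplified from Megatron's logic; the exact bookkeeping differs), and tight loops like this are exactly what C++ accelerates:

import numpy as np

def build_sample_idx_py(sizes, doc_idx, seq_length, num_samples):
    # For each sample, record the document and offset where it starts,
    # walking through the (shuffled) documents token by token.
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)
    doc_pos, offset = 0, 0
    for i in range(1, num_samples + 1):
        remaining = seq_length
        while remaining > 0:
            available = sizes[doc_idx[doc_pos]] - offset
            if available > remaining:
                offset += remaining
                remaining = 0
            else:
                remaining -= available
                doc_pos += 1
                offset = 0
        sample_idx[i] = (doc_pos, offset)
    return sample_idx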

Other comments

That would be all, tell me what you think!

Toni

TJ-Solergibert commented 7 months ago

Hello!

I have finally polished the two remaining fronts: MMapIndexedDataset and preprocess_data.py.

And where did all the sample and document index machinery from the Nanoset go? We actually don't need it, as the only thing that interests us is extracting sequence length + 1 tokens from the mmap dataset; this way, we save a lot of time in the index-building process. Now we simply look at how many tokens our mmap array holds; from the number of tokens and the sequence length we compute the total number of samples we can generate from the array and divide them into train, valid, and test.
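
A quick worked example of this token-level accounting (the exact formula in the code may differ slightly; this is just the idea):

# With a .bin file holding 1,000,000 tokens and sequence_length = 1024,
# each sample consumes sequence_length tokens plus one extra token for
# the shifted labels, so:
num_tokens = 1_000_000
sequence_length = 1024
num_samples = (num_tokens - 1) // sequence_length  # 976 samples
# These samples are then divided into train, valid, and test according
# to the configured split (e.g. 949,50,1).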

What we do keep, in order to access the Nanoset indices correctly, is the shuffle index and the same logic we used before to build it (concatenating shuffled indices in the train split to ensure that we have enough samples), since it is worth remembering that the 3 splits access the same MMapIndexedDataset, just at different positions.
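
The shuffle index construction can be sketched as follows (illustrative; the real code also handles the per-split offsets into the MMapIndexedDataset):

import numpy as np

def build_train_shuffle_idx(samples_in_split, samples_needed, seed):
    rng = np.random.RandomState(seed)
    # Concatenate as many shuffled passes over the split as needed to
    # cover the requested number of training samples.
    num_epochs = -(-samples_needed // samples_in_split)  # ceil division
    shuffle_idx = np.concatenate([rng.permutation(samples_in_split) for _ in range(num_epochs)])
    return shuffle_idx[:samples_needed]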

I have updated the docs and here I have uploaded the logs with the same configuration that I published a few days ago. The only thing that changes is the number of samples per Nanoset split, as we now do the division at the token level and not at the document level. I have also added a new test to verify that we are able to reset the state of the dataloader when restarting training. I have removed the license, as we do not depend on any other project. In the end, we only rely on numpy.memmap like many other projects (Megatron, OLMo, etc.).

Toni

TJ-Solergibert commented 7 months ago

Hello!

Thank you for your comments, I will review them later. Regarding what you mentioned, you have to start the training with run_train_nanoset.py and not run_train.py. If you prefer, I can merge everything into a single script.

Toni

TJ-Solergibert commented 7 months ago

Hello!

I have integrated the Nanosets into run_train.py and added a yaml config file to run experiments. Beforehand, it is necessary to download and preprocess two datasets: I have chosen yelp_review_full and HuggingFaceH4/testing_alpaca_small from the Hugging Face Hub, and we use the GPT2 tokenizer.

We download and preprocess the datasets as follows:

python3 tools/preprocess_data.py \
       --input yelp_review_full \
       --split train \
       --output-prefix datasets/yelp_review_full \
       --pretrained-model-name-or-path gpt2 \
       --num-workers 16
python3 tools/preprocess_data.py \
       --input HuggingFaceH4/testing_alpaca_small \
       --split train \
       --column completion \
       --output-prefix datasets/testing_alpaca_small \
       --pretrained-model-name-or-path gpt2 \
       --num-workers 16

We launch the job with:

torchrun --nproc-per-node 4 run_train.py --config examples/config_nanoset.yaml

I have tested it with a setup with 4 GPUs.

Toni.

Update: I've seen that some tests have failed: 2 from test_build_nanoset_dataloader and 1 from test_recover_nanoset_dataloader. The errors aren't assertion errors but torch distributed errors complaining about binding sockets. I also experienced these errors in my setup, but reducing the number of workers a bit further solved the issue.

I've pushed another version that keeps the 2 tests separate but reduces the number of parametrized configs, so we run fewer tests. From what I've experienced, the tests fail with this PyTorch error about 1% of the time, and always in a different config. The last resort would be returning to a single test, which passed without any issues.

Update 2: Adding back the @rerun_if_address_is_in_use() decorator solves all the issues.
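
For reference, the decorator is applied on top of the distributed tests, roughly like this (the import path from the repo's test helpers is assumed here):

from helpers.utils import rerun_if_address_is_in_use  # import path assumed

@rerun_if_address_is_in_use()
def test_build_nanoset_dataloader():
    ...  # spawns the distributed workers; reruns if the port is already bound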

TJ-Solergibert commented 7 months ago

Hi!

Throughout this week, I'll check all your recommendations, thank you!

transformers and datasets are necessary for data preprocessing (mainly transformers, for the tokenizer), so I've thought of creating a new flavour/extension for the Nanosets and including them in the 3d_parallelism_unit_tests.yaml workflow.

Toni

TJ-Solergibert commented 6 months ago

Don't merge this PR; I'll open a new one with a truly Nano dataset 👀

Update: Moved to #155