huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Adding memmap input data pipelines #102

Closed TJ-Solergibert closed 6 months ago

TJ-Solergibert commented 8 months ago

What does this PR do?

Solves #45

The current version of Nanotron's input pipelines is based on Hugging Face Datasets and relies on clm_preprocess, which tokenizes and preprocesses the entire dataset at the beginning of training (and, since the result is tied to the sequence_length, it is hard to reuse across different experiments).
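
For context, the old preprocessing essentially tokenizes everything up front and regroups the tokens into fixed-size blocks. A minimal sketch of that idea (illustrative names, not the actual clm_preprocess code):

# Illustrative CLM-style preprocessing: tokenize the whole dataset, then
# regroup into blocks of sequence_length. Because the block size is baked
# in here, the cached result cannot be reused for another sequence_length.
def clm_style_preprocess(texts, tokenizer, sequence_length):
    all_tokens = []
    for text in texts:
        all_tokens.extend(tokenizer(text)["input_ids"])
    # Drop the remainder so every sample has exactly sequence_length tokens
    total = (len(all_tokens) // sequence_length) * sequence_length
    return [all_tokens[i : i + sequence_length] for i in range(0, total, sequence_length)]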

I have developed new data input pipelines based on those included in Megatron. Since I didn't want Nanotron to lose its essence, I removed many functionalities that we don't need (such as those related to BERT model pretraining). What I mainly modified is the torch.utils.data.Dataset; we continue to work with the same Sampler, Collator and DataLoader (with slight modifications), so the behavior of other modules like the PipelineEngine is not altered at all. It also allows us to keep using the previous pipeline based on Hugging Face Datasets, since I added a separate script, run_train_nanoset.py, to launch training with the new pipeline.

Relevant details:

I think we should centralize the input data pipelines and perhaps move the dataloader.py file to another location. I have also marked several functions in this file with # Question: comments, proposing to move them elsewhere.

To use the Nanoset datasets, you need to specify the data_path and split fields in config.data.dataset in the .yaml file and use the script run_train_nanoset.py in the same way as run_train.py.

data:
  dataset:
    data_path: /mloscratch/homes/solergib/s-ai/nanotron/datasets/llama2/europarl-gpt-llama2_text_document
    split: 949,50,1
  num_loading_workers: 0
  seed: 1234
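
With that in place, launching mirrors run_train.py (the config path here is illustrative):

torchrun --nproc-per-node 4 run_train_nanoset.py --config examples/config_nanoset.yaml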

I've published the wandb logs of the different tests I have carried out, comparing the HF Datasets pipeline and the new Nanosets with 1 and 4 GPUs, as well as resuming training from a checkpoint.

This is a first version; I am open to all suggestions you can think of! I have named it Nanoset, but as I said, feel free to change it!

Toni

NouamaneTazi commented 8 months ago

Amazing job 🚀 Feel free to request reviews once you feel the PR is ready!

TJ-Solergibert commented 8 months ago

Hello! Everything is ready now. I've added the following:

In the docs, I've added instructions for using the new datasets. As you can see, it maintains compatibility with the original input data pipelines and with the rest of the project. Basically, the only thing I've modified is the torch.utils.data.Dataset, and we continue to work with practically the same collators and samplers and the same dataloader.

Some small comments and doubts that have arisen:

In short: NanosetBuilder is responsible for building the train, valid, and test datasets. Depending on whether we specify a blend of multiple datasets or a single one, it will create either a BlendedNanoset or a Nanoset for each split. Each Nanoset contains an MMapIndexedDataset, which reads the token bytes from the files generated by preprocess_data.py, while the Nanoset itself holds the positions to read from the MMapIndexedDataset for each sample. The BlendedNanoset contains one Nanoset per specified path and draws the samples to extract from each one according to the specified weights.

Looking forward to hearing your feedback!

Toni

cc @NouamaneTazi

TJ-Solergibert commented 8 months ago

Hello!

I have simplified the dataset builder, the __getitem__ method from Nanoset, and fixed a minor bug with the parser that couldn't properly identify the NanosetDatasetsArgs.

Toni

xrsrke commented 8 months ago

@TJ-Solergibert Hi. Thanks for the fantastic PR. Would be cool if we can add some unit tests for build_nanoset_dataloader, NanosetBuilder(...).build() and BlendedNanoset!

TJ-Solergibert commented 7 months ago

Hello! I've just added the tests.

I tried to stick to the design of the other tests in the repository, but it was also the first time I was developing one 😅. In the end, I've designed a script that tests everything I've included: it starts by creating a .json file like the ones the preprocess_data.py script expects and processes it to generate the .idx and .bin files containing the tokens. Then, we verify that we can create each type of Nanoset (Nanoset and BlendedNanoset) and create the Dataloader. Finally, we check that the content of the batches in every process is correct.

In this verification (assert_batch_dataloader), we ensure that the content of each element of the batch (input_ids, input_mask, label_ids & label_mask) is exactly the same across processes within the same tensor parallel group, and also that the class of each element (distinguishing between tensors with ids, tensors with masks, and TensorPointer) is exactly the same across processes within the same data parallel group.

I have run the pytest locally on a cluster with up to 8 GPUs, and the results were satisfactory (it hurt a little to waste the GPUs, but during development I applied a small patch to be able to start the dist group with "gloo").

As always, I expect your comments!

Toni

TJ-Solergibert commented 7 months ago

Hello!

I think it's a good time for a review. I've further cleaned up the project and made sure to thoroughly document the operation of the Nanosets. Below, I provide all the organized information.

Nanoset

Nanosets are a new type of dataset for Nanotron inspired by those of Megatron. They retain Megatron's performance by dispensing with unnecessary features, while (hopefully) preserving the essence of Nanotron. In essence, a Nanoset is just a torch.utils.data.Dataset, so we maintain compatibility with the rest of the project by slightly modifying some aspects related to data loading, such as the collator or the DistributedSampler. Inside, each Nanoset has an MMapIndexedDataset from which we extract the tokens to build the samples. The main task of the Nanoset is to control the logic for constructing the samples from the tokens of the MMapIndexedDataset.
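
As a minimal sketch of that idea (illustrative names and a simplified indexing scheme, assuming a memory-mapped token array and a precomputed shuffle index; not the actual implementation):

import numpy as np

class NanosetSketch:
    def __init__(self, tokens, shuffle_idx, sequence_length):
        self.tokens = tokens                # e.g. a np.memmap over the .bin file
        self.shuffle_idx = shuffle_idx      # permutation of sample positions
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.shuffle_idx)

    def __getitem__(self, idx):
        # Read sequence_length + 1 tokens so inputs and shifted labels
        # can be built from a single window.
        start = self.shuffle_idx[idx] * self.sequence_length
        window = np.asarray(self.tokens[start : start + self.sequence_length + 1])
        return {"input_ids": window[:-1], "label_ids": window[1:]}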

BlendedNanoset

The BlendedNanoset is used to create a mixture of Nanosets by specifying a weight for each of them. Essentially, the BlendedNanoset is only responsible for ensuring that the dataset indices comply with the specified blend, as the samples themselves are extracted from the Nanosets.
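
The blending idea can be sketched like this (a simplification of build_blending_indices, not the actual code: for each output sample, pick the dataset currently most under-represented with respect to its weight):

import numpy as np

def blend_indices(weights, size):
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    counts = np.zeros(len(weights), dtype=np.int64)
    dataset_index = np.empty(size, dtype=np.int64)         # which Nanoset to read from
    dataset_sample_index = np.empty(size, dtype=np.int64)  # which sample within it
    for i in range(size):
        d = int(np.argmax(weights * (i + 1) - counts))
        dataset_index[i] = d
        dataset_sample_index[i] = counts[d]
        counts[d] += 1
    return dataset_index, dataset_sample_index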

NanosetDatasetsArgs

To use the Nanosets, I have added a new configuration to the config .yaml file (NanosetDatasetsArgs). The user only needs to enter the dataset(s) they want to use, how they want to divide the dataset into train, valid, and test partitions, and optionally a directory to store Nanoset metadata for reusing the same configuration across different runs.
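
For a blend of datasets, the configuration could look something like the sketch below; the exact field names for the weights and the metadata directory are hypothetical here, so check the docs in the PR for the real schema:

data:
  dataset:
    data_path:                             # hypothetical blend syntax
      - path: datasets/yelp_review_full_text_document
        weight: 0.7
      - path: datasets/testing_alpaca_small_text_document
        weight: 0.3
    split: 949,50,1
    path_to_cache: .nanoset_cache          # hypothetical name for the metadata directory
  num_loading_workers: 0
  seed: 1234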

Tools

Added tools for preprocessing the datasets into the .idx and .bin files that the Nanosets consume (tools/preprocess_data.py).

Docs

In the docs, I've added instructions for preprocessing the data, using the Nanosets, and how they work under the hood. I strongly recommend taking a look at the examples of how the samples are constructed to understand the operation of the Nanosets.

Short summary

In short: NanosetBuilder is responsible for building the train, valid, and test datasets. Depending on whether we specify a blend of multiple datasets or a single one, it will create either a BlendedNanoset or a Nanoset for each split. Each Nanoset contains an MMapIndexedDataset, which reads the token bytes from the files generated by preprocess_data.py, while the Nanoset itself holds the positions to read from the MMapIndexedDataset for each sample. The BlendedNanoset contains one Nanoset per specified path and draws the samples to extract from each one according to the specified weights.
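
Schematically, the builder's dispatch looks like this (a sketch assuming the Nanoset, BlendedNanoset, and MMapIndexedDataset classes from this PR; the real NanosetBuilder also handles index caching):

def build_datasets(data_paths, weights, splits, sequence_length):
    # One (Blended)Nanoset per split (train, valid, test)
    datasets = []
    for split in splits:
        if len(data_paths) == 1:
            datasets.append(Nanoset(MMapIndexedDataset(data_paths[0]), split, sequence_length))
        else:
            nanosets = [Nanoset(MMapIndexedDataset(p), split, sequence_length) for p in data_paths]
            datasets.append(BlendedNanoset(nanosets, weights))
    return datasets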

Tests

I tried to stick to the design of the other tests in the repository, but it was also the first time I was developing one 😅. In the end, I've designed a script that tests everything I've included: it starts by creating a .json file like the ones the preprocess_data.py script expects and processes it to generate the .idx and .bin files containing the tokens. Then, we verify that we can create each type of Nanoset (Nanoset and BlendedNanoset) and create the Dataloader. Finally, we check that the content of the batches in every process is correct.

In this verification (assert_batch_dataloader), we ensure that the content of each element of the batch (input_ids, input_mask, label_ids & label_mask) is exactly the same across processes within the same tensor parallel group, and also that the class of each element (distinguishing between tensors with ids, tensors with masks, and TensorPointer) is exactly the same across processes within the same data parallel group. Also, in the case of BlendedNanoset, we verify that it is composed of more than one Nanoset and that each Nanoset has enough samples to satisfy the BlendedNanoset.
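
The core of that cross-rank check can be sketched like this (a simplification of assert_batch_dataloader; tp_group is illustrative):

import torch
import torch.distributed as dist

def assert_same_across_group(tensor, tp_group):
    # Gather the batch tensor from every rank in the tensor parallel group
    # and assert that all ranks hold exactly the same values.
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size(tp_group))]
    dist.all_gather(gathered, tensor, group=tp_group)
    for other in gathered:
        assert torch.equal(tensor, other), "Batch differs across TP ranks"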

Python vs C++

In the last commit, I replaced the C++ functions we were using with Python implementations. As expected, and as seen in this notebook, the C++ functions are dramatically faster than the Python ones, which can even block the training start for minutes. For the function used to create the indices of the BlendedNanoset (build_blending_indices), Python spends 44 seconds to create a dataset of 1e7 samples and over 7 minutes for 1e8 samples, while C++ only takes 0.1 and 1 second, respectively. For the function used in each Nanoset to build sequence-length samples from the documents (build_sample_idx), Python spends 35 seconds on 1e7 documents (each entry in the json file is considered a document), while the same function in C++ takes 0.5 seconds. I believe it is well worth keeping the C++ implementations.
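
To make the gap concrete, the Python version of build_sample_idx is essentially a per-sample loop like the following sketch (simplified from Megatron's logic; the exact bookkeeping differs), and tight loops like this are exactly what C++ accelerates:

import numpy as np

def build_sample_idx_py(sizes, doc_idx, seq_length, num_samples):
    # For each sample, record the document and offset where it starts,
    # walking through the (shuffled) documents token by token.
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)
    doc_pos, offset = 0, 0
    for i in range(1, num_samples + 1):
        remaining = seq_length
        while remaining > 0:
            available = sizes[doc_idx[doc_pos]] - offset
            if available > remaining:
                offset += remaining
                remaining = 0
            else:
                remaining -= available
                doc_pos += 1
                offset = 0
        sample_idx[i] = (doc_pos, offset)
    return sample_idx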

Other comments

That would be all, tell me what you think!

Toni

TJ-Solergibert commented 7 months ago

Hello!

I have finally polished the two remaining fronts: MMapIndexedDataset and preprocess_data.py.

And where did all the sample and document index machinery from the Nanoset go? We actually don't need it, as the only thing that interests us is extracting sequence length + 1 tokens from the mmap dataset; this way, we save a lot of time in the index-building process. Now we simply look at how many tokens our mmap array holds; from the number of tokens and the sequence length we compute the total number of samples we can generate from the array and divide them into train, valid, and test.
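
A quick worked example of this token-level accounting (the exact formula in the code may differ slightly; this is just the idea):

# With a .bin file holding 1,000,000 tokens and sequence_length = 1024,
# each sample consumes sequence_length tokens plus one extra token for
# the shifted labels, so:
num_tokens = 1_000_000
sequence_length = 1024
num_samples = (num_tokens - 1) // sequence_length  # 976 samples
# These samples are then divided into train, valid, and test according
# to the configured split (e.g. 949,50,1).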

What we do keep, in order to access the Nanoset indices correctly, is the shuffle index and the same logic we used before to build it (concatenating shuffled indices in the train split to ensure that we have enough samples), since it is worth remembering that the 3 splits access the same MMapIndexedDataset, just at different positions.
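
The shuffle index construction can be sketched as follows (illustrative; the real code also handles the per-split offsets into the MMapIndexedDataset):

import numpy as np

def build_train_shuffle_idx(samples_in_split, samples_needed, seed):
    rng = np.random.RandomState(seed)
    # Concatenate as many shuffled passes over the split as needed to
    # cover the requested number of training samples.
    num_epochs = -(-samples_needed // samples_in_split)  # ceil division
    shuffle_idx = np.concatenate([rng.permutation(samples_in_split) for _ in range(num_epochs)])
    return shuffle_idx[:samples_needed]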

I have updated the docs and here I have uploaded the logs with the same configuration that I published a few days ago. The only thing that changes is the number of samples per Nanoset split, as we now do the division at the token level and not at the document level. I have also added a new test to verify that we are able to reset the state of the dataloader when restarting training. I have removed the license, as we do not depend on any other project. In the end, we only rely on numpy.memmap like many other projects (Megatron, OLMo, etc.).

Toni

TJ-Solergibert commented 7 months ago

Hello!

Thank you for your comments, I will review them later. Regarding what you mentioned, you have to start the training with run_train_nanoset.py and not run_train.py. If you prefer, I can merge everything into a single script.

Toni

TJ-Solergibert commented 7 months ago

Hello!

I have integrated the Nanosets into run_train.py and added a yaml config file to run experiments. Beforehand, it is necessary to download and preprocess two datasets: I have chosen yelp_review_full and HuggingFaceH4/testing_alpaca_small from the Hugging Face Hub, and we use the GPT2 tokenizer.

We download and preprocess the datasets as follows:

python3 tools/preprocess_data.py \
       --input yelp_review_full \
       --split train \
       --output-prefix datasets/yelp_review_full \
       --pretrained-model-name-or-path gpt2 \
       --num-workers 16
python3 tools/preprocess_data.py \
       --input HuggingFaceH4/testing_alpaca_small \
       --split train \
       --column completion \
       --output-prefix datasets/testing_alpaca_small \
       --pretrained-model-name-or-path gpt2 \
       --num-workers 16

We launch the job with:

torchrun --nproc-per-node 4 run_train.py --config examples/config_nanoset.yaml

I have tested it with a setup with 4 GPUs.

Toni.

Update: I've seen that some tests have failed: 2 from test_build_nanoset_dataloader and 1 from test_recover_nanoset_dataloader. The errors aren't assertion errors but torch distributed errors complaining about binding sockets. I also experienced these errors in my setup, but reducing the number of workers a bit further solved the issue.

I've pushed another version that keeps the 2 tests separate but reduces the number of parametrized configs, so we run fewer tests. From what I've experienced, the tests fail with this PyTorch error about 1% of the time, and always in a different config. The last resort would be returning to a single test, which passed without any issues.

Update 2: Adding back the @rerun_if_address_is_in_use() decorator solves all the issues.
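
For reference, the decorator is applied on top of the distributed tests, roughly like this (the import path from the repo's test helpers is assumed here):

from helpers.utils import rerun_if_address_is_in_use  # import path assumed

@rerun_if_address_is_in_use()
def test_build_nanoset_dataloader():
    ...  # spawns the distributed workers; reruns if the port is already bound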

TJ-Solergibert commented 7 months ago

Hi!

Throughout this week, I'll check all your recommendations, thank you!

transformers and datasets are necessary for data preprocessing (mainly transformers, for the tokenizer), so I've thought of creating a new flavour/extension for the Nanosets and including them in the 3d_parallelism_unit_tests.yaml workflow.

Toni

TJ-Solergibert commented 6 months ago

Don't merge this PR; I'll open a new one with a truly Nano dataset 👀

Update: Moved to #155