Closed TJ-Solergibert closed 6 months ago
Amazing job! Feel free to request reviews once you feel the PR is ready!
Hello! Everything is ready now. I've added the following:
- `Nanoset`: A new dataset inspired by Megatron.
- `BlendedNanoset`: Similar to the previous one, but for mixing different datasets. It allows specifying the weight of each dataset.
- `MMapIndexedDataset`: Both `Nanoset` and `BlendedNanoset` contain an `MMapIndexedDataset`, where the data is actually stored (they query the data using the `get` method of `MMapIndexedDataset`).
- `NanosetBuilder`: Responsible for creating the train, valid, and test datasets, whether they are of `Nanoset` or `BlendedNanoset` type.
- `NanosetDatasetsArgs` to configure the new datasets. I've also added `val_steps` to `TokensArgs` to generate valid and train datasets.
- [...] (`tools/preprocess_data.py`). Inspired by `Megatron-LM/tools/preprocess_data.py`, I have developed a simpler version, getting rid of several features we don't need, such as sentence splitting or multimodal datasets. I've also enforced that the tokenizer is built with `AutoTokenizer.from_pretrained`. I've kept Megatron's input format, which is based on a .json file containing the samples, to make it easier to transition to Nanotron.
- [...] `tools/preprocess_data.py` (`tools/hf_datasets_to_json.py`).
- [...] (`tools/merge_datasets.py`).
In the docs, I've added instructions for using the new datasets. As you can see, it maintains compatibility with the original input data pipelines and with the rest of the project. Basically, the only thing I've modified is the `torch.utils.data.Dataset`, and we continue to work with practically the same collators and samplers and the same dataloader.
Some small comments and doubts that have arisen:
- [...] `src/nanotron/data/helpers.cpp`. This is done during `DistributedTrainer.init`.
- In `src/nanotron/dataloader.py`, there are the `SkipBatchSampler`, `EmptyInfiniteDataset`, and `get_dataloader_worker_init` that I use in `src/nanotron/data/dataloader_builder.py`. Perhaps we could move them to `nanotron/data/dataloader_builder.py` to centralize this section of the project a bit.
- In the `log_rank` function of `src/nanotron/logging.py`, I've added a try-except. I did this to allow running preprocessing scripts without initializing a distributed process group. We can leave it like this or initialize the distributed process group within the preprocessing files.
- In `tools/preprocess_data.py`, I've added a small trick to preprocess the data without installing Nanotron. Let me know if you think it's interesting or if we should get rid of it.
In short: `NanosetBuilder` is responsible for building the train, valid, and test datasets. Depending on whether we specify a blend of multiple datasets or not, it will create either a `BlendedNanoset` or a `Nanoset` for each split. Each `Nanoset` contains an `MMapIndexedDataset`, which is the dataset that contains and reads the bytes with the tokens from the files generated by `preprocess_data.py`, while the `Nanoset` itself contains "positions" to read from the `MMapIndexedDataset` for each sample. The `BlendedNanoset` contains one `Nanoset` with the samples to extract from the `MMapIndexedDataset` for each specified path, complying with the specified weights.
Looking forward to hearing your feedback!
Toni
cc @NouamaneTazi
Hello!
I have simplified the dataset builder and the `__getitem__` method from `Nanoset`, and fixed a minor bug with the parser that couldn't properly identify the `NanosetDatasetsArgs`.
Toni
@TJ-Solergibert Hi. Thanks for the fantastic PR. It would be cool if we could add some unit tests for `build_nanoset_dataloader`, `NanosetBuilder(...).build()` and `BlendedNanoset`!
Hello! I've just added the tests.
I tried to stick to the design of the other tests in the repository, but it was also the first time I was developing one. Finally, I've designed a script that tests everything I've included: it starts by creating a .json file like the ones the `preprocess_data.py` script expects and processes it to generate the .idx and .bin files containing the tokens. Then, we verify that we can create each type of Nanoset (`Nanoset` and `BlendedNanoset`) and create the `Dataloader`. Finally, we check that the content of the batches in each and every process is appropriate.
In this verification (`assert_batch_dataloader`), we ensure that the content of each element of the batch (`input_ids`, `input_mask`, `label_ids` & `label_mask`) is exactly the same across processes within the same tensor parallel group, and also that the class of each element (distinguishing between tensors with ids, tensors with masks, and `TensorPointer`) is exactly the same across processes within the same data parallel group.
I have run pytest locally on a cluster with up to 8 GPUs, and the results have been satisfactory (it hurt a little to waste the GPUs, but while developing the tests I applied a small patch to be able to start the dist group with "gloo").
As always, I look forward to your comments!
Toni
Hello!
I think it's a good time for a review. I've further cleaned up the project and made sure to thoroughly document the operation of the Nanosets. Below, I provide all the organized information.
`Nanosets` are a new type of dataset for Nanotron inspired by those of Megatron. They maintain their performance by dispensing with unnecessary features and (trying to) preserving the essence of Nanotron. In essence, a Nanoset is just a `torch.utils.data.Dataset`, so we maintain compatibility with the rest of the project by slightly modifying some aspects related to data loading, such as the collator or the DistributedSampler. Inside, each Nanoset has an `MMapIndexedDataset` from which we extract the tokens to build the samples. The main task of the `Nanoset` is to control the logic for constructing the samples from the tokens of the `MMapIndexedDataset`.
The `BlendedNanoset` is used to create a mixture of `Nanosets` by specifying the weights for each of them. Essentially, the `BlendedNanoset` is only responsible for ensuring that the dataset indices comply with the specified blend, as the samples are extracted from the `Nanosets`.
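To make this division of labour concrete, here is a minimal sketch of how a blended dataset can delegate every read to its constituents. The class `ToyBlend` and the two index arrays `dataset_index` and `dataset_sample_index` are hypothetical names chosen for illustration; this is not the code in this PR.

```python
import numpy as np


class ToyBlend:
    """Hypothetical blend: only stores which dataset and which sample to read."""

    def __init__(self, datasets, dataset_index, dataset_sample_index):
        self.datasets = datasets                          # list of indexable datasets
        self.dataset_index = dataset_index                # blend position -> dataset id
        self.dataset_sample_index = dataset_sample_index  # blend position -> sample id

    def __len__(self):
        return len(self.dataset_index)

    def __getitem__(self, idx):
        dataset_id = self.dataset_index[idx]
        sample_id = self.dataset_sample_index[idx]
        # The sample itself always comes from the underlying dataset.
        return self.datasets[dataset_id][sample_id]


# Two toy "Nanosets" (plain lists here) blended 2:1.
blend = ToyBlend(
    datasets=[["a0", "a1", "a2", "a3"], ["b0", "b1"]],
    dataset_index=np.array([0, 0, 1, 0, 0, 1]),
    dataset_sample_index=np.array([0, 1, 0, 2, 3, 1]),
)
samples = [blend[i] for i in range(len(blend))]
```

The blend holds no tokens of its own, which is why its only job is getting the two index arrays right.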
To use the Nanosets, I have added a new configuration to the config .yaml file (`NanosetDatasetsArgs`). The user only needs to enter the dataset(s) they want to use, how they want to divide the dataset into train, valid, and test partitions, and optionally a directory to store `Nanoset` metadata for reusing the same configuration across different runs.
Added tools for:
- [...] (`tools/preprocess_data.py`). Inspired by `Megatron-LM/tools/preprocess_data.py`, I have developed a simpler version, getting rid of several features we don't need, such as sentence splitting or multimodal datasets. I've also enforced that the tokenizer is built with `AutoTokenizer.from_pretrained`. I've kept Megatron's input format, which is based on a .json file containing the samples, to make it easier to transition to Nanotron.
- [...] `tools/preprocess_data.py` (`tools/hf_datasets_to_json.py`).
- [...] (`tools/merge_datasets.py`).
In the docs, I've added instructions for preprocessing the data and using the Nanosets, and an explanation of how they work under the hood. I strongly recommend taking a look at the examples of how the samples are constructed to understand the operation of the Nanosets.
In short: `NanosetBuilder` is responsible for building the train, valid, and test datasets. Depending on whether we specify a blend of multiple datasets or not, it will create either a `BlendedNanoset` or a `Nanoset` for each split. Each `Nanoset` contains an `MMapIndexedDataset`, which is the dataset that contains and reads the bytes with the tokens from the files generated by `preprocess_data.py`, while the `Nanoset` itself contains "positions" to read from the `MMapIndexedDataset` for each sample. The `BlendedNanoset` contains one `Nanoset` with the samples to extract from the `MMapIndexedDataset` for each specified path, complying with the specified weights.
I tried to stick to the design of the other tests in the repository, but it was also the first time I was developing one. Finally, I've designed a script that tests everything I've included: it starts by creating a .json file like the ones the `preprocess_data.py` script expects and processes it to generate the .idx and .bin files containing the tokens. Then, we verify that we can create each type of Nanoset (`Nanoset` and `BlendedNanoset`) and create the `Dataloader`. Finally, we check that the content of the batches in each and every process is appropriate.
In this verification (`assert_batch_dataloader`), we ensure that the content of each element of the batch (`input_ids`, `input_mask`, `label_ids` & `label_mask`) is exactly the same across processes within the same tensor parallel group, and also that the class of each element (distinguishing between tensors with ids, tensors with masks, and `TensorPointer`) is exactly the same across processes within the same data parallel group. Also, in the case of `BlendedNanoset`, we verify that it is composed of more than one `Nanoset` and that each `Nanoset` has enough samples to satisfy the `BlendedNanoset`.
In the last commit, I replaced the functions we were using from CPP with Python functions. As expected and as seen in this notebook, the CPP functions are dramatically faster than the Python ones, which can even block the start of training for minutes. In the case of the function used to create the indices of the `BlendedNanoset` (`build_blending_indices`), in Python we would spend 44 seconds to create a dataset of 1e7 samples and over 7 minutes for 1e8 samples, while in CPP we would only wait 0.1 and 1 second, respectively. Regarding the function used in each `Nanoset` to build samples of sequence length from the documents (`build_sample_idx`), we spend 35 seconds if we have 1e7 documents (each entry in the json file is considered a document), while the same function in CPP would take 0.5 seconds. I believe it is well worth using the ones developed in CPP.
- In `src/nanotron/dataloader.py`, there are the `SkipBatchSampler`, `EmptyInfiniteDataset`, and `get_dataloader_worker_init` that I use in `src/nanotron/data/dataloader_builder.py`. Perhaps we could move them to `nanotron/data/dataloader_builder.py` to centralize this section of the project a bit.
- In the `log_rank` function of `src/nanotron/logging.py`, I've added a try-except. I did this to allow running the preprocessing scripts without initializing a distributed process group. We can leave it like this or initialize the distributed process group within the preprocessing files.
- In `tools/preprocess_data.py`, I've added a small trick to preprocess the data without installing Nanotron. Let me know if you think it's interesting or if we should get rid of it.
- [...] `transformers` to the requirements of the project.
That would be all, tell me what you think!
Toni
Hello!
I have finally polished the two remaining fronts: `MMapIndexedDataset` and `preprocess_data.py`.
I have completely redesigned the preprocessing script, simplifying it as much as possible. Now the user can use both Hugging Face Datasets and JSON files similar to those used in other projects like Megatron. Basically, once we build the Dataset, we shard it among the workers. Each worker is responsible for tokenizing the text in its shard and storing it in a numpy array. Finally, we concatenate the arrays produced by all the workers. This way, we get rid of the trick to import nanotron without installing it and revert to the original `log_rank` function (which we had modified because it required initializing the distributed group). However, as we are now working with Hugging Face Datasets, we need to add Datasets as a project requirement. One last note: we store the tokens with `dtype = np.uint16`, which limits us to tokenizers with vocabularies < 65536. We could change it to int32, although that would double the storage space.
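The shard, tokenize, concatenate flow and the dtype trade-off can be sketched as follows. Everything here is a stand-in: `toy_tokenize`, `pick_token_dtype` and `preprocess` are hypothetical names, the "workers" are a plain loop rather than real processes, and the tokenizer is a toy, not `AutoTokenizer`.

```python
import numpy as np


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): one id per whitespace-separated word."""
    return [len(word) for word in text.split()]


def pick_token_dtype(vocab_size):
    # uint16 holds ids 0..65535; fall back to int32 for larger vocabularies.
    return np.uint16 if vocab_size - 1 <= np.iinfo(np.uint16).max else np.int32


def preprocess(texts, num_workers, vocab_size):
    dtype = pick_token_dtype(vocab_size)
    # Shard the documents round-robin, one shard per worker.
    shards = [texts[w::num_workers] for w in range(num_workers)]
    # Each "worker" (a plain loop here, no multiprocessing) tokenizes its shard.
    arrays = []
    for shard in shards:
        ids = [tok for text in shard for tok in toy_tokenize(text)]
        arrays.append(np.array(ids, dtype=dtype))
    # One flat token array; document order follows the shard order.
    return np.concatenate(arrays)


tokens = preprocess(["a bb ccc", "dddd", "ee f"], num_workers=2, vocab_size=50257)
bytes_per_token_small = np.dtype(pick_token_dtype(50257)).itemsize
bytes_per_token_large = np.dtype(pick_token_dtype(100_000)).itemsize
```

With a GPT-2 sized vocabulary (50257) the sketch stays on 2 bytes per token; a vocabulary of 100,000 would need 4 bytes per token, which is the doubling mentioned above.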
I have also redesigned the `MMapIndexedDataset`, the heart of the `Nanosets`. Now it is much simpler, as it only contains the `np.memmap` to read the tokens. It has a `__getitem__` method, although the important one is `get`, which allows us to access an `offset` position of the array and extract `length` tokens. We don't need more for CausalLM.
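As an illustration of how little is needed, here is a minimal sketch of a memmap-backed token store with a `get(offset, length)` method. `ToyMMapDataset` is a hypothetical stand-in written from the description above, not the class in this PR, and the flat `.bin` layout is an assumption.

```python
import os
import tempfile

import numpy as np


class ToyMMapDataset:
    """Hypothetical stand-in: a read-only np.memmap over a flat token file."""

    def __init__(self, path, dtype=np.uint16):
        self.tokens = np.memmap(path, dtype=dtype, mode="r")

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        return self.tokens[idx]

    def get(self, offset, length):
        # Copy `length` tokens starting at position `offset` out of the mmap.
        return np.array(self.tokens[offset : offset + length])


# Write ten tokens to a temporary .bin file and read a window back.
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
np.arange(10, dtype=np.uint16).tofile(path)
dataset = ToyMMapDataset(path)
window = dataset.get(offset=3, length=4)
```

Only the requested window is read from disk, so the full token file never has to fit in memory.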
And where did all the sample and document index things from the `Nanoset` go? Actually, we don't need them, as the only thing that interests us is extracting `sequence length + 1` tokens from the mmap dataset. This way, we save a lot of time when building the indices. Now we simply look at how many tokens our mmap array has: from the number of tokens and the sequence length, we compute the total number of samples that we can generate from the array and divide them into train, valid, and test.
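The token-level arithmetic can be sketched as below. The helper names are hypothetical, and the sketch assumes that consecutive samples overlap by one token (so that the label of the last position is available); the comment above does not spell out whether the PR does exactly this.

```python
def count_samples(num_tokens, sequence_length):
    # Sample i covers tokens [i * L, i * L + L + 1), so the last token of one
    # sample is the first token of the next (assumption, see lead-in).
    return (num_tokens - 1) // sequence_length


def split_boundaries(num_samples, fractions):
    """Turn split fractions (summing to 1) into [start, end) sample ranges."""
    assert abs(sum(fractions) - 1.0) < 1e-6
    bounds, start = [], 0
    for i, frac in enumerate(fractions):
        is_last = i == len(fractions) - 1
        # The last split absorbs any rounding remainder.
        end = num_samples if is_last else start + int(round(frac * num_samples))
        bounds.append((start, end))
        start = end
    return bounds


num_samples = count_samples(num_tokens=1_000_001, sequence_length=1000)
train, valid, test = split_boundaries(num_samples, [0.8, 0.1, 0.1])
```

No per-document index is built here: the split boundaries follow directly from two integers, which is where the time saving comes from.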
What we do keep, to access the `Nanoset` indices correctly, is the shuffle index and the same logic we used before to build them (concatenating them in the train split to ensure that we have enough samples). It is worth remembering that the 3 splits access the same `MMapIndexedDataset`, but at different positions.
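A sketch of such a shuffle index follows, assuming each split owns a contiguous range of sample positions and that extra epochs are obtained by concatenating independently shuffled copies. `build_shuffle_index`, its parameters and the seed are hypothetical.

```python
import numpy as np


def build_shuffle_index(split_start, split_end, samples_needed, seed=1234):
    """Hypothetical helper: shuffled positions inside [split_start, split_end)."""
    rng = np.random.RandomState(seed)
    split_size = split_end - split_start
    # Repeat the split as many times (epochs) as needed to cover the request.
    num_epochs = max(1, -(-samples_needed // split_size))  # ceiling division
    epochs = []
    for _ in range(num_epochs):
        positions = np.arange(split_start, split_end, dtype=np.int64)
        rng.shuffle(positions)
        epochs.append(positions)
    return np.concatenate(epochs)


# Train split owns samples [0, 800); we need 2000 samples -> 3 shuffled epochs.
train_index = build_shuffle_index(0, 800, samples_needed=2000)
# Valid split owns [800, 900) of the same underlying array.
valid_index = build_shuffle_index(800, 900, samples_needed=100)
```

Both indices point into the same token array; only the ranges they cover differ.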
I have updated the docs and here I have uploaded the logs with the same configuration that I published a few days ago. The only thing that changes is the number of samples per Nanoset split, as we now do the division at the token level and not at the document level. I have also added a new test to verify that we are able to reset the state of the dataloader when restarting training. I have removed the license, as we do not depend on any other project. In the end, we only rely on `numpy.memmap`, like many other projects (Megatron, OLMo, etc.).
Toni
Hello!
Thank you for your comments, I will review them later. Regarding what you mentioned, you have to start the training with `run_train_nanoset.py` and not `run_train.py`. If you prefer, I can merge everything.
Toni
Hello!
I have integrated Nanosets into `run_train.py`. I have added a yaml config file to run experiments. Beforehand, you need to download and preprocess two datasets. I have chosen yelp_review_full and HuggingFaceH4/testing_alpaca_small from the Hugging Face Hub, and we use the GPT2 Tokenizer.
We download and preprocess the datasets as follows:
python3 tools/preprocess_data.py \
--input yelp_review_full \
--split train \
--output-prefix datasets/yelp_review_full \
--pretrained-model-name-or-path gpt2 \
--num-workers 16
python3 tools/preprocess_data.py \
--input HuggingFaceH4/testing_alpaca_small \
--split train \
--column completion \
--output-prefix datasets/testing_alpaca_small \
--pretrained-model-name-or-path gpt2 \
--num-workers 16
We launch the job with:
torchrun --nproc-per-node 4 run_train.py --config examples/config_nanoset.yaml
I have tested it with a setup with 4 GPUs.
Toni.
Update: I've seen that some tests have failed: 2 from `test_build_nanoset_dataloader` and 1 from `test_recover_nanoset_dataloader`. The errors aren't assertion errors but torch distributed errors complaining about binding sockets. I also experienced these errors in my setup, but reducing the number of workers a bit more solved the issue.
I've pushed another version keeping the 2 tests separate, but reducing the number of parametrized configs, so we run fewer tests. From what I've experienced, the tests fail due to this PyTorch error 1% of the time, and always in a different config. The last resort would be returning to a single test, as it passed without any issues.
Update2: Adding back the `@rerun_if_address_is_in_use()` decorator solves all the issues.
Hi!
Throughout this week, I'll check all your recommendations, thank you!
`transformers` and `datasets` are necessary for data processing (mainly `transformers` for the tokenizer), so I've thought of creating a new flavour/extension for the Nanosets and including them in the 3d_parallelism_unit_tests.yaml workflow.
Toni
Don't merge this PR; I'll open a new one with a truly Nano dataset.
Update: Moved to #155
What does this PR do?
Solves #45
The current version of Nanotron's input pipelines is based on Hugging Face Datasets and relies on `clm_preprocess`, which tokenises and preprocesses the entire dataset at the beginning of the training (linked to the `sequence_length`, making it even more difficult to reuse the result across different experiments).
I have developed new data input pipelines based on those included in Megatron. Since I didn't want Nanotron to lose its essence, I removed many functionalities that we don't need (such as those related to BERT models pretraining). What I mainly modified is the `torch.utils.data.Dataset`, and we continue to work with the same Sampler, Collator and DataLoader (* I had to modify them slightly), so it doesn't alter the behavior of other modules like the PipelineEngine at all. It also allows us to continue using the previous pipeline based on Hugging Face Datasets, since I added the script `run_train_nanoset.py` to launch the training with the new pipeline.
Relevant details:
- `NanosetDatasetsArgs`, which can replace `PretrainDatasetsArgs`. You only need to specify the path to the dataset (generated by Megatron's preprocess_data.py, without the extension as they specify) and the distribution of the dataset samples for each of the splits (train, valid, and test) so that it sums up to 1.
- `Nanoset` will be the new dataset format. It is a lighter version of `GPTDataset` and `MegatronDataset` from Megatron.
- `NanosetBuilder`, which, based on a `NanosetConfig` (contains `NanosetDatasetsArgs` + other details), will build a `Nanoset` for each split. In this first version, we only support one dataset file, but I will include the possibility of using multiple files (BlendedNanoset), hence preserving the `NanosetBuilder`.
- `Nanoset` contains an `MMapIndexedDataset`. This object is found in `indexed_dataset.py` and comes from fairseq. Megatron also includes it as such.
I think maybe we should centralize the input data pipelines and perhaps move the `dataloader.py` file to another location. I also propose moving several functions from this file with the comments # Question:.
To use the Nanoset datasets, you need to specify the `data_path` and `split` fields in `config.data.dataset` in the .yaml file and use the script `run_train_nanoset.py` in the same way as `run_train.py`.
I've published the wandb logs of the different tests I have carried out, comparing the HF Datasets and the new Nanoset datasets with 1 and 4 GPUs and resuming training from a checkpoint.
This is a first version, I am open to all suggestions you can think of! I have named it Nanoset, but as I said, you are free to change it!
Toni