Closed snarayan21 closed 1 month ago
Ran the following scripts successfully: `convert_dataset_hf.py`, `convert_dataset_json.py`, `convert_finetuning_dataset.py`, `convert_text_to_mds.py`. Updated the data prep README with instructions for `convert_text_to_mds.py`.
Using the Shakespeare text file here, models trained with and without this branch produce deterministic loss curves. One set of runs used global batch size 32, the other global batch size 256. See the wandb project here.
Foundry regression tests are partially borked right now because of a small bug that's getting addressed in the release branch, but the tests that did run all succeeded. See here.
Depending on the vocab size, users can encode their token IDs using various int formats. Previously, we only allowed int64, which covers an absurdly high vocab size. Enabling encoding/decoding of tokens as uint32 or uint16, for example, lets people save space on their datasets, since the max vocab size supported is ~4 billion with uint32 or ~65k with uint16. This has been added to both the text and finetuning datasets.
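As a rough illustration of the savings (plain NumPy, not foundry code; the example token values are arbitrary):

```python
import numpy as np

# uint16 covers vocabs up to 65,535 token IDs; uint32 up to ~4.29 billion.
assert np.iinfo(np.uint16).max == 65_535
assert np.iinfo(np.uint32).max == 4_294_967_295

# A sequence of token IDs stored as int64 (the old default).
tokens = np.array([101, 2023, 2003, 102], dtype=np.int64)

# Downcasting to uint32 halves the storage; uint16 quarters it.
assert tokens.nbytes == 32                        # 4 tokens * 8 bytes
assert tokens.astype(np.uint32).nbytes == 16      # 4 tokens * 4 bytes
assert tokens.astype(np.uint16).nbytes == 8       # 4 tokens * 2 bytes
```

For a tokenizer like GPT-2's (~50k vocab), uint32 is a safe default, while uint16 only works if every ID fits under 65,536.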
This PR also lets users specify their MDS dataset columns using `ndarray` types to enable automatic encoding/decoding of samples. This was already present for the finetuning dataset, so the functionality has been added for the generic text dataset. Accordingly, I've changed the default value in our MDS conversion scripts to `ndarray:uint32` instead of `bytes` and made the relevant changes to get this working. Added unit tests checking that this works for the text and finetuning datasets and that an error is thrown for incompatible encoding types.
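For context on what the `ndarray` column type automates, here's a sketch of the manual round-trip a `bytes` column required (plain NumPy; the `tokens` variable name is illustrative):

```python
import numpy as np

# With a `bytes` column, the conversion script had to serialize token
# arrays by hand, and the dataloader had to know the dtype to decode them.
tokens = np.array([7, 42, 65_535], dtype=np.uint32)

raw = tokens.tobytes()                          # what a `bytes` column stores
decoded = np.frombuffer(raw, dtype=np.uint32)   # caller must supply the dtype

assert np.array_equal(decoded, tokens)
```

With an `ndarray:uint32` column, the dtype travels with the column spec, so neither side needs this boilerplate.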
Moved a util function applicable to both the text and finetuning dataloaders to a common location for import; it had previously been implemented twice with identical functionality.