mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Adding more token encoding types #1254

Closed snarayan21 closed 1 month ago

snarayan21 commented 1 month ago

Depending on their vocab size, users can encode token IDs using various integer formats. Previously, we only allowed int64, which covers a far larger vocab size than most models need. Enabling encoding/decoding of tokens as uint32 or uint16, for example, lets users save space in their datasets: the maximum vocab size supported is ~4 billion with uint32 and ~65k with uint16. This has been added to both the text and finetuning datasets.
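To make the trade-off concrete, here's a small sketch (plain NumPy, not llm-foundry code; `max_vocab_size` is a hypothetical helper) comparing the vocab capacity and on-disk size of each dtype:

```python
import numpy as np

# Hypothetical helper: largest vocab size a dtype can address.
# Token IDs are 0-indexed, so capacity is max value + 1.
def max_vocab_size(dtype) -> int:
    return int(np.iinfo(dtype).max) + 1

token_ids = list(range(1000))
for dtype in (np.uint16, np.uint32, np.int64):
    encoded = np.array(token_ids, dtype=dtype).tobytes()
    # uint16: 2 bytes/token, uint32: 4 bytes/token, int64: 8 bytes/token
    print(f'{dtype.__name__}: vocab capacity {max_vocab_size(dtype)}, '
          f'{len(encoded)} bytes for {len(token_ids)} tokens')
```

So switching a dataset from int64 to uint16 cuts token storage by 4x whenever the tokenizer's vocab fits in 65,536 entries.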

This PR also lets users specify their MDS dataset columns using ndarray types, enabling automatic encoding/decoding of samples. This was already present for the finetuning dataset, so the functionality has been added for the generic text dataset as well. Accordingly, I've changed the default value in our MDS conversion scripts to ndarray:uint32 instead of bytes and made the relevant changes to get this working.
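The round trip an `ndarray:uint32` column performs on a sample's token IDs amounts to the following sketch (function names here are illustrative, not the actual llm-foundry/streaming internals):

```python
import numpy as np

# Illustrative encode/decode pair for a token-ID column stored with a
# fixed integer dtype, as an ndarray-typed MDS column would do.
def encode_sample(token_ids, dtype=np.uint32) -> bytes:
    # Writer side: serialize token IDs at the chosen width.
    return np.asarray(token_ids, dtype=dtype).tobytes()

def decode_sample(raw: bytes, dtype=np.uint32) -> np.ndarray:
    # Reader side: reinterpret the raw bytes with the same dtype.
    return np.frombuffer(raw, dtype=dtype)

sample = encode_sample([101, 7592, 2088, 102])
assert decode_sample(sample).tolist() == [101, 7592, 2088, 102]
```

With the dtype declared in the column spec, the reader no longer has to guess the width of raw `bytes`, which is what makes the automatic decoding possible.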

Added unit tests verifying that this works for the text and finetuning datasets and that an error is thrown for incompatible encoding types.

Moved a util function applicable to both the text and finetuning dataloaders to a common location; it had previously been duplicated in both places.
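A minimal sketch of the kind of shared validation util described above (the set name and function signature are assumptions for illustration, not the actual llm-foundry code):

```python
import numpy as np

# Assumed set of integer dtypes accepted for token encoding.
SUPPORTED_ENCODING_TYPES = {
    np.uint16, np.uint32, np.uint64,
    np.int16, np.int32, np.int64,
}

def validate_token_encoding(dtype_name: str) -> type:
    """Resolve a dtype name and reject unsupported encoding types."""
    dtype = getattr(np, dtype_name, None)
    if dtype not in SUPPORTED_ENCODING_TYPES:
        raise ValueError(f'Unsupported token encoding type: {dtype_name}')
    return dtype
```

Hoisting this into one importable location lets both dataloaders (and their tests) share a single source of truth for which encodings are legal.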

Ran the following scripts successfully: convert_dataset_hf.py, convert_dataset_json.py, convert_finetuning_dataset.py, convert_text_to_mds.py. Updated data prep readme to have instructions for convert_text_to_mds.py.

Using the Shakespeare text file here, models trained with and without this branch have deterministic loss curves. One set of runs used global batch size 32, the other global batch size 256. See wandb project here.

Foundry regression tests are partially borked right now because of a small bug that's getting addressed in the release branch, but the tests that did run all succeeded. See here.
