I modified the script to utilize data classes, JSON serialization, and the tqdm library, ensuring a seamless and informative data download process. It also offers options to specify data sizes, splits, and target example counts. (cool, cool!)
Little list of changes:
Added a data class (ChatData) for structuring GPT-related data.
Implemented a JSON encoder (ChatDataEncoder) for custom serialization.
Created a class (GPTData) to manage data download, processing, and saving.
Introduced methods for validating data sizes and splits.
Utilized tqdm for a progress bar during data download.
Provided options for truncating data based on a target example count.
It works perfectly—I've tested all sizes and splits. I also tried various example sizes and all in general. It worked flawlessly on my local machine (Linux).
I modified the script to utilize data classes, JSON serialization, and the
tqdm
library, ensuring a seamless and informative data download process. It also offers options to specify data sizes, splits, and target example counts. (cool, cool!)Little list of changes:
ChatData
) for structuring GPT-related data.ChatDataEncoder
) for custom serialization.GPTData
) to manage data download, processing, and saving.tqdm
for a progress bar during data download.Usage (I thought this was necessary, soooo):
Testing:
It works perfectly—I've tested all sizes and splits. I also tried various example sizes and all in general. It worked flawlessly on my local machine (Linux).