Summary

There are a lot of metadata details about the dataset that would be useful to know, but extracting them from the HDF5 files would require writing a non-trivial Python script. So let's keep track of all this metadata during tokenization and print it, so the user can learn more about their resulting dataset.
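For context, here is a minimal sketch of the kind of bookkeeping this PR adds. The class, field, and method names below are hypothetical illustrations, not the actual API introduced by this PR:

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Running statistics collected while writing the HDF5 dataset (hypothetical names)."""
    num_sequences: int = 0
    num_tokens: int = 0
    num_pad_tokens: int = 0

    def update(self, seq_len: int, pad_len: int) -> None:
        """Record one packed sequence and the padding appended to it."""
        self.num_sequences += 1
        self.num_tokens += seq_len + pad_len
        self.num_pad_tokens += pad_len

    def summary(self) -> str:
        pad_frac = self.num_pad_tokens / max(self.num_tokens, 1)
        return (
            f"sequences:        {self.num_sequences}\n"
            f"total tokens:     {self.num_tokens}\n"
            f"padding fraction: {pad_frac:.2%}"
        )


if __name__ == "__main__":
    meta = DatasetMetadata()
    meta.update(seq_len=1900, pad_len=148)
    meta.update(seq_len=2048, pad_len=0)
    print(meta.summary())
```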
What the output looks like
Things to note:
I will add a follow-up PR to write this information to a log file as well, so that it is not lost in the terminal.
There was a major bug in the code during greedy packing, which meant that if two sequences in a row did not fit, the packer would emit a sequence of all PADDING tokens. This PR fixes the bug and updates the associated test cases. The issue probably DID NOT impact previous tokenization runs, because by default a later step drops sequences without completions; these all-padding sequences would therefore have been dropped, unless a flag was passed in to skip that all-prompt-sequence-dropping stage.
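To make the failure mode concrete, here is a minimal sketch of a greedy packer with the guard that constitutes the fix. This illustrates the bug pattern only; it is not the repository's actual packing code, and the drop-oversized-sequences policy is an assumption:

```python
def greedy_pack(sequences, max_len, pad_id=0):
    """Greedily pack token sequences into fixed-length rows."""
    bins, current = [], []

    def flush():
        # The fix: only emit a row if it actually contains real tokens.
        # The buggy version padded and appended `current` unconditionally,
        # so two non-fitting sequences in a row could flush an *empty*
        # `current`, producing a row made entirely of PADDING tokens.
        if current:
            bins.append(current + [pad_id] * (max_len - len(current)))

    for seq in sequences:
        if len(current) + len(seq) > max_len:  # seq does not fit this row
            flush()
            current = []
        if len(seq) <= max_len:
            current = current + list(seq)
        # else: seq is longer than max_len and is dropped (assumed policy)
    flush()
    return bins


# Two consecutive over-length sequences previously yielded an all-padding row;
# with the guard, only the real row survives: [[1, 2, 0, 0]]
print(greedy_pack([[1, 2], [3, 4, 5, 6, 7], [8, 9, 10, 11, 12]], max_len=4))
```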
I updated the README with a table describing each metric, including its definition and the intuition behind it.
PR Checklist
- [x] My PR is less than 500 lines of code
- [x] I have added sufficient comments as docstrings in my code
- [x] I have made corresponding changes to the documentation
- [ ] I have written unit-tests to test all of my code