Summary

There are a lot of metadata details about the dataset that would be useful to know, but extracting them from the HDF5 files would require writing a non-trivial Python script. So let's keep track of all this metadata during tokenization and print it, so the user can learn more about their resulting dataset.
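For context, here is a minimal sketch of the kind of bookkeeping this PR adds. The class, field, and method names below are hypothetical illustrations, not the actual API introduced by this PR:

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Running statistics collected while writing the HDF5 dataset (hypothetical names)."""
    num_sequences: int = 0
    num_tokens: int = 0
    num_pad_tokens: int = 0

    def update(self, seq_len: int, pad_len: int) -> None:
        """Record one packed sequence and the padding appended to it."""
        self.num_sequences += 1
        self.num_tokens += seq_len + pad_len
        self.num_pad_tokens += pad_len

    def summary(self) -> str:
        pad_frac = self.num_pad_tokens / max(self.num_tokens, 1)
        return (
            f"sequences:        {self.num_sequences}\n"
            f"total tokens:     {self.num_tokens}\n"
            f"padding fraction: {pad_frac:.2%}"
        )


if __name__ == "__main__":
    meta = DatasetMetadata()
    meta.update(seq_len=1900, pad_len=148)
    meta.update(seq_len=2048, pad_len=0)
    print(meta.summary())
```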
What the output looks like
Things to note:
I will add a follow-up PR to write this information to a log file as well, so that it is not lost in the terminal.
There was a major bug in the code during greedy packing, which meant that if two sequences in a row did not fit, the packer would emit a sequence of all PADDING tokens. This PR fixes the bug and updates the associated test cases. The issue probably DID NOT impact previous tokenization runs, because by default a later step drops sequences without completions; these all-padding sequences would therefore have been dropped, unless a flag was passed in to skip that all-prompt-sequence-dropping stage.
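To make the failure mode concrete, here is a minimal sketch of a greedy packer with the guard that constitutes the fix. This illustrates the bug pattern only; it is not the repository's actual packing code, and the drop-oversized-sequences policy is an assumption:

```python
def greedy_pack(sequences, max_len, pad_id=0):
    """Greedily pack token sequences into fixed-length rows."""
    bins, current = [], []

    def flush():
        # The fix: only emit a row if it actually contains real tokens.
        # The buggy version padded and appended `current` unconditionally,
        # so two non-fitting sequences in a row could flush an *empty*
        # `current`, producing a row made entirely of PADDING tokens.
        if current:
            bins.append(current + [pad_id] * (max_len - len(current)))

    for seq in sequences:
        if len(current) + len(seq) > max_len:  # seq does not fit this row
            flush()
            current = []
        if len(seq) <= max_len:
            current = current + list(seq)
        # else: seq is longer than max_len and is dropped (assumed policy)
    flush()
    return bins


# Two consecutive over-length sequences previously yielded an all-padding row;
# with the guard, only the real row survives: [[1, 2, 0, 0]]
print(greedy_pack([[1, 2], [3, 4, 5, 6, 7], [8, 9, 10, 11, 12]], max_len=4))
```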
I updated the README with a table describing each metric, including its definition and the intuition behind it.
PR Checklist
- [x] My PR is less than 500 lines of code
- [x] I have added sufficient comments as docstrings in my code
- [x] I have made corresponding changes to the documentation
- [ ] I have written unit-tests to test all of my code