sambanova / generative_data_prep

Apache License 2.0

Track Metrics About Tokenization Results #44

Closed snova-zoltanc closed 1 year ago

snova-zoltanc commented 1 year ago

Summary

There are a lot of metadata details about the dataset that would be useful to know, but extracting them from the HDF5 files would require writing a non-trivial Python script. So let's keep track of all this metadata and print it so the user can learn more about their resulting dataset.
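As a rough illustration of the kind of bookkeeping this means, here is a minimal sketch that tallies a few counts over already-packed sequences and prints them as a plain-text table. The function name `summarize_sequences`, the metric names, and the padding id `0` are all assumptions made for the example; they are not the metrics or identifiers this PR actually adds.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def summarize_sequences(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, Number]:
    """Tally simple counts over packed token sequences (hypothetical metric names)."""
    total_tokens = sum(len(seq) for seq in sequences)
    padding_tokens = sum(tok == pad_id for seq in sequences for tok in seq)
    return {
        "sequences": len(sequences),
        "total_tokens": total_tokens,
        "padding_tokens": padding_tokens,
        "non_padding_tokens": total_tokens - padding_tokens,
        # Guard against division by zero for an empty dataset.
        "padding_fraction": padding_tokens / total_tokens if total_tokens else 0.0,
    }


def format_metrics(metrics: Dict[str, Number]) -> str:
    """Render the metrics as a two-column plain-text table."""
    width = max(len(name) for name in metrics)
    return "\n".join(f"{name.ljust(width)} | {value}" for name, value in metrics.items())


if __name__ == "__main__":
    demo = [[1, 2, 3, 0], [4, 0, 0, 0]]  # made-up token ids, 0 = padding
    print(format_metrics(summarize_sequences(demo)))
```

Collecting these counts while tokenizing means they do not have to be recovered from the HDF5 output afterwards.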

What the output looks like

image

Things to note:

  1. I will add a follow-up PR to write this to a log file, so this information is not lost in the terminal.
  2. There was a major bug in the greedy packing code: if two sequences in a row did not fit, it would add a sequence consisting entirely of PADDING tokens. This PR fixes the bug and the associated test cases. The issue probably DID NOT impact previous tokenization runs, because by default a later step drops sequences without completions, so these all-padding sequences were dropped unless a flag was passed to skip that stage.
  3. I updated the README with a table that describes the metrics with the definition and intuition behind them.
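The failure mode in item 2 can be sketched as follows. This is not the repository's packing code, only a guess at how such a bug could arise: if the packer flushes its buffer whenever a sequence does not fit, two misfits in a row flush an empty buffer, which pads out to an all-padding sequence. The name `greedy_pack`, the padding id `0`, and the choice to drop over-length sequences are assumptions made for the example.

```python
from typing import List


def greedy_pack(sequences: List[List[int]], max_len: int, pad_id: int = 0) -> List[List[int]]:
    """Greedily pack token sequences into rows of exactly max_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []

    def flush() -> None:
        # Only emit a row if the buffer holds real tokens. Without this
        # guard, flushing an empty buffer would emit max_len padding tokens.
        if current:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current.clear()

    for seq in sequences:
        if len(current) + len(seq) > max_len:
            flush()
            if len(seq) > max_len:
                # Assumption for this sketch: over-length sequences are dropped.
                continue
        current.extend(seq)

    flush()
    return packed


if __name__ == "__main__":
    # Two over-length sequences in a row follow a short one.
    rows = greedy_pack([[1, 2, 3], [4, 5, 6, 7, 8], [9, 9, 9, 9, 9], [4]], max_len=4)
    print(rows)
```

With the `if current` guard removed, the second over-length sequence in the demo would trigger a flush of an empty buffer and produce a row of four padding tokens between the two real rows.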

PR Checklist

snova-zoltanc commented 1 year ago

image

Create a nice table for tracking dataset metrics.

snova-ranjanl commented 1 year ago

@snova-zoltanc actually, your additional_dependency for tabulate is fine to be in the pre-commit config file.