sambanova / generative_data_prep

Apache License 2.0

Greedy Packing All Padding Tokens Bug Fix #46

Closed snova-virens closed 1 year ago

snova-virens commented 1 year ago

Summary

Greedy packing had a major bug: if two sequences in a row did not fit, it added a sequence consisting entirely of PADDING tokens. This PR fixes the bug and updates the associated test cases. The issue probably DID NOT impact previous tokenization runs, because by default a later step drops sequences without completions. The all-padding sequences were therefore dropped unless a flag was passed to skip the stage that drops all-prompt sequences.
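
For illustration only, here is a minimal sketch of how this kind of bug can arise in a greedy packer. It is not the repository's code: `greedy_pack`, `PAD`, and the `guard_empty` switch are hypothetical names, and dropping sequences longer than `max_len` is an assumption made to keep the example short.

```python
from typing import List

PAD = 0  # hypothetical padding token id, not taken from the repository


def greedy_pack(
    sequences: List[List[int]], max_len: int, guard_empty: bool = True
) -> List[List[int]]:
    """Greedily pack token sequences into rows of exactly max_len tokens.

    With guard_empty=False the sketch reproduces the failure mode described
    above: an empty buffer is flushed and becomes a row of only PAD tokens.
    """
    packed: List[List[int]] = []
    current: List[int] = []

    for seq in sequences:
        if len(current) + len(seq) <= max_len:
            # The sequence fits in the row being built.
            current.extend(seq)
            continue

        # The sequence does not fit, so flush the row built so far.
        # The guard skips the flush when nothing has been accumulated.
        if current or not guard_empty:
            packed.append(current + [PAD] * (max_len - len(current)))

        # Start a new row. Assumption: a sequence longer than max_len is dropped.
        current = list(seq) if len(seq) <= max_len else []

    if current:
        packed.append(current + [PAD] * (max_len - len(current)))
    return packed


# Two consecutive sequences that do not fit (both longer than max_len).
example = [[1, 2, 3], [5] * 5, [6] * 5, [7, 8]]
buggy_rows = greedy_pack(example, max_len=4, guard_empty=False)
fixed_rows = greedy_pack(example, max_len=4, guard_empty=True)
```

In this sketch `buggy_rows` contains a row made up only of `PAD`, while `fixed_rows` does not. The guard on the flush is one possible remedy and may differ from the change made in this PR.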

PR Checklist