There was a major bug in the greedy packing code: if two sequences in a row did not fit, a pack consisting entirely of PADDING tokens was emitted. This PR fixes the bug and updates the associated test cases. The issue probably DID NOT impact previous tokenization runs, because by default later steps drop sequences without completions, so these all-padding packs were discarded unless a flag was passed to skip the stage that drops all-prompt sequences.
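To make the failure mode concrete, here is a minimal, hypothetical sketch of greedy packing with the fix applied. The function and variable names are illustrative and are not the project's actual API; the key point is the emptiness guard in `flush`, which the buggy version lacked, so two consecutive non-fitting sequences triggered a flush of an empty buffer and produced an all-PAD pack.

```python
PAD = 0  # illustrative padding token id

def greedy_pack(sequences, max_len):
    """Greedily pack token sequences into fixed-length buffers.

    Assumes each individual sequence fits within max_len.
    """
    packs, current = [], []

    def flush():
        # FIX: only emit a pack when the buffer is non-empty. Without
        # this guard, flushing twice in a row (two consecutive
        # sequences that did not fit) emitted a pack of all PAD tokens.
        if current:
            packs.append(current + [PAD] * (max_len - len(current)))
            current.clear()  # reset in place so the closure sees it

    for seq in sequences:
        if len(current) + len(seq) > max_len:
            flush()
        current.extend(seq)
    flush()
    return packs
```

For example, `greedy_pack([[1, 1, 1], [2, 2], [3, 3, 3]], max_len=4)` flushes twice mid-stream; with the guard in place, every emitted pack contains at least one real token.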
PR Checklist
[x] My PR is less than 500 lines of code
[x] I have added sufficient comments and docstrings in my code
[x] I have made corresponding changes to the documentation
[ ] I have written unit-tests to test all of my code