fancyerii opened this issue 1 year ago
Yes. 😃 Our consideration is that documents can be segmented in advance according to some rules. For instance, books (long data) in RedPajama can be split into chapters/paragraphs before tokenization. I also see your point: you think we should implement a packing dataset that automatically packs short data and splits long data during the tokenization process. I think both are practical implementations.
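For reference, a minimal sketch of the packing idea (illustrative only, not an existing Colossal-AI API): concatenate tokenized documents into one stream and cut it into fixed-length blocks, so short samples are packed together and long ones are split across blocks instead of being truncated.

```python
from typing import Iterable, Iterator

def pack_examples(
    token_streams: Iterable[list[int]],
    block_size: int,
    eos_id: int,
) -> Iterator[list[int]]:
    """Pack tokenized documents into fixed-size blocks.

    Hypothetical helper: short documents are concatenated (separated
    by EOS) and long documents are split across blocks, so no tokens
    are dropped by truncation.
    """
    buffer: list[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)
        # Emit full blocks; the remainder stays in the buffer
        # and is prepended to the next document's tokens.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # The final partial block is dropped here; it could also be padded.
```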
Yes, I think it should be documented clearly that users must segment their inputs, or else their data will be truncated.
Describe the feature
I found that both examples truncate text longer than max_length, so we have to segment long text into shorter pieces. For examples/language/llama2, the code is:
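(The original snippet is not reproduced here; below is a minimal sketch of the kind of tokenization the example performs, assuming standard Hugging Face `transformers` usage. The names are illustrative, not the actual repository code.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize_fn(examples, max_length=4096):
    # truncation=True silently drops every token beyond max_length,
    # so a book-length RedPajama document loses most of its content.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_length,
    )
```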
It is obvious that long text (beyond 2048/4096 tokens) will be truncated, and the default RedPajama dataset contains very long texts.
The code for applications/Colossal-LLaMA-2:
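(Again a sketch rather than the repository code, assuming the tokenized ids are clipped in a separate step after tokenization; the function name is illustrative.)

```python
def post_process(example, max_length=4096):
    # Even if the tokenizer itself does not truncate, the ids and
    # labels are hard-clipped afterwards, so the tail is still lost.
    example["input_ids"] = example["input_ids"][:max_length]
    example["labels"] = example["labels"][:max_length]
    return example
```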
It is also truncated by the post-processing code.