I'm wondering whether the quality filter can be used to filter out documents below a certain length. For example, I'd like to assemble a dataset where each sequence is between 64k and 128k tokens of context.
Is this easily configurable in the quality filter?
Would this filter be applied before or after tokenization?
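For context, here is a rough sketch of the kind of filter I have in mind (purely illustrative: the whitespace tokenizer is a toy stand-in for a real subword tokenizer, and none of these names come from your library):

```python
# Toy sketch of a token-length filter over a corpus of documents.
# Assumes length is measured *after* tokenization; a real pipeline would
# swap in an actual tokenizer in place of this whitespace splitter.

class WhitespaceTokenizer:
    """Toy tokenizer: splits on whitespace. Real tokenizers are subword-based."""
    def encode(self, text):
        return text.split()

def within_length_range(text, tokenizer, min_tokens=64_000, max_tokens=128_000):
    """Keep a document only if its token count falls in [min_tokens, max_tokens]."""
    n = len(tokenizer.encode(text))
    return min_tokens <= n <= max_tokens

tokenizer = WhitespaceTokenizer()
docs = ["short doc", "word " * 70_000, "word " * 200_000]
kept = [d for d in docs if within_length_range(d, tokenizer)]
# Only the ~70k-token document survives the 64k-128k window.
```

If the quality filter runs before tokenization, I assume I'd have to approximate this with a character- or word-count threshold instead, which is why I'm asking about the ordering.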
Thanks again for this great release!
Thank you for your help.