Filtering on Document Length

togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0

Filtering on Document Length #118

Open karan-dalal opened 1 month ago

karan-dalal commented 1 month ago

Hi there, thank you for this great release!

I'm wondering whether the quality filter can be used to filter documents by length. For example, I'm looking to assemble a dataset where each sequence is between 64k and 128k tokens long.

Is this easily configurable in the quality filter?
Would this filter be applied before or after tokenization?
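
To make the question concrete, here is a minimal sketch of the kind of filter I have in mind. It is not based on the repository's code: it assumes documents arrive as JSON lines with a hypothetical `text` field, and the names `filter_by_length` and `whitespace_length` are made up for illustration. The length function is pluggable, so the same bounds could be applied to a whitespace word count (before tokenization) or to a tokenizer's output length (after tokenization).

```python
import json
from typing import Callable, Iterable, Iterator


def whitespace_length(text: str) -> int:
    """Crude pre-tokenization length proxy: count whitespace-separated words."""
    return len(text.split())


def filter_by_length(
    lines: Iterable[str],
    min_len: int,
    max_len: int,
    length_fn: Callable[[str], int] = whitespace_length,
    text_field: str = "text",  # hypothetical field name, adjust to the real schema
) -> Iterator[dict]:
    """Yield documents whose length falls within [min_len, max_len], inclusive."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        doc = json.loads(line)
        n = length_fn(doc.get(text_field, ""))
        if min_len <= n <= max_len:
            yield doc
```

Passing a different `length_fn`, for example one that returns the number of token IDs produced by whichever tokenizer is in use, would apply the 64k to 128k bounds after tokenization instead of before.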

Thank you for your help.