togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars, 346 forks

Filtering on Document Length #118

Open karan-dalal opened 1 month ago

karan-dalal commented 1 month ago

Hi there, thank you for this great release!

I'm wondering whether the quality filter can be used to filter out documents under a certain length. For example, I'm looking to assemble a dataset where each sequence has a context length between 64k and 128k.
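
To make the request concrete, here is a rough sketch of the kind of length filter described above. It is not the repository's own API. It assumes each document is one JSON object per line with its text in a field named `raw_content` (an assumption, passed in as `text_field`), and it uses a whitespace word count as a stand-in for a real tokenizer. The names `doc_length` and `filter_by_length` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# Target range from the question; the unit here is whitespace-separated
# words, which is only a proxy for tokenizer tokens.
MIN_LEN = 64_000
MAX_LEN = 128_000


def doc_length(text: str) -> int:
    """Return a crude length estimate: the number of whitespace-separated words."""
    return len(text.split())


def filter_by_length(
    lines: Iterable[str],
    min_len: int = MIN_LEN,
    max_len: int = MAX_LEN,
    text_field: str = "raw_content",  # assumed field name, adjust to your data
) -> Iterator[Dict[str, Any]]:
    """Yield parsed JSONL documents whose length lies in [min_len, max_len]."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        n = doc_length(doc.get(text_field, ""))
        if min_len <= n <= max_len:
            yield doc
```

In practice `lines` could be an open file handle over a `.jsonl` shard, and `doc_length` could be swapped for a call to the tokenizer of the target model so the bounds are expressed in actual tokens.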

Thank you for your help.