togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars, 335 forks

Thresholds for all quality signals #92

Open torshie opened 6 months ago

torshie commented 6 months ago

After all quality signals are generated, what thresholds are used to classify a document as good or bad for each quality signal?

mauriceweber commented 5 months ago

Hi @torshie, that's a great question. I don't think the community has a settled answer, but as a starting point you can study the thresholds used in the literature (e.g. the Gopher rules or the rules used for RefinedWeb). Data quality and data mixes are an active area of research, and building that understanding is one of the core motivations behind RPv2.
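
As a rough illustration of what applying literature thresholds could look like, here is a minimal Python sketch of range-based filtering. The cut-offs loosely follow the Gopher rules as described by Rae et al. (2021); they are not official RedPajama thresholds. The signal names (`word_count`, `mean_word_length`, and so on) and the flat dictionary layout are hypothetical, so map them to whatever your quality-signal files actually contain.

```python
# Illustrative (low, high) ranges loosely following the Gopher filtering rules.
# These are NOT official RedPajama thresholds, and the keys are hypothetical
# names that must be mapped to the signals you actually computed.
GOPHER_STYLE_RULES = {
    "word_count": (50, 100_000),
    "mean_word_length": (3.0, 10.0),
    "symbol_to_word_ratio": (0.0, 0.1),
    "frac_lines_end_with_ellipsis": (0.0, 0.3),
    "frac_lines_start_with_bullet": (0.0, 0.9),
    "frac_no_alpha_words": (0.0, 0.2),
}


def failed_rules(signals, rules=GOPHER_STYLE_RULES):
    """Return the names of rules whose signal is missing or out of range."""
    failed = []
    for name, (low, high) in rules.items():
        value = signals.get(name)
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed


def is_good(signals, rules=GOPHER_STYLE_RULES):
    """A document is kept only if it passes every rule."""
    return not failed_rules(signals, rules)


# Example with made-up signal values for a single document.
doc_signals = {
    "word_count": 420,
    "mean_word_length": 4.7,
    "symbol_to_word_ratio": 0.01,
    "frac_lines_end_with_ellipsis": 0.0,
    "frac_lines_start_with_bullet": 0.1,
    "frac_no_alpha_words": 0.05,
}
print(is_good(doc_signals), failed_rules(doc_signals))
```

Returning the list of failed rules rather than a bare boolean makes it easier to see which signal is driving rejections, which helps when tuning thresholds on your own data.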

ZhenweiAn commented 1 month ago

I am eager to know the reference thresholds of all signals too!