togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Train a new wikiref model #91

Closed torshie closed 5 months ago

torshie commented 7 months ago

I want to apply this pipeline to a new language, but I cannot find a Wikipedia reference classifier model for that language.

How was the English Wikipedia reference model trained? Are there any docs, links, or suggestions?

Thanks.

mauriceweber commented 7 months ago

Hi @torshie, we used the Wikipedia reference classifier from RedPajama-v1. To train such a classifier for another language, you can use the code in the rp_v1 branch: in data_prep/cc/classifier/ you will find the code for training the Wikipedia references classifier.
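
For orientation, here is a minimal sketch of how training data for such a binary classifier could be laid out for a new language. It assumes a fastText-style supervised format (one example per line, prefixed with `__label__<name>`). The label names `wiki` and `cc`, the helper names, and the split ratio are illustrative assumptions, not taken from the rp_v1 scripts, which may work differently.

```python
import random


def to_fasttext_line(text, label):
    """Format one document as a fastText supervised example.

    fastText expects a single line per example, so all whitespace
    (including newlines) is collapsed to single spaces.
    """
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"


def build_corpus(reference_docs, other_docs, valid_fraction=0.1, seed=0):
    """Label, shuffle, and split documents into train/validation lines.

    reference_docs: pages cited by Wikipedia in the target language
                    (hypothetical label "wiki").
    other_docs:     randomly sampled web pages (hypothetical label "cc").
    """
    lines = [to_fasttext_line(t, "wiki") for t in reference_docs]
    lines += [to_fasttext_line(t, "cc") for t in other_docs]
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_fraction)
    return lines[n_valid:], lines[:n_valid]


if __name__ == "__main__":
    train, valid = build_corpus(
        ["Ein Artikel,\nauf den Wikipedia verweist."] * 9,
        ["Zufällige Webseite mit Werbung."] * 11,
    )
    print(len(train), len(valid))
    print(train[0])
```

If the two lists are written to files, they could then be passed to a supervised text classifier, for example the `fasttext` Python package's `train_supervised(input=...)`; check the rp_v1 code for the settings actually used there.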