togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Questions about the quality classifier in common crawl #37

ladit closed this issue 3 months ago

ladit commented 1 year ago

Thank you for your work! I am preprocessing data for another language (zh). I have some questions regarding the provided instructions:

In extracted_urls.txt, we provide 38M URLs that are processed from the Wikipedia dump. We early stop this process to only keep 300K pages.

Regarding the extracted_urls.txt file, how was the decision made to keep only 300K pages out of the 38M URLs processed from the Wikipedia dump? Should I follow the same ratio for the zhwiki-20230420-pages-articles-multistream.xml file, which is smaller than the English one?

We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.

Can you provide more guidance on how to run the pipeline on this file? I have read through the cc_net code and found nothing about Wikipedia processing other than data_prep/cc/cc_net/cc_net/get_wiki_cirrus.py, but that script seems to download from https://dumps.wikimedia.org/other/cirrussearch/current/zhwiki-20230501-cirrussearch-content.json.gz.

python classifier/create_corpus.py > data_train

I notice that the input of create_corpus.py is hard-coded as ["cc_net/data/mined/wikipedia/en_head_0000.json.gz", "cc_net/data/mined/wikipedia/en_middle_0000.json.gz"] (maybe parsing an argument would be better). Can you provide instructions on how to obtain these files?

`for file in glob.glob("common_crawl/*/*/*.gz")` in create_corpus.py

Can you clarify whether it should be run on cc_net/data/mined/{CC_DUMP}/*.gz? The glob here may be ambiguous.

Lastly, I would appreciate it if you could improve the Quality Classifier section in the README and the scripts in data_prep/cc/classifier to make them easier for newcomers to follow. Thank you!

tiendung commented 1 year ago

We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.

I have the same problem finding the source code to convert WARC to WET. Please assist.

newbietuan commented 1 year ago

Hello @ladit, I'm working on the zh language as well. It seems I should download the zh Wikipedia dump and use part of the pipeline to preprocess it. I have similar questions to yours and am confused by the README, and the code seems difficult to debug. Have you solved the problem?

ladit commented 1 year ago

@newbietuan No. I am still waiting for instructions from the contributors.

newbietuan commented 1 year ago

@ladit thanks for your reply. What I want to do is take a paragraph as input and output a quality score. If that just means loading zh.arpa.bin and zh.sp.model and using the code in perplexity.py, then it seems fastText is not needed. This is my first time using these tools, so I'd appreciate any clue.
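My rough understanding so far is something like the following (untested sketch, just my guess at how perplexity.py uses the pretrained zh.sp.model and zh.arpa.bin; please correct me if this is wrong):

```python
# Untested sketch of what I think the perplexity scoring does: tokenize with
# SentencePiece, then score the pieces with the KenLM model.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="zh.sp.model")
lm = kenlm.Model("zh.arpa.bin")

def doc_perplexity(text: str) -> float:
    # Join the SentencePiece pieces with spaces so KenLM sees one token per "word".
    pieces = " ".join(sp.encode(text, out_type=str))
    return lm.perplexity(pieces)

print(doc_perplexity("这是一个测试段落。"))
```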

newbietuan commented 1 year ago

@newbietuan No. I am still waiting for instructions from the contributors.

and most importantly, I noticed this code in tokenizer.py:

```python
def get_tokenizer(self, lang: str) -> Optional[RobustTokenizer]:
    cache = self.tokenizers
    if lang in cache:
        return cache[lang]
    if lang in ("th", "zh", "ja"):
        # TODO find a tokenizer for those languages
        return None
```

but the paper's section 3.4 says 'More precisely, for each language, we train a sentence piece tokenizer (Kudo, 2018) and a language model on data from the targeted domain.'

mauriceweber commented 1 year ago

Hi @ladit, thanks for your questions and apologies for the late answer.

Regarding the extracted_urls.txt file, how was the decision made to keep only 300K pages out of the 38M URLs processed from the Wikipedia dump? Should I follow the same ratio for the zhwiki-20230420-pages-articles-multistream.xml file, which is smaller than the English one?

The important point here is that you have enough samples to train your fastText classifier -- how many URLs do you have in the zh Wikipedia dump?
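For reference, the classifier is a standard supervised fastText model trained on the data_train file that create_corpus.py writes. A minimal sketch (the label names and hyperparameters here are only illustrative, not necessarily the exact ones from our scripts):

```python
# Minimal sketch: train a supervised fastText quality classifier on data_train,
# where each line is "__label__<class> <text>".
import fasttext

model = fasttext.train_supervised(
    input="data_train",
    lr=0.1,
    epoch=3,
    wordNgrams=2,
)
model.save_model("quality_model.bin")

# Score a new paragraph: returns the predicted label and its probability.
print(model.predict("some paragraph of text to score"))
```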

Can you provide more guidance on how to run the pipeline on this file? I have read through the cc_net code and found nothing about Wikipedia processing other than data_prep/cc/cc_net/cc_net/get_wiki_cirrus.py, but that script seems to download from https://dumps.wikimedia.org/other/cirrussearch/current/zhwiki-20230501-cirrussearch-content.json.gz.

For this step you need to run the cc_net pipeline on the warc_wikipedia.warc files to produce .wet files. We are going to provide more instructions on this soon to make it easier in the future; for now, you basically have to change this line in the cc_net pipeline: https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#LL38C16-L38C16
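In the meantime (this also touches on @tiendung's question above), if you just need a quick way to turn warc_wikipedia.warc into WET-style plain-text records, a rough sketch using the warcio library could look like the one below. To be clear, this is not the cc_net code, and the HTML-to-text step is deliberately naive:

```python
# Rough illustration only: read response records from a WARC file and write
# "conversion" records (the WET record type) containing crudely extracted text.
import re
from io import BytesIO

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def html_to_text(html: str) -> str:
    # Naive tag stripping; real WET files use a proper HTML-to-text conversion.
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

with open("warc_wikipedia.warc", "rb") as f_in, open(
    "warc_wikipedia.warc.wet.gz", "wb"
) as f_out:
    writer = WARCWriter(f_out, gzip=True)
    for record in ArchiveIterator(f_in):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read().decode("utf-8", errors="ignore")
        text = html_to_text(html)
        if not text:
            continue
        wet_record = writer.create_warc_record(
            url,
            "conversion",
            payload=BytesIO(text.encode("utf-8")),
            warc_content_type="text/plain",
        )
        writer.write_record(wet_record)
```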

I notice that the input of create_corpus.py is hard-coded as ["cc_net/data/mined/wikipedia/en_head_0000.json.gz", "cc_net/data/mined/wikipedia/en_middle_0000.json.gz"] (maybe parsing an argument would be better). Can you provide instructions on how to obtain these files?

These files are produced by the cc_net pipeline applied to the warc_wikipedia.warc file. Passing an argument is definitely easier, thanks for the suggestion! We will change this in the next version.
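A sketch of what that change could look like (hypothetical flags, not the current script):

```python
# Hypothetical CLI for create_corpus.py: take the mined shards as arguments
# instead of hard-coding the file list. Paths below are only examples.
import argparse

parser = argparse.ArgumentParser(description="Build the fastText training corpus")
parser.add_argument(
    "--wiki-files",
    nargs="+",
    required=True,
    help="high-quality shards, e.g. cc_net/data/mined/wikipedia/en_head_0000.json.gz",
)
parser.add_argument(
    "--cc-glob",
    default="cc_net/data/mined/*/*.gz",
    help="glob over the mined common crawl shards (low-quality class)",
)
args = parser.parse_args()
print(args.wiki_files, args.cc_glob)
```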

Can you clarify whether it should be run on cc_net/data/mined/{CC_DUMP}/*.gz? The glob here may be ambiguous.

You can run it on the mined outputs -- in our case we had multiple dumps, which is the reason for the additional *.
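Concretely, the difference is just one wildcard level (the dump name below is only a placeholder):

```python
import glob

# One dump: the mined shards sit under a single dump directory.
single_dump = glob.glob("cc_net/data/mined/2023-06/*.gz")

# Several dumps (our case): one extra wildcard level covers them all.
all_dumps = glob.glob("cc_net/data/mined/*/*.gz")
```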

hicotton02 commented 1 year ago

@mauriceweber did you ever provide instructions for putting the wet file through the cc_net pipeline?

dataaug commented 11 months ago

Regarding the `get_tokenizer` code in tokenizer.py quoted above, which returns `None` for ("th", "zh", "ja") with a TODO to find a tokenizer for those languages, versus the paper's section 3.4 statement that 'for each language, we train a sentence piece tokenizer (Kudo, 2018) and a language model on data from the targeted domain':

In fact, after reviewing the code, I found that this file is not being used. Instead, the SentencePiece library is used directly.