> We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.
I have the same problem finding the source code to convert WARC to WET. Please assist.
Hello @ladit, I'm working on the zh language. It seems I should download the zh Wikipedia dump and use part of the pipeline to preprocess it. There are similar questions to yours, but I am confused by the README and the code seems difficult to debug. Have you solved the problem?
@newbietuan No. I am still waiting for instructions from the contributors.
@ladit Thanks for your reply. What I want to do is take a paragraph as input and output a quality score. If that means I just need to load zh.arpa.bin and zh.sp.model and use the code in perplexity.py, then it seems fastText is not needed. This is my first time using this, so I would appreciate any clue.
And most importantly, I noticed this code in tokenizer.py:

```python
def get_tokenizer(self, lang: str) -> Optional[RobustTokenizer]:
    cache = self.tokenizers
    if lang in cache:
        return cache[lang]
    if lang in ("th", "zh", "ja"):
        # TODO find a tokenizer for those languages
        return None
```

But Section 3.4 of the paper says: 'More precisely, for each language, we train a sentence piece tokenizer (Kudo, 2018) and a language model on data from the targeted domain.'
Hi @ladit, thanks for your questions and apologies for the late answer.
> Regarding the `extracted_urls.txt` file, how was the decision made to keep only 300K pages out of the 38M URLs processed from the Wikipedia dump? Should I follow the same ratio for the `zhwiki-20230420-pages-articles-multistream.xml` file, which is smaller than the English one?
The important point here is that you have enough samples to train your fastText classifier -- how many URLs do you have in the zh Wikipedia dump?
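For orientation, the kind of classifier meant here can be trained with the fastText Python API. The label names, file paths, and hyperparameters below are illustrative assumptions, not the repo's exact scripts:

```python
# Sketch: train a quality classifier that separates Wikipedia-referenced
# pages (positives) from random Common Crawl pages (negatives).
# train.txt holds one document per line, prefixed with __label__wiki or __label__cc.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",   # hypothetical training file
    lr=0.1,
    epoch=5,
    wordNgrams=2,
)
model.save_model("zh_quality_classifier.bin")

# Score a new document: returns the top label and its probability.
labels, probs = model.predict("some candidate paragraph of zh text", k=1)
print(labels[0], float(probs[0]))
```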
> Can you provide more guidance on how to run the pipeline on this file? I have read through the `cc_net` code and found nothing about Wikipedia processing other than `data_prep/cc/cc_net/cc_net/get_wiki_cirrus.py`, but that seems to download from https://dumps.wikimedia.org/other/cirrussearch/current/zhwiki-20230501-cirrussearch-content.json.gz.
For this step you need to run the cc_net pipeline on the warc_wikipedia.warc files to produce .wet files. You basically have to change this line in the cc_net pipeline: https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#LL38C16-L38C16
We are going to provide more instructions on this soon, to make it easier in the future.
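To make the intent of that change concrete, here is a sketch only: cc_net resolves WET segments from a root location in process_wet_file.py (the constant name WET_URL_ROOT exists there, though whether the linked line is exactly this one is an assumption), and the idea is to make it point at the locally produced warc_wikipedia.warc.wet instead of the Common Crawl bucket:

```python
# cc_net/process_wet_file.py (sketch, not a verified patch)

# default (roughly): segments are fetched from the Common Crawl bucket
# WET_URL_ROOT = "https://data.commoncrawl.org"

# local override so segment paths resolve to the Wikipedia WET files on disk;
# the file:// layout is a hypothetical example, adapt to how you store them
WET_URL_ROOT = "file:///path/to/wikipedia_wet"  # directory containing warc_wikipedia.warc.wet
```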
> I notice that the input of `create_corpus.py` is `["cc_net/data/mined/wikipedia/en_head_0000.json.gz", "cc_net/data/mined/wikipedia/en_middle_0000.json.gz"]` (maybe parsing an argument is better). Can you provide instructions on how to obtain these files?
These files are produced by the cc_net pipeline applied to the warc_wikipedia.warc file. Passing an argument is definitely easier, thanks for the suggestion! We will change this in the next version.
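As a rough illustration of both points (the CLI shape is a suggestion, and the raw_content field name follows what cc_net writes in its mined output, so treat both as assumptions rather than the actual create_corpus.py interface), the mined shards are gzipped JSON-lines files and could be consumed like this:

```python
# Sketch: accept mined shards on the command line and stream their documents.
import argparse
import gzip
import json

parser = argparse.ArgumentParser(description="Build a classifier corpus from mined shards")
parser.add_argument(
    "inputs", nargs="+",
    help="e.g. cc_net/data/mined/wikipedia/en_head_0000.json.gz",
)
args = parser.parse_args()

for path in args.inputs:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)              # one JSON document per line
            text = doc.get("raw_content", "")   # field name as written by cc_net
            # ...feed `text` into the corpus-building step here...
```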
> Can you clarify whether it should be run on `cc_net/data/mined/{CC_DUMP}/*.gz`? The glob here may be ambiguous.
You can run it on the mined outputs -- in our case we had multiple dumps, which is the reason for the additional *.
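In other words, the extra * only walks over every dump directory. A tiny sketch, with the directory layout assumed:

```python
# Sketch: gather mined shards across however many CC dumps were processed.
from glob import glob

shards = sorted(glob("cc_net/data/mined/*/*.json.gz"))  # one sub-directory per CC_DUMP
print(f"found {len(shards)} mined shards")
```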
@mauriceweber did you ever provide instructions for putting the wet file through the cc_net pipeline?
Regarding the tokenizer question above: after reviewing the code, I found that tokenizer.py is not actually used. Instead, the SentencePiece library is used directly.
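For anyone who, like @newbietuan above, just wants a quality score for a single paragraph from zh.sp.model and zh.arpa.bin, the direct route is SentencePiece plus KenLM. Here is a minimal sketch that mirrors, but is not, the repo's perplexity.py logic; the file paths and the length normalization are assumptions:

```python
# Sketch: tokenize with SentencePiece, score with KenLM, convert to perplexity.
import kenlm                      # pip install kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("zh.sp.model")            # path is an assumption
lm = kenlm.Model("zh.arpa.bin")   # path is an assumption

def paragraph_perplexity(paragraph: str) -> float:
    log10_score, length = 0.0, 0
    for line in paragraph.splitlines():
        pieces = " ".join(sp.encode_as_pieces(line))
        log10_score += lm.score(pieces)    # KenLM returns a log10 probability
        length += len(pieces.split()) + 1  # +1 for the end-of-sentence token
    return 10.0 ** (-log10_score / max(length, 1))

print(paragraph_perplexity("这是一个测试段落。"))
```

Lower perplexity means the paragraph looks more like the Wikipedia-domain training data; the fastText classifier is a separate, complementary quality signal.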
Thank you for your work! I am preprocessing for another language (zh). I have some questions regarding the provided instructions:

1. Regarding the `extracted_urls.txt` file, how was the decision made to keep only 300K pages out of the 38M URLs processed from the Wikipedia dump? Should I follow the same ratio for the `zhwiki-20230420-pages-articles-multistream.xml` file, which is smaller than the English one?
2. Can you provide more guidance on how to run the pipeline on this file? I have read through the `cc_net` code and found nothing about Wikipedia processing other than `data_prep/cc/cc_net/cc_net/get_wiki_cirrus.py`, but that seems to download from https://dumps.wikimedia.org/other/cirrussearch/current/zhwiki-20230501-cirrussearch-content.json.gz.
3. I notice that the input of `create_corpus.py` is `["cc_net/data/mined/wikipedia/en_head_0000.json.gz", "cc_net/data/mined/wikipedia/en_middle_0000.json.gz"]` (maybe parsing an argument is better). Can you provide instructions on how to obtain these files?
4. Can you clarify whether it should be run on `cc_net/data/mined/{CC_DUMP}/*.gz`? The glob here may be ambiguous.
5. Lastly, I would appreciate it if you could improve the Quality Classifier section in the README and the scripts in `data_prep/cc/classifier` to make it easier for newcomers to follow. Thank you!