huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

common_words.json download issue #6

Closed · jrryzh closed this issue 4 months ago

jrryzh commented 4 months ago

Hi! Thanks for your work! It seems that common_words.json on Google Drive cannot be downloaded; Drive reports that the file appears to be infected with a virus. Hope you can update the link!

HugoLaurencon commented 4 months ago

Hi, and thanks for your interest. The link https://drive.google.com/file/d/1TeydSroOOmlEuxIcwgsJQ2YF4kPJR6N4/view?usp=sharing works fine for me, even in an incognito window (where I'm not logged in). I can even preview the JSON and see what's inside. Can you post a screenshot of your problem?

jrryzh commented 4 months ago

Thanks! Here's what it shows when I try to download the file: [screenshot of the Google Drive warning]

HugoLaurencon commented 4 months ago

Can you try these steps?

https://bytesbin.com/file-infected-with-virus-google-drive/?amp=1

If that doesn't work, I can try to re-upload the file, but the result will likely be the same.

It's just a JSON file containing a dictionary with words as keys and counts as values; I don't know why it's flagged as a virus. You can also inspect it yourself before running the code.
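
If the download does go through, a quick way to inspect the file before using it in the pipeline (a minimal sketch, assuming the file sits in the working directory as common_words.json):

```python
# Sanity-check the downloaded file before running anything with it.
import json

with open("common_words.json") as f:
    common_words = json.load(f)  # a plain dict: word -> count

print(len(common_words), "entries")
print(list(common_words.items())[:10])  # peek at a few word/count pairs
```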

jrryzh commented 4 months ago

Sorry, it didn't work for me. After I changed the URL as instructed, the new URL doesn't seem to work and shows the results below. [screenshot]

Could you send me the file via email? My email address is . It would be a great help, thank you!

HugoLaurencon commented 4 months ago

Well, I can't send it via email either, because it gets flagged as a virus there too. So I'd advise you to recreate the file yourself (a rough sketch follows the steps below):

  1. Start from ~10M documents of any web-scale dataset, e.g. https://huggingface.co/datasets/oscar-corpus/OSCAR-2301
  2. Extract the words by splitting each document on whitespace, then strip punctuation from the extracted words
  3. Count the words and remove the ones that appear only once
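
A rough sketch of these three steps (not the original OBELICS script; the OSCAR config name and its access requirements are assumptions, and any web-scale corpus exposing a "text" field will do):

```python
# Rebuild a common_words.json-style file from a streamed web-scale corpus.
import json
import string
from collections import Counter
from itertools import islice

from datasets import load_dataset

# 1. Start from ~10M documents (streamed to avoid downloading everything).
#    The "en" config and gated access are assumptions about OSCAR-2301.
docs = load_dataset(
    "oscar-corpus/OSCAR-2301", "en", split="train", streaming=True
)

counter = Counter()
for doc in islice(docs, 10_000_000):
    # 2. Split on whitespace and strip punctuation from each word.
    for word in doc["text"].split():
        word = word.strip(string.punctuation)
        if word:
            counter[word] += 1

# 3. Keep only words that appear more than once.
common_words = {word: count for word, count in counter.items() if count > 1}

with open("common_words.json", "w") as f:
    json.dump(common_words, f)
```
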
jrryzh commented 4 months ago

Thank you! Your instructions are clear and seem easy to implement. I have another question I hope you can answer: in the paper, between the raw CC data and the HTML simplification, there is a stage of early text deduplication and quality classification, but I don't find it in the code. Is this part missing, and can you guide me on how to implement the text deduplication and especially the quality classifier? Thanks!

VictorSanh commented 4 months ago

Or put it in an HF dataset; datasets has automatic JSON loading :)

HugoLaurencon commented 4 months ago

Or put it in an HF dataset; datasets has automatic JSON loading

Yes, indeed, I'll upload the file there tomorrow.
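
For reference, a minimal sketch of Victor's suggestion (the repo name is a placeholder, and this isn't necessarily how the file will actually be hosted):

```python
# Turn the word -> count dict into a datasets.Dataset and push it to the Hub.
import json

from datasets import Dataset

with open("common_words.json") as f:
    common_words = json.load(f)

ds = Dataset.from_dict(
    {"word": list(common_words.keys()), "count": list(common_words.values())}
)
ds.push_to_hub("your-username/common_words")  # requires `huggingface-cli login`
```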

I have another question I hope you can answer: in the paper, between the raw CC data and the HTML simplification, there is a stage of early text deduplication and quality classification, but I don't find it in the code. Is this part missing, and can you guide me on how to implement the text deduplication and especially the quality classifier? Thanks!

Yes, it's true that it's not present in this code. In our project, we used the data after this step as a starting point; these operations were done by other engineers at HF.

For the text deduplication, it's the classical MinHash deduplication, and there are implementations online, probably in the datatrove library: https://github.com/huggingface/datatrove
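
For illustration only, a minimal MinHash near-duplicate check using the datasketch library (not the pipeline used here; datatrove has production-grade implementations, and the threshold and tokenization choices below are arbitrary):

```python
# Near-duplicate removal sketch with datasketch (pip install datasketch).
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):  # word-level shingles, for brevity
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on a mat",
    "c": "completely different text",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for doc_id, text in docs.items():
    sig = minhash(text)
    if lsh.query(sig):          # a near-duplicate has already been kept
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)  # e.g. ["a", "c"]
```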

For the quality filtering, to be honest I don't like this step. It's a binary classifier trained with high-quality sources as positive examples and Common Crawl documents as negative examples. However, it removes tons of documents, and we don't really know what it removes (potentially good math or code data?).

I think the bad data would have been caught by our subsequent filters anyway.
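
For completeness, this kind of binary quality classifier is commonly built with fastText; a minimal sketch, with placeholder file names, labels, and thresholds (and with the caveat above that it can discard good data):

```python
# fastText binary quality classifier sketch (pip install fasttext).
# Training data format: one document per line, prefixed with
# "__label__hq " (high-quality source) or "__label__cc " (raw Common Crawl).
import fasttext

model = fasttext.train_supervised(
    input="quality_train.txt", epoch=5, wordNgrams=2
)

text = "some web document text".replace("\n", " ")  # predict() dislikes newlines
labels, probs = model.predict(text)
keep = labels[0] == "__label__hq" and probs[0] > 0.5  # threshold is arbitrary
```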

If you want to filter more, it's better either to do perplexity filtering with small models and a soft threshold to control what you filter (we did this with KenLM models), or to do more aggressive filtering using much better models (like a good LLM) that you know were trained on diverse data.
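
A minimal sketch of the perplexity route with the kenlm Python bindings (the model path and threshold are placeholders; tune the threshold on documents you have actually inspected):

```python
# Soft perplexity filtering sketch with KenLM
# (pip install https://github.com/kpu/kenlm/archive/master.zip).
import kenlm

model = kenlm.Model("path/to/small_lm.binary")  # placeholder n-gram LM path

def perplexity(text):
    # score() returns log10 P(text); +1 accounts for the end-of-sentence token.
    log10_prob = model.score(text, bos=True, eos=True)
    num_tokens = len(text.split()) + 1
    return 10 ** (-log10_prob / num_tokens)

PPL_THRESHOLD = 1_000  # soft threshold: err on the side of keeping documents

def keep(document):
    return perplexity(document) < PPL_THRESHOLD
```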

jrryzh commented 4 months ago

Thank you so much! The information you provided is very helpful!

It's a binary classifier trained with high-quality sources as positive examples and Common Crawl documents as negative examples. However, it removes tons of documents, and we don't really know what it removes (potentially good math or code data?).

I think it is necessary because operating on such a large amount of data is costly, but as you say, it may need to be done better.

I guess I will do the deduplication and a soft perplexity filtering, then try out classification with a good LLM and see how it works. Thank you again for your help, and I look forward to the file being uploaded to HF :)

HugoLaurencon commented 4 months ago

You're welcome. Here you go: https://huggingface.co/datasets/HugoLaurencon/common_words/tree/main
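
For anyone landing here later, a sketch of fetching the file programmatically (the exact filename inside the repo is an assumption; check the repo's file list first):

```python
# Download the file from the Hub and load it (pip install huggingface_hub).
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="HugoLaurencon/common_words",
    filename="common_words.json",  # assumed filename
    repo_type="dataset",
)
with open(path) as f:
    common_words = json.load(f)
```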

jrryzh commented 4 months ago

I can download the file now, thanks ;)!