huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

nsfw filtered texts only file missing at step 08_01 #10

Closed: shaharukhkhan4350 closed this issue 1 month ago

shaharukhkhan4350 commented 1 month ago

Hi @HugoLaurencon !

Thanks for providing the code! Really helpful!

I am trying to reproduce the pipeline, but I cannot find the dataset referenced on this line:

PATH_WEB_DOCS_S3 = "s3://m4-datasets/webdocs/web_document_dataset_filtered_imgurldedup_nsfwfiltered_texts_only/"

As far as I can see, this is the output of step 07_03.

I am not sure how to get the web_document_dataset_filtered_imgurldedup_nsfwfiltered_texts_only dataset. Did I miss anything?

Thanks for your response in advance!

@shubhamagarwal92

HugoLaurencon commented 1 month ago

Hi, thanks!

I think the output of 07_03 is actually:

PATH_SAVE_S3_WEB_DOCS_NSFW_FILTERED = os.path.join(
    "s3://m4-datasets/webdocs/web_document_dataset_filtered_imgurldedup_nsfwfiltered", str(IDX_JOB)
)

In that case, you can just replace the images with None to obtain the texts-only dataset.
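A minimal sketch of that replacement, assuming each web document is a dict with an "images" list aligned with a "texts" list. The column names and the helper name drop_images are assumptions for illustration, not taken from the pipeline code:

```python
def drop_images(example):
    """Return a copy of the document with every image slot set to None.

    Assumes (hypothetically) that `example` is a dict with an "images" list;
    all other fields are passed through unchanged.
    """
    return {**example, "images": [None] * len(example["images"])}


# Toy document: text and image slots alternate, as in an interleaved document.
doc = {
    "texts": ["first paragraph", None, "second paragraph"],
    "images": [None, b"<image bytes>", None],
}
texts_only_doc = drop_images(doc)
```

If the shards are loaded with the Hugging Face datasets library, the same function could be applied with Dataset.map(drop_images) before saving the result under the texts_only path. The column types may need to be declared explicitly once every image is None.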

shaharukhkhan4350 commented 1 month ago

Thanks @HugoLaurencon !

Closing the issue