huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0
171 stars, 9 forks

Is the tot_counter saved twice in this code snippet? #9

Open haiqiang2017 opened 1 month ago

haiqiang2017 commented 1 month ago

tot_counter = Counter()
for counter in tqdm(all_counters):
    tot_counter.update(counter)

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

command_sync_s3 = (
    "aws s3 cp /scratch/tot_image_urls_in_web_document_dataset_filtered.pickle"
    " s3://m4-datasets/webdocs/tot_image_urls_in_web_document_dataset_filtered.pickle"
)
os.system(command_sync_s3)
os.system(command_sync_s3)
os.system(command_sync_s3)

tot_image_urls_in_web_document_dataset_filtered_too_duplicated = [
    k for k, v in tot_counter.items() if v > THRESHOLD_TOO_DUPLICATED
]

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered_too_duplicated.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

Is tot_counter saved twice in this code snippet? Also, tot_image_urls_in_web_document_dataset_filtered_too_duplicated is computed but never used.

HugoLaurencon commented 1 month ago

From which file did you get this?

haiqiang2017 commented 1 month ago

@HugoLaurencon The code is from this file: [OBELICS]main/build_obelics/06_02_merge_sets_image_urls_in_webdocs.py

HugoLaurencon commented 1 month ago

Yes, you should probably replace tot_counter with tot_image_urls_in_web_document_dataset_filtered_too_duplicated in the second pickle.dump call.
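
For reference, a minimal sketch of what that change might look like, run on toy data. The function name `merge_and_save`, the shortened file names, the temporary output directory, and the threshold value of 2 are assumptions made for illustration and are not taken from the repository; the S3 sync step is left out.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical value for illustration; the real threshold is defined in the script.
THRESHOLD_TOO_DUPLICATED = 2


def merge_and_save(all_counters, out_dir):
    """Merge per-shard URL counters, then save the totals and the over-threshold URLs."""
    tot_counter = Counter()
    for counter in all_counters:
        tot_counter.update(counter)

    out_dir = Path(out_dir)
    with open(out_dir / "tot_image_urls_filtered.pickle", "wb") as f:
        pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

    too_duplicated = [k for k, v in tot_counter.items() if v > THRESHOLD_TOO_DUPLICATED]

    with open(out_dir / "tot_image_urls_filtered_too_duplicated.pickle", "wb") as f:
        # Dump the filtered list here, not tot_counter a second time.
        pickle.dump(too_duplicated, f, pickle.HIGHEST_PROTOCOL)

    return tot_counter, too_duplicated


# Toy counters standing in for the per-shard results.
all_counters = [
    Counter({"a.jpg": 2, "b.jpg": 1}),
    Counter({"a.jpg": 2, "c.jpg": 1}),
]

with tempfile.TemporaryDirectory() as tmp:
    tot, dup = merge_and_save(all_counters, tmp)
    with open(Path(tmp) / "tot_image_urls_filtered_too_duplicated.pickle", "rb") as f:
        loaded = pickle.load(f)
```

With this change the second pickle holds only the list of URLs whose count exceeds the threshold, while the first still holds the full counter.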

haiqiang2017 commented 1 month ago

Thanks, that change solves the problem. @HugoLaurencon