rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Implement sanity check for hash value #346

Closed: geroldmeisinger closed this issue 10 months ago

geroldmeisinger commented 1 year ago

laion2b-en-aesthetics65 contains a "hash" column (int64). According to https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6.5plus/discussions/3 this column is useless for verification: MD5 uses 128 bits and SHA256 uses 256 bits, so neither fits in an int64.

Please implement a sanity check that runs when compute_hash and verify_hash are used and tests whether verification could work at all (an int64 value cannot hold a 128-bit hash). Also print an error message when a lot of hashes fail, for example: "A lot of hashes seem to fail, which is unusual. Please check your arguments and make sure you are using the correct hash algorithm."
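A minimal sketch of what such a check could look like, using only the Python standard library. This is not img2dataset's implementation: `expected_hex_length`, `looks_like_digest`, `HashFailureMonitor`, and the threshold values are hypothetical names and defaults chosen for illustration, and the sketch assumes the expected hashes are stored as hex strings.

```python
import hashlib
import string

HEX_CHARS = set(string.hexdigits)


def expected_hex_length(algorithm: str) -> int:
    """Number of hex characters a digest of this algorithm has (md5 -> 32)."""
    return hashlib.new(algorithm).digest_size * 2


def looks_like_digest(value, algorithm: str) -> bool:
    """Return True if value could be a hex digest of the given algorithm."""
    if not isinstance(value, str):
        # An int64 column, for example, can never hold a 128-bit MD5.
        return False
    if len(value) != expected_hex_length(algorithm):
        return False
    return all(c in HEX_CHARS for c in value)


class HashFailureMonitor:
    """Counts verification results and flags an unusually high failure rate."""

    WARNING = (
        "A lot of hashes seem to fail, which is unusual. Please check your "
        "arguments and make sure you are using the correct hash algorithm."
    )

    def __init__(self, threshold: float = 0.5, min_samples: int = 100):
        self.threshold = threshold      # hypothetical default: warn above 50%
        self.min_samples = min_samples  # avoid warning on a handful of rows
        self.total = 0
        self.failed = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        if not ok:
            self.failed += 1

    def should_warn(self) -> bool:
        if self.total < self.min_samples:
            return False
        return self.failed / self.total > self.threshold
```

The format check can run once on a sample of the hash column before any download starts, while the monitor would be fed one result per verified image and print `HashFailureMonitor.WARNING` the first time `should_warn()` returns True.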

rom1504 commented 1 year ago

I left a comment on your link clarifying what that hash column is

https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion5B.md#with-md5-hashes-in-addition explains how to use the image hash if that's what you need

As for implementing a sanity check for hash, I agree that would be helpful

geroldmeisinger commented 1 year ago

Thanks for the info and the quick response. So it appears this is only available for laion2b-en (which alone is 500GB), not laion2B-en-aesthetics 6.5? And you would have to join them manually?

rom1504 commented 1 year ago

You can join them, yeah.
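As a rough illustration of such a join, here is a small pure-Python sketch that attaches a hash from the full metadata to the rows of a subset. The join key `url`, the hash column name `md5`, and the function `attach_hashes` are assumptions made for this example; the thread does not say which columns the two datasets share. At the scale mentioned above (500GB), an in-memory dictionary would not be practical, so this only shows the logic.

```python
def attach_hashes(subset_rows, full_rows, key="url", hash_column="md5"):
    """Inner join: keep subset rows whose key has a hash in full_rows.

    Both inputs are iterables of dicts; column names are hypothetical.
    """
    hash_by_key = {row[key]: row[hash_column] for row in full_rows}
    return [
        {**row, hash_column: hash_by_key[row[key]]}
        for row in subset_rows
        if row[key] in hash_by_key
    ]


# Toy rows standing in for the aesthetics subset and the full metadata.
subset = [
    {"url": "http://example.com/a.jpg", "aesthetic": 6.7},
    {"url": "http://example.com/c.jpg", "aesthetic": 6.9},
]
full = [
    {"url": "http://example.com/a.jpg", "md5": "0" * 32},
    {"url": "http://example.com/b.jpg", "md5": "1" * 32},
]
joined = attach_hashes(subset, full)
```

Rows of the subset without a match in the full metadata are dropped, so comparing `len(joined)` with `len(subset)` shows how many images ended up without a hash.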
