Closed geroldmeisinger closed 10 months ago
I left a comment on your link clarifying what that hash column is
https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion5B.md#with-md5-hashes-in-addition explains how to use the image hash if that's what you need
As for implementing a sanity check for the hash, I agree that would be helpful
Thanks for the info and the quick response. So it appears this is only available for laion2b-en (which alone is 500GB), not laion2B-en-aesthetics 6.5? And you would have to join them manually?
You can join them yeah
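A rough sketch of what such a join could look like with pandas, in case it helps. The column names are assumptions, not confirmed in this thread: `URL` for the aesthetics subset, and `url` and `md5` for the laion2b-en metadata that carries the hashes. The helper name `attach_md5` is made up for illustration.

```python
import pandas as pd


def attach_md5(subset: pd.DataFrame, md5_meta: pd.DataFrame) -> pd.DataFrame:
    """Add an md5 column to the subset by matching on the image URL.

    Assumed (hypothetical) columns: subset has "URL", md5_meta has "url" and "md5".
    """
    # Keep a single md5 per URL so the left join cannot multiply subset rows.
    lookup = md5_meta[["url", "md5"]].drop_duplicates(subset="url")
    # A left join keeps every subset row; URLs without a match get a missing md5.
    joined = subset.merge(lookup, left_on="URL", right_on="url", how="left")
    # Drop the duplicate key column that came from the right-hand side.
    return joined.drop(columns="url")


# Tiny in-memory example standing in for the real parquet files.
subset = pd.DataFrame({"URL": ["http://a/1.jpg", "http://b/2.jpg"], "TEXT": ["cat", "dog"]})
md5_meta = pd.DataFrame({"url": ["http://a/1.jpg", "http://c/3.jpg"], "md5": ["aaa", "ccc"]})
result = attach_md5(subset, md5_meta)
```

The full metadata will probably not fit in memory at once, so in practice you would likely run this one metadata file at a time and combine the matches afterwards.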
laion2b-en-aesthetics65 contains a "hash" column (int64). According to https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6.5plus/discussions/3 this column is useless. MD5 uses 128 bits and SHA256 uses 256 bits, so neither digest fits into an int64.
Please implement a sanity check when using compute_hash and verify_hash that tests whether this could work at all (int64 != 128bit). Also print an error message when a lot of hashes fail, for example: "A lot of hashes seem to fail, which is unusual. Please check your arguments and make sure you are using the correct hash algorithm."
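To make the request more concrete, here is a minimal sketch of the two checks using only Python's `hashlib`. This is not img2dataset code: the function names `digest_bits`, `can_hold_digest` and `too_many_failures` are hypothetical, and the 50% failure threshold is an arbitrary assumption.

```python
import hashlib


def digest_bits(algorithm: str) -> int:
    """Return the digest width in bits for a hashlib algorithm name."""
    return hashlib.new(algorithm).digest_size * 8


def can_hold_digest(algorithm: str, column_bits: int) -> bool:
    """Check whether a column of the given bit width could store a full digest."""
    return column_bits >= digest_bits(algorithm)


def too_many_failures(failed: int, total: int, threshold: float = 0.5) -> bool:
    """Flag runs where the share of failed hash checks reaches the threshold.

    The default threshold is an arbitrary illustration, not a recommended value.
    """
    return total > 0 and failed / total >= threshold


# An int64 column (64 bits) cannot hold an MD5 or SHA256 digest.
if not can_hold_digest("md5", 64):
    print("Hash column is too narrow for md5; verification cannot succeed.")

if too_many_failures(failed=950, total=1000):
    print(
        "A lot of hashes seem to fail, which is unusual. Please check your "
        "arguments and make sure you are using the correct hash algorithm."
    )
```

The first check could run once up front when the arguments are parsed, the second after each batch of downloads.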