rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Consider not recommending skip_reencode #243

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

skip_reencode is what is causing some of the image failures, as confirmed by @rwightman.

Maybe it's fixable.

rom1504 commented 1 year ago

(I downloaded without it and got no warnings when using the dataset.)

rom1504 commented 1 year ago

I'm considering making --image_size 384 --resize_mode "keep_ratio" --resize_only_if_bigger True the new default and recommending that in the examples.
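For illustration, here is a minimal sketch of how those options could be assembled into an img2dataset command line from Python. The three resize flags are the ones quoted above, and `--skip_reencode` is deliberately not passed. The helper name `build_img2dataset_command` and the `urls.txt` / `out` paths are hypothetical placeholders, not part of the project.

```python
import shlex


def build_img2dataset_command(url_list, output_folder):
    """Return an argv list using the resize options proposed in this issue.

    --skip_reencode is intentionally omitted, so images go through the
    normal decode/re-encode path. The input and output paths are supplied
    by the caller (hypothetical placeholders in the usage below).
    """
    return [
        "img2dataset",
        "--url_list", url_list,
        "--output_folder", output_folder,
        "--image_size", "384",
        "--resize_mode", "keep_ratio",
        "--resize_only_if_bigger", "True",
    ]


# Hypothetical input file and output directory.
command = build_img2dataset_command("urls.txt", "out")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where img2dataset is installed.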

rom1504 commented 1 year ago

Example of broken images from Amy Roberts (HF):

Here’s a selection of images which have this issue. Some of them are corrupted at the source URL, e.g. https://www.trip-blog.net/wp-content/uploads/2013/04/scottish-walking-group-40242_300x250.jpg and some seem to have had issues upon saving, e.g. https://www.oggi-in-tv.it/images/chernobyl-the-last-battle-of-the-ussr.jpg. None were resized.

/fsx/phenaki/coyo-700m/coyo-data-2/22725.tar
{ "clip_similarity_vitl14": 0.255615234375, "image_phash": "dbca8811a73e34b6", "num_faces": 0, "watermark_score": 0.052227362990379333, "aesthetic_score_laion_v2": 5.197615623474121, "caption": "5 Things to do in Andorra", "url": "https://www.trip-blog.net/wp-content/uploads/2015/04/5-Things-to-do-in-Andorra-79384_300x250.jpg", "key": "227254079", "status": "success", "error_message": null, "width": 300, "height": 250, "original_width": 300, "original_height": 250, "exif": "{}", "md5": "46884a7d6507b1b5a595f7d28df93d96" }

/fsx/phenaki/coyo-700m/coyo-data-2/26550.tar
{ "clip_similarity_vitl14": 0.287841796875, "image_phash": "9be67830b621e4bc", "num_faces": 2, "watermark_score": 0.213081955909729, "aesthetic_score_laion_v2": 4.816852569580078, "caption": "Avengers marvel now - Tome 2", "url": "https://servimg.eyrolles.com/static/media/4018/9782809444018_internet_w290.jpg", "key": "265502045", "status": "success", "error_message": null, "width": 250, "height": 373, "original_width": 250, "original_height": 373, "exif": "{}", "md5": "df5360c45cb14c79c45787cab5f3842a" }

/fsx/phenaki/coyo-700m/coyo-data-2/69277.tar
{ "clip_similarity_vitl14": 0.2152099609375, "image_phash": "9ebe411ba4710ee5", "num_faces": 0, "watermark_score": 0.018437210470438004, "aesthetic_score_laion_v2": 5.356045246124268, "caption": "Mhc Quiet Deluxe Suite Near Downtown", "url": "https://cdn.quick-sell.ro/b643e1291c2690d7c05efd50c12ba422/https%3A%2F%2Fcdn.tourismcloudservice.com%2FHotelsV3%2F637474%2F20202281223576.jpg", "key": "692770902", "status": "success", "error_message": null, "width": 500, "height": 375, "original_width": 500, "original_height": 375, "exif": "{}", "md5": "6374114e77b6f62090b4a6667eec81a0" }

/fsx/phenaki/coyo-700m/coyo-data-2/63829.tar
{ "clip_similarity_vitl14": 0.26025390625, "image_phash": "b2861d64e4c9ed71", "num_faces": 0, "watermark_score": 0.0002586680056992918, "aesthetic_score_laion_v2": 5.084773063659668, "caption": "Microtel Inn & Suites Norcross", "url": "https://cdn.travitude.co.uk/979eb34519a70d2737ad9c78c43c98d9/https%3A%2F%2Fwww.hotelbeds.com%2Fgiata%2F29%2F295798%2F295798a_hb_w_009.jpg", "key": "638290673", "status": "success", "error_message": null, "width": 320, "height": 213, "original_width": 320, "original_height": 213, "exif": "{}", "md5": "7249fa194ac7845708d8c47b09293919" }

/fsx/phenaki/coyo-700m/coyo-data-2/27317.tar
{ "clip_similarity_vitl14": 0.1729736328125, "image_phash": "906f304fa0b74f74", "num_faces": 0, "watermark_score": 0.05018091946840286, "aesthetic_score_laion_v2": 4.531515598297119, "caption": "Zentral Center (Adults Only 14+", "url": "https://cdn.quick-sell.ro/9b3719ca3493821464b58c756806f554/https%3A%2F%2Frezervari.paralela45.ro%2Fimg_of%2FH6077_C0_13864.jpg", "key": "273171134", "status": "success", "error_message": null, "width": 500, "height": 375, "original_width": 500, "original_height": 375, "exif": "{}", "md5": "0a85abd5d9ddb2d1d772ff4209c717e2" }

/fsx/phenaki/coyo-700m/coyo-data-2/15145.tar
{ "clip_similarity_vitl14": 0.321044921875, "image_phash": "aaa77d2faaac2098", "num_faces": 0, "watermark_score": 0.08868316560983658, "aesthetic_score_laion_v2": 4.429727077484131, "caption": "2016 - Chopard Watches - rubber (car tire) band", "url": "http://mediashow.ro/root.php?g2_view=core.DownloadItem&g2_itemId=447925&g2_serialNumber=2&?rndm=bbnc", "key": "151454454", "status": "success", "error_message": null, "width": 200, "height": 267, "original_width": 200, "original_height": 267, "exif": "{}", "md5": "e91512abc73248d1f22ef13b81b63b8d" }

/fsx/phenaki/coyo-700m/coyo-data-2/71778.tar
{ "clip_similarity_vitl14": 0.142578125, "image_phash": "f1992d66466c3bc8", "num_faces": 0, "watermark_score": 0.02955714799463749, "aesthetic_score_laionv2": 5.469544410705566, "caption": "Must-see landscapes of the natural world (part 2", "url": "https://www.trip-blog.net/wp-content/uploads/2013/07/outback-04.ashx-131441_300x250.jpg", "key": "717781985", "status": "success", "error_message": null, "width": 300, "height": 250, "original_width": 300, "original_height": 250, "exif": "{}", "md5": "5ef96d2b18bfb712473af71ccdc187dc" }

/fsx/phenaki/coyo-700m/coyo-data-2/05658.tar
{ "clip_similarity_vitl14": 0.16455078125, "image_phash": "a3b273bf4a43ac28", "num_faces": 0, "watermark_score": 0.0987565666437149, "aesthetic_score_laion_v2": 2.399261474609375, "caption": "Star 115 (25mm) TinkerTech Two Cutters", "url": "https://cdn.shopify.com/s/files/1/2416/9341/products/c6b56835-78ce-4ad0-83ec-8625dee04a4f_large.jpg?v=1514037268", "key": "056585674", "status": "success", "error_message": null, "width": 307, "height": 307, "original_width": 307, "original_height": 307, "exif": "{\"Image HostComputer\": \"imagery4\"}", "md5": "61975bd92e7ee4de3294cc79f6519ec2" }

/fsx/phenaki/coyo-700m/coyo-data-2/58523.tar
{ "clip_similarity_vitl14": 0.2041015625, "image_phash": "ad69c925fad84496", "num_faces": 0, "watermark_score": 0.004144101869314909, "aesthetic_score_laion_v2": 5.134791374206543, "caption": "Mhc Quiet Deluxe Suite Near Downtown", "url": "https://cdn.quick-sell.ro/a29858d7914481a17ee6cb813b6343d4/https%3A%2F%2Fcdn.tourismcloudservice.com%2FHotelsV3%2F637474%2F2020228122355481.jpg", "key": "585231622", "status": "success", "error_message": null, "width": 500, "height": 375, "original_width": 500, "original_height": 375, "exif": "{}", "md5": "02ca09e353096dc2498ab243baf32af7" }

/fsx/phenaki/coyo-700m/coyo-data-2/30136.tar
{ "clip_similarity_vitl14": 0.296142578125, "image_phash": "bfbfc24881e056a3", "num_faces": 0, "watermark_score": 0.050376392900943756, "aesthetic_score_laion_v2": 4.77297306060791, "caption": "Knoos Men Tan Lace-up Casual Shoes", "url": "https://cdn.shopclues.com/images1/thumbnails/100783/320/320/146418599-100783861-1559561618.jpg", "key": "301367748", "status": "success", "error_message": null, "width": 320, "height": 320, "original_width": 320, "original_height": 320, "exif": "{}", "md5": "e7e145a2c6fa8860f99b0d8298fb5661" }

/fsx/phenaki/coyo-700m/coyo-data-2/24749.tar
{ "clip_similarity_vitl14": 0.180419921875, "image_phash": "8fb272cfe1887870", "num_faces": 0, "watermark_score": 5.02593356941361e-05, "aesthetic_score_laion_v2": 4.426836013793945, "caption": "M Central Apartments", "url": "https://cdn.quick-sell.ro/afef5b58d42d9180f63de90d3284625a/https%3A%2F%2Fcdn.tourismcloudservice.com%2FHotelsV3%2F639642%2F202036143838328.jpg", "key": "247499867", "status": "success", "error_message": null, "width": 282, "height": 500, "original_width": 282, "original_height": 500, "exif": "{}", "md5": "eea4cf40c2dba70009f2fdb009ad8766" }

/fsx/phenaki/coyo-700m/coyo-data-2/43717.tar
{ "clip_similarity_vitl14": 0.28564453125, "image_phash": "8787870f07f1f0f0", "num_faces": 0, "watermark_score": 0.14687882363796234, "aesthetic_score_laion_v2": 4.11320161819458, "caption": "Hot sale golden handle color promotional gift ceramic coffee mugs", "url": "https://cdn.goodao.net/alikeso/%E5%BE%AE%E4%BF%A1%E5%9B%BE%E7%89%87_20190504150526-300x300.jpg", "key": "437175072", "status": "success", "error_message": null, "width": 300, "height": 300, "original_width": 300, "original_height": 300, "exif": "{}", "md5": "e6ef6e465c1552241ff150819d1c5318" }

/fsx/phenaki/coyo-700m/coyo-data-2/25437.tar
{ "clip_similarity_vitl14": 0.2025146484375, "image_phash": "979bde478278a11a", "num_faces": 4, "watermark_score": 0.06762968748807907, "aesthetic_score_laion_v2": 4.8978352546691895, "caption": "Chernobyl: the last battle of the Ussr", "url": "https://www.oggi-in-tv.it/images/chernobyl-the-last-battle-of-the-ussr.jpg", "key": "254373346", "status": "success", "error_message": null, "width": 560, "height": 320, "original_width": 560, "original_height": 320, "exif": "{}", "md5": "2fdbf8bae60b893b082452e7bfa94a5d" }

/fsx/phenaki/coyo-700m/coyo-data-2/61148.tar
{ "clip_similarity_vitl14": 0.239990234375, "image_phash": "8ff8f02e0707f0f0", "num_faces": 0, "watermark_score": 0.017646746709942818, "aesthetic_score_laion_v2": 5.162074565887451, "caption": "Moving Expenses and Tax Deductions in 2020", "url": "https://1ststepmovers.com/wp-content/uploads/2020/07/Moving-Expenses-and-Tax-Deductions-2020-1st-Step-Movers-2-500x383.jpg", "key": "611481971", "status": "success", "error_message": null, "width": 500, "height": 383, "original_width": 500, "original_height": 383, "exif": "{}", "md5": "665671542eb2353bbac25798044e7121" }

/fsx/phenaki/coyo-700m/coyo-data-2/61803.tar
{ "clip_similarity_vitl14": 0.1695556640625, "image_phash": "8080d47a7f7f870b", "num_faces": 0, "watermark_score": 0.001754570985212922, "aesthetic_score_laion_v2": 4.825148105621338, "caption": "How to travel with friends for long periods of time (part 2", "url": "https://www.trip-blog.net/wp-content/uploads/2013/04/scottish-walking-group-40242_300x250.jpg", "key": "618036179", "status": "success", "error_message": null, "width": 300, "height": 250, "original_width": 300, "original_height": 250, "exif": "{}", "md5": "7cd0848585cc1d96cbb6dc8e6e1686c1" }
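
All of the records above report "status": "success" with unchanged dimensions, so the metadata alone does not flag the broken files. As a rough sketch of how such shards might be screened, the snippet below scans a tar file and lists JPEG members that do not start with the JPEG start-of-image marker or do not end with the end-of-image marker. This is only a heuristic and an assumption on my part, not how img2dataset validates images: it catches truncated files but not every kind of corruption, and valid JPEGs with trailing bytes would be flagged too. The function names are hypothetical.

```python
import tarfile

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_complete_jpeg(data: bytes) -> bool:
    """Heuristic check: data begins with SOI and ends with EOI."""
    return data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI)


def find_suspect_jpegs(tar_path):
    """Return names of .jpg/.jpeg members that fail the marker check."""
    suspects = []
    with tarfile.open(tar_path) as shard:
        for member in shard:
            name = member.name.lower()
            if not member.isfile() or not name.endswith((".jpg", ".jpeg")):
                continue  # skip captions, JSON sidecars, directories
            data = shard.extractfile(member).read()
            if not looks_complete_jpeg(data):
                suspects.append(member.name)
    return suspects
```

Running `find_suspect_jpegs` on a shard such as the ones listed above would give a list of member names to inspect by hand or to re-download with re-encoding enabled.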