Closed zw615 closed 10 months ago
Hi,
I think that's as expected. Link rot is about 1% per month, and it has already been more than 8 months since the initial release, so the success rate went from about 95% to about 87%.
You may be able to increase this a little by changing the user agent.
You could also get more samples by downloading laion2B-en instead.
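The arithmetic behind that estimate can be sketched as follows. This is only an illustration of the numbers quoted above (95% initial success, 1% rot per month, 8 months); the function name and the choice between a linear and a compounding model are assumptions, not something img2dataset computes.

```python
def remaining_success_rate(initial_rate, monthly_rot, months, compound=True):
    """Estimate the download success rate after `months` of link rot."""
    if compound:
        # Each month, a fraction `monthly_rot` of the still-live links dies.
        return initial_rate * (1.0 - monthly_rot) ** months
    # Linear approximation: lose `monthly_rot` absolute points every month.
    return initial_rate - monthly_rot * months


if __name__ == "__main__":
    linear = remaining_success_rate(0.95, 0.01, 8, compound=False)
    compound = remaining_success_rate(0.95, 0.01, 8, compound=True)
    print(f"linear:   {linear:.3f}")
    print(f"compound: {compound:.3f}")
```

Both models land close to the 87% mentioned in the reply, so the observed 0.83-0.85 is only a few points below what link rot alone would predict.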
On Wed, Feb 8, 2023, 01:22 Zeyu Wang @.***> wrote:
Hi, I have been trying to download LAION-400M using the instructions you provided, but the download is incomplete. By a rough estimate, the success rate is about 0.83-0.85, so for a 400M-sample dataset I actually get 350M+ samples. Here is the content of a typical stats.json file:
"HTTP Error 404: Not Found": 594, "success": 8542, "HTTP Error 503: Service Temporarily Unavailable": 11, "HTTP Error 403: Forbidden": 141, "HTTP Error 503: Service Unavailable": 17, "<urlopen error [Errno 113] No route to host>": 8, "Use of image disallowed by X-Robots-Tag directive": 30, "HTTP Error 401: Unauthorized": 9, "<urlopen error [Errno -2] Name or service not known>": 139, "HTTP Error 400: Bad Request": 31, "HTTP Error 500: Internal Server Error": 12, "Image decoding error": 83, "HTTP Error 404: File Not Found": 14, "<urlopen error [Errno 111] Connection refused>": 10, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)>": 31, "HTTP Error 521: ": 4, "HTTP Error 530: ": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)>": 18, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn.twistarticle.com'. (_ssl.c:1131)>": 1, "Remote end closed connection without response": 5, "<urlopen error [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)>": 5, "URL can't contain control characters. '/tz/ItemImages/Games/Game Boy Advance/Mega%20Man%20Battle%20Network.jpg' (found at least ' ')": 1, "<urlopen error [Errno -5] No address associated with hostname>": 17, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn.webfronts.com'. (_ssl.c:1131)>": 2, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn1.freevap.ch'. (_ssl.c:1131)>": 1, "URL can't contain control characters. 
'/thumbnail.asp?file=assets/images/pfly images/sweater happy skull_thumbnail.jpg&maxx=150&maxy=0' (found at least ' ')": 1, "<urlopen error [Errno -3] Temporary failure in name resolution>": 62, "HTTP Error 523: ": 4, "'ascii' codec can't encode character '\xf1' in position 44: ordinal not in range(128)": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'today.law.harvard.edu'. (_ssl.c:1131)>": 1, "HTTP Error 502: Bad Gateway": 3, "timed out": 15, "The read operation timed out": 42, "
": 64, "HTTP Error 422: Unprocessable Entity": 3, "HTTP Error 403: Access Denied": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'preview.mp3mixx.com'. (_ssl.c:1131)>": 1, "HTTP Error 503: first byte timeout": 7, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.loccie.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)>": 5, "HTTP Error 415: Unsupported Media Type": 1, "<urlopen error EOF occurred in violation of protocol (ssl.c:1131)>": 4, "OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:798: error: (-215:Assertion failed) !buf.empty() in function 'imdecode'\n": 5, "HTTP Error 410: Gone": 9, "HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.\nThe last 30x error message was:\nMoved Permanently": 3, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.flowerschennai.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'r8zlusvr.rocketcdn.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1131)>": 3, "HTTP Error 503: Service Unavailable: Back-end server is at capacity": 3, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'dcspestcontrol.com'. (_ssl.c:1131)>": 1, "[Errno 104] Connection reset by peer": 1, "HTTP Error 308: Permanent redirect": 1, "URL can't contain control characters. 
'/les%20goupes/T/Triumph (CAN)/The Sport of Kings/The Sport of Kings.jpg' (found at least ' ')": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'partysurprise.co.za'. (_ssl.c:1131)>": 1, "HTTP Error 404: Not found": 1, "HTTP Error 404: The specified resource does not exist.": 1, "HTTP Error 404: ": 2, "<urlopen error [Errno 101] Network is unreachable>": 1, "HTTP Error 500: Domain Not Found": 1, "URL can't contain control characters. @.***?v=1574368035 2x' (found at least ' ')": 1, "HTTP Error 503: Backend is unhealthy": 1, "HTTP Error 404: The specified blob does not exist.": 2, "HTTP Error 520: status code 520": 1, "HTTP Error 308: Permanent Redirect": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'spotted.tv'. (_ssl.c:1131)>": 1, "HTTP Error 429: Too Many Requests": 3, "URL can't contain control characters. '/th?q=Diy Wood Wine Holder' (found at least ' ')": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'images.lightingandfanpros.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.masedomani.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'gamefaqs1.cbsistatic.com'. (_ssl.c:1131)>": 1, "HTTP Error 403: The specified account is disabled.": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'images.celebrateexpress.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.pencalenickhouse.com'. 
(_ssl.c:1131)>": 1, "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.footesmusic.com'. (_ssl.c:1131)>": 1, "<urlopen error [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1131)>": 1, "unknown url type: '2428'": 1, " ": 1
Also, according to the OpenCLIP example code, they get a total of 41455 shards for LAION-400M, but I only get 41408 shards, which is 47 fewer. I used the default number_sample_per_shard=10000, so I am not sure where this difference comes from.
Is that normal? How can I download all of the 400M samples? Thanks a lot!
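A long status dictionary like the one above is easier to read once the reasons are grouped. The sketch below is one hypothetical grouping (the category names and matching rules are assumptions made for illustration); the sample counts are copied from the stats.json excerpt in this post.

```python
from collections import Counter


def categorize(status):
    """Map a raw status string to a coarse category (hypothetical grouping)."""
    if status == "success":
        return "success"
    if status.startswith("HTTP Error "):
        # Keep only the numeric code: "HTTP Error 404: Not Found" -> "http_404".
        return "http_" + status[len("HTTP Error "):].split(":", 1)[0].strip()
    if "[SSL:" in status:
        return "ssl"
    dns_markers = (
        "Name or service not known",
        "failure in name resolution",
        "No address associated with hostname",
    )
    if any(marker in status for marker in dns_markers):
        return "dns"
    if "timed out" in status:
        return "timeout"
    return "other"


def summarize(status_counts):
    """Return {category: (count, fraction)} sorted by descending count."""
    totals = Counter()
    for status, count in status_counts.items():
        totals[categorize(status)] += count
    n = sum(totals.values())
    return {cat: (cnt, cnt / n) for cat, cnt in totals.most_common()}


if __name__ == "__main__":
    sample = {
        "success": 8542,
        "HTTP Error 404: Not Found": 594,
        "HTTP Error 403: Forbidden": 141,
        "<urlopen error [Errno -2] Name or service not known>": 139,
        "Image decoding error": 83,
        "<urlopen error [Errno -3] Temporary failure in name resolution>": 62,
        "The read operation timed out": 42,
    }
    for category, (count, fraction) in summarize(sample).items():
        print(f"{category:10s} {count:6d} {fraction:.3f}")
```

In this excerpt, HTTP 404 is by far the largest failure category, which fits dead links better than it fits a resolver problem.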
BTW, I searched and found a similar issue at https://github.com/rom1504/img2dataset/issues/242, where it is suggested to set up knot resolver for DNS resolution. However, I did set up knot resolver exactly as described in the docs, and verified it with dig @localhost google.com, so I don't think the DNS resolver is the problem.
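As a complement to the dig check, the following sketch resolves a few hostnames through the system resolver (the path the downloader's HTTP requests would normally take) and records how long each lookup takes. Unlike dig @localhost, it does not query a specific server. The function name and the injectable `resolve` parameter are assumptions made so the logic can be exercised without network access.

```python
import socket
import time


def check_hosts(hosts, resolve=socket.getaddrinfo):
    """Try to resolve each host; return {host: (ok, seconds, detail)}."""
    results = {}
    for host in hosts:
        start = time.perf_counter()
        try:
            infos = resolve(host, 443)
            detail = infos[0][4][0]  # address part of the first sockaddr
            ok = True
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            detail = str(exc)
            ok = False
        results[host] = (ok, time.perf_counter() - start, detail)
    return results


# Example use (performs real lookups, so it needs network access):
# for host, (ok, seconds, detail) in check_hosts(["google.com"]).items():
#     print(host, "OK" if ok else "FAILED", f"{seconds:.3f}s", detail)
```

Running it against hostnames taken from the failing entries in stats.json would show whether "Name or service not known" is reproducible outside the download.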
Hi, I tried to download the dataset, but each tar file is only ~20MB and contains only ~2200 files. This is the command I used:
img2dataset --url_list laion400m-meta --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder laion400m-data --processes_count 16 --thread_count 128 --image_size 256 \
    --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True
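To turn "~2200 files per tar" into a number of samples, the files in a shard can be grouped by key. The sketch below assumes the usual webdataset layout, where every sample is stored as several files sharing one basename (for example an image, a caption, and a metadata file); `shard_summary` is a hypothetical helper, not part of img2dataset.

```python
import os
import tarfile
from collections import Counter


def shard_summary(tar_path):
    """Return (number_of_samples, Counter of file extensions) for one shard."""
    keys = set()
    extensions = Counter()
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Files belonging to the same sample share the same basename.
            key, ext = os.path.splitext(os.path.basename(member.name))
            keys.add(key)
            extensions[ext] += 1
    return len(keys), extensions


# Example use on one shard of the output folder (the path is an assumption):
# samples, extensions = shard_summary("laion400m-data/00000.tar")
# print(samples, dict(extensions))
```

If each sample is stored as three files, ~2200 files would correspond to roughly 730 successful samples out of the 10000 attempted for that shard.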
Did you set up knot resolver?
No, I didn't set up knot resolver. I will set it up and try again, thank you.
After installing knot and bind9, the tar files are larger than 20MB but still don't reach 270MB; they are ~100MB each. This is the output I got during the download:
total - success: 0.375 - failed to download: 0.621 - failed to resize: 0.004 - images per sec: 10 - count: 10000
worker - success: 0.370 - failed to download: 0.627 - failed to resize: 0.003 - images per sec: 10 - count: 10000
total - success: 0.372 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 20 - count: 20000
4it [17:22, 142.27s/it]worker - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 29 - count: 30000
worker - success: 0.374 - failed to download: 0.620 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total - success: 0.372 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 39 - count: 40000
6it [17:25, 60.78s/it]worker - success: 0.367 - failed to download: 0.629 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 49 - count: 50000
worker - success: 0.374 - failed to download: 0.623 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 58 - count: 60000
11it [17:32, 10.15s/it]worker - success: 0.368 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 68 - count: 70000
worker - success: 0.359 - failed to download: 0.636 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total - success: 0.369 - failed to download: 0.626 - failed to resize: 0.004 - images per sec: 77 - count: 80000
worker - success: 0.365 - failed to download: 0.630 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 87 - count: 90000
worker - success: 0.358 - failed to download: 0.638 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.368 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 96 - count: 100000
worker - success: 0.363 - failed to download: 0.633 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.367 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 106 - count: 110000
15it [17:37, 3.16s/it]worker - success: 0.354 - failed to download: 0.641 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.366 - failed to download: 0.629 - failed to resize: 0.004 - images per sec: 115 - count: 120000
worker - success: 0.356 - failed to download: 0.639 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.366 - failed to download: 0.630 - failed to resize: 0.004 - images per sec: 125 - count: 130000
worker - success: 0.362 - failed to download: 0.635 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 134 - count: 140000
worker - success: 0.359 - failed to download: 0.638 - failed to resize: 0.003 - images per sec: 10 - count: 10000
total - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 144 - count: 150000
How can I fix this error?
Update: this is the log from wandb
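The progress lines above can be parsed to follow the success rate over time. This is a minimal sketch written against the line format shown in this post; the regular expression and helper names are assumptions, and the tqdm prefixes that are fused onto some lines are skipped by searching rather than matching from the start of the line.

```python
import re

LINE_RE = re.compile(
    r"(total|worker) - success: ([\d.]+) - failed to download: ([\d.]+)"
    r" - failed to resize: ([\d.]+) - images per sec: (\d+) - count: (\d+)"
)


def parse_progress(text):
    """Extract every total/worker progress record found in `text`."""
    rows = []
    for match in LINE_RE.finditer(text):
        kind, success, failed_dl, failed_resize, ips, count = match.groups()
        rows.append({
            "kind": kind,
            "success": float(success),
            "failed_to_download": float(failed_dl),
            "failed_to_resize": float(failed_resize),
            "images_per_sec": int(ips),
            "count": int(count),
        })
    return rows


def latest_total(text):
    """Return the most recent 'total' record, or None if there is none."""
    totals = [row for row in parse_progress(text) if row["kind"] == "total"]
    return totals[-1] if totals else None
```

On the log in this post, the last total record reports a success rate of 0.365 after 150000 attempted samples.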
You should set up only knot resolver, not bind9, and also disable your previous resolver (e.g. systemd-resolved).
You can make sure this is working by looking at the CPU usage of knot in top/htop.
Then you can check the error reasons in wandb or in the JSON files in the output folder.
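One way to check the error reasons across the whole output folder is to sum them over the per-shard JSON files. The sketch below assumes the files are named `*_stats.json` and keep their counts under a `status_dict` key; if that key is missing it falls back to treating the file as a flat reason-to-count mapping like the one posted earlier in this thread. Both the file pattern and the key name should be checked against your own output.

```python
import glob
import json
import os
from collections import Counter


def aggregate_errors(output_folder):
    """Sum per-reason counts over all *_stats.json files in a folder."""
    totals = Counter()
    pattern = os.path.join(output_folder, "*_stats.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            stats = json.load(f)
        # Assumed layout: counts under "status_dict"; otherwise a flat mapping.
        status = stats.get("status_dict", stats)
        for reason, count in status.items():
            if isinstance(count, int):
                totals[reason] += count
    return totals


# Example use (the folder name comes from the command in this thread):
# for reason, count in aggregate_errors("laion400m-data").most_common(10):
#     print(count, reason)
```

A failure list dominated by name-resolution errors points at DNS, while one dominated by HTTP 404 points at link rot.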
Yes, I tried using only knot, but the failure rate was still ~0.6 for each shard. I then added more DNS servers to /etc/resolv.conf and re-downloaded, and the results seem better than before.
Hello, I recently faced a similar problem with the download success rate, and knot resolver didn't help. Could you please share more details on how you modified the DNS settings to improve it? Many thanks!
Yes, I modified /etc/resolv.conf because of the network at my company. Here are the public DNS servers I used to download the dataset:
nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 76.76.2.0
nameserver 76.76.10.0
nameserver 9.9.9.9
nameserver 1.1.1.1
nameserver 1.0.0.1
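One caveat when listing this many servers: on glibc-based systems, resolv.conf(5) documents a limit of three nameserver entries, so entries after the third may simply be ignored. The sketch below parses resolv.conf text and splits the list at that limit; the helper names are hypothetical, and the limit of 3 is the documented glibc value, which may not apply to other resolvers.

```python
GLIBC_MAXNS = 3  # limit documented in resolv.conf(5) for glibc


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def split_used_ignored(servers, limit=GLIBC_MAXNS):
    """Split the list into entries within the limit and entries beyond it."""
    return servers[:limit], servers[limit:]


# Example use on the live file:
# with open("/etc/resolv.conf") as f:
#     used, ignored = split_used_ignored(parse_nameservers(f.read()))
#     print("used:", used, "possibly ignored:", ignored)
```

Under that assumption, only 8.8.8.8, 8.8.4.4, and 76.76.2.0 from the list above would actually be queried.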
One more thing: you could try reducing the process and thread counts to increase the success rate. Because my bandwidth is limited, I had to adapt the config to my environment, and I used --processes_count 2 --thread_count 32. This tradeoff between speed and success rate is just from my own experiments, so I'm not sure the config will work for you, but you could give it a try. Hope that helps.
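The same reduced-concurrency settings can be expressed through img2dataset's Python entry point. The sketch below reuses the argument values from the command earlier in this thread and only changes the process and thread counts; `build_download_args` and `run` are hypothetical helpers, and the values are this poster's settings, not recommended defaults.

```python
def build_download_args(processes_count=2, thread_count=32):
    """Arguments mirroring the CLI command in this thread, with lower concurrency."""
    return dict(
        url_list="laion400m-meta",
        input_format="parquet",
        url_col="URL",
        caption_col="TEXT",
        output_format="webdataset",
        output_folder="laion400m-data",
        processes_count=processes_count,
        thread_count=thread_count,
        image_size=256,
        save_additional_columns=["NSFW", "similarity", "LICENSE"],
        enable_wandb=True,
    )


def run(args):
    # Imported lazily so the arguments can be inspected without the package.
    from img2dataset import download

    download(**args)


if __name__ == "__main__":
    args = build_download_args()
    print(args["processes_count"], "processes,", args["thread_count"], "threads")
    # run(args)  # uncomment to start the actual download
```

Compared with --processes_count 16 --thread_count 128, these settings cut both values sharply, which is consistent with the observation that fewer simultaneous requests produced fewer DNS failures.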
Yes, I also observed that reducing the number of processes raised my success rate quite a bit, and I stopped getting DNS errors like "<urlopen error [Errno -2] Name or service not known>".