rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Low success rate on downloading laion400m #400

Open tchaton opened 7 months ago

tchaton commented 7 months ago

Hey there, @rom1504,

I have been trying to download laion400m using the scripts from an EC2 instance m5n.8xlarge and the success rate is quite poor.

I am getting a success rate of 10 images for 10k requests with the default command in the README.

Any idea what I am doing wrong?

Best, T.C

rom1504 commented 7 months ago

Did you set up knot resolver?

Please share a wandb link so we can see the error cause

tchaton commented 7 months ago

Oh, interesting. I haven't. Let me try again. What's knot resolver?

tchaton commented 7 months ago

I am getting errors when trying to install knot resolver too.

⚡ ~ wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
--2024-02-03 11:16:00--  https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
Resolving secure.nic.cz (secure.nic.cz)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘secure.nic.cz’
⚡ ~ sudo dpkg -i knot-resolver-release.deb
sudo: unable to resolve host ip-10-192-12-27: Temporary failure in name resolution
dpkg: error: cannot access archive 'knot-resolver-release.deb': No such file or directory
tchaton commented 7 months ago

Here are the normal logs. It looks like wandb hit a "Network error (TransientError), entering retry loop".

⚡ ~ img2dataset --url_list the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ --input_format "parquet"\
>          --url_col "URL" --caption_col "TEXT" --output_format webdataset\
>            --output_folder laion400m-data --processes_count 32 --thread_count 128 --image_size 256\
>              --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True
Starting the downloading of this file
Sharding file number 1 of 32 called /teamspace/studios/this_studio/the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
0it [00:00, ?it/s]File sharded in 1294 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
wandb: Currently logged in as: thomas-chaton. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.2
wandb: Run data is saved locally in /teamspace/studios/this_studio/wandb/run-20240203_111216-t4t3ohoz
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run woven-microwave-1
wandb: ⭐️ View project at https://wandb.ai/thomas-chaton/img2dataset
wandb: 🚀 View run at https://wandb.ai/thomas-chaton/img2dataset/runs/t4t3ohoz
wandb: Network error (TransientError), entering retry loop.
1it [04:07, 247.25s/it]worker  - success: 0.002 - failed to download: 0.998 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.002 - failed to download: 0.998 - failed to resize: 0.000 - images per sec: 42 - count: 10000
17it [04:11,  1.61s/it]wandb: Network error (TransientError), entering retry loop.
22it [04:14,  1.16it/s]wandb: Network error (TransientError), entering retry loop.
24it [04:15,  1.61it/s]worker  - success: 0.008 - failed to download: 0.992 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 83 - count: 20000
worker  - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 124 - count: 30000
worker  - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 165 - count: 40000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 207 - count: 50000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 245 - count: 60000
worker  - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 285 - count: 70000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 326 - count: 80000
worker  - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 366 - count: 90000
worker  - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 406 - count: 100000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 447 - count: 110000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 487 - count: 120000
worker  - success: 0.002 - failed to download: 0.998 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 528 - count: 130000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 569 - count: 140000
worker  - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 609 - count: 150000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 650 - count: 160000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 690 - count: 170000
worker  - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 731 - count: 180000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 772 - count: 190000
worker  - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 812 - count: 200000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 853 - count: 210000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 894 - count: 220000
worker  - success: 0.007 - failed to download: 0.993 - failed to resize: 0.000 - images per sec: 41 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 934 - count: 230000
worker  - success: 0.002 - failed to download: 0.999 - failed to resize: 0.000 - images per sec: 42 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 975 - count: 240000
28it [04:21,  1.13s/it]worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1004 - count: 250000
worker  - success: 0.004 - failed to download: 0.996 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1038 - count: 260000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1071 - count: 270000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 40 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1111 - count: 280000
31it [04:26,  1.32s/it]worker  - success: 0.008 - failed to download: 0.992 - failed to resize: 0.000 - images per sec: 39 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1134 - count: 290000
worker  - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 39 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1173 - count: 300000
worker  - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 39 - count: 10000
total   - success: 0.005 - failed to download: 0.995 - failed to resize: 0.000 - images per sec: 1207 - count: 310000
32it [04:51,  8.13s/it]worker  - success: 0.037 - failed to download: 0.963 - failed to resize: 0.000 - images per sec: 35 - count: 10000
total   - success: 0.006 - failed to download: 0.994 - failed to resize: 0.000 - images per sec: 1133 - count: 320000
rom1504 commented 7 months ago

I see you're using 32 processes and 128 threads per process. That might be too much for the machine you're using; try decreasing them.
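As a rough sketch of what lowering the concurrency could look like, the snippet below collects the arguments from the command earlier in the thread into a dictionary with tunable `processes_count` and `thread_count`. The helper `build_download_kwargs` is hypothetical, and the starting values (half the cores, 32 threads) are illustrative assumptions rather than recommended settings.

```python
import os


def build_download_kwargs(processes_count, thread_count):
    """Collect the arguments from the command above, with tunable concurrency."""
    return {
        "url_list": "the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/",
        "input_format": "parquet",
        "url_col": "URL",
        "caption_col": "TEXT",
        "output_format": "webdataset",
        "output_folder": "laion400m-data",
        "processes_count": processes_count,
        "thread_count": thread_count,
        "image_size": 256,
        "save_additional_columns": ["NSFW", "similarity", "LICENSE"],
        "enable_wandb": True,
    }


# Illustrative starting point only (an assumption, not a tuned value).
cores = os.cpu_count() or 1
kwargs = build_download_kwargs(processes_count=max(1, cores // 2), thread_count=32)
print(kwargs["processes_count"], kwargs["thread_count"])

# To launch the download with these settings (requires img2dataset):
# from img2dataset import download
# download(**kwargs)
```

Raising the values step by step while watching the success rate makes it easier to see where the machine or the network starts to struggle.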

As for knot resolver, please follow their doc for instructions on how to install it for your distribution

tchaton commented 7 months ago

Thanks @rom1504. The machine has 32 CPUs, so I thought it should be fine. I am running inside a docker container, so I am having some issues installing knot resolver.

I will keep you updated.

rom1504 commented 7 months ago

I think you probably have more network issues than just DNS if you have 99% failure. Maybe some misconfiguration of docker or the cloud provider?

tchaton commented 7 months ago

Hey @rom1504, any idea what I should be looking for on the docker or cloud provider side as a possible source of issues?

Also, should I use knot or bind9?

rom1504 commented 7 months ago

I advise you use knot, it's better for this use case.

For network issues, it could be a lot of things, but maybe a limit on the number of open handles/files?
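One way to check the open-file limit from inside the container is Python's standard `resource` module (Unix only). This is a minimal sketch: the helper names and the 65536 target are assumptions for illustration, and a container may cap the hard limit below that.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft limit toward `target`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the platform refused the new value; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft, hard = open_file_limits()
print(f"soft={soft} hard={hard}")
# Each download thread holds at least one socket, so a low soft limit
# can be reached quickly with many threads per process.
print("soft limit now:", raise_soft_limit(65536))
```

The limit applies per process and is inherited by child processes, so it has to be raised before the workers are started.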

tchaton commented 7 months ago

Thanks, @rom1504 I will check this out.

I managed to install knot on the host but it isn't visible inside the container and networking seems broken. Have you ever tried?

tchaton commented 7 months ago

I am also curious what kind of numbers you get without using knot resolver?

rom1504 commented 7 months ago

No, I haven't tried using docker with img2dataset. Maybe just use the host?

Usually I get 40-50 image/s/core. With 32 cores that would be 1300-1600 image/s

Knot resolver does not change the speed, it increases the success rate

Should be about 80% for laion400m.
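As a quick check of the arithmetic above, here is a small sketch. The function names are hypothetical, and the 400M URL count is an assumption based on the dataset name.

```python
def expected_throughput(cores, per_core_low=40, per_core_high=50):
    """Return the (low, high) images/s range for a machine with `cores` cores."""
    return cores * per_core_low, cores * per_core_high


def expected_successes(total_urls, success_rate=0.8):
    """Number of images expected to download at the given success rate."""
    return round(total_urls * success_rate)


low, high = expected_throughput(32)
print(low, high)  # 1280 1600, i.e. roughly the 1300-1600 image/s quoted above

# Assumed dataset size of 400M URLs, at the ~80% success rate mentioned above.
print(expected_successes(400_000_000))  # 320000000
```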

tchaton commented 7 months ago

Hey @rom1504 I am trying to get it working on https://lightning.ai/, so it runs in docker. Yes, my success rate is far from this. So something is wrong.

tchaton commented 7 months ago

@rom1504 Here is the PR I am working on: https://github.com/Lightning-AI/pytorch-lightning/pull/19400 and the API:

I am trying to make data processing efficient while easy to hack around. Here is the example to download laion400m. Still need some extra optimizations.

import os
from multiprocessing.pool import ThreadPool
from lightning.data import optimize
from lightning.data.processing.readers import ParquetReader
from lightning.data.processing.image import download_image
from PIL import Image
from time import sleep

input_dir = "the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta"
parquet_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".parquet")]

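def process(row):
    image_id, url, text, height, width, image_license, nsfw, similarity = row
    img, err = download_image(url, 1, timeout=5)
    if err:
        return None, err

    try:
        # Open the downloaded image (not the URL); assumes download_image
        # returns something Image.open accepts, such as a file-like object.
        return [image_id, Image.open(img).resize((224, 224)), text, image_license, nsfw, similarity], None
    except Exception as e:
        # Return the decoding error so that the caller skips this row.
        return None, e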

class Fetcher:

    def __init__(self, max_threads=32):
        self.max_threads = max_threads

    def __call__(self, df):
        rows = [list(row) for row in df.iter_rows() if row[0] is not None]
        with ThreadPool(self.max_threads) as thread_pool:
            for row, err in thread_pool.imap_unordered(process, rows):
                if err is not None:
                    continue

                yield row

optimize(
    fn=Fetcher(max_threads=16),
    inputs=parquet_files,
    output_dir="/teamspace/datasets/laion400m",
    num_workers=os.cpu_count(),
    reader=ParquetReader(num_rows=2048, to_pandas=False),
    chunk_bytes="64MB",
)

And the associated Streaming library I have been working on:

https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries

If this is ok, I will make a PR to add a Lightning Data writer to img2dataset.

rom1504 commented 7 months ago

Ok. Curious if you reach similar speed as img2dataset and how you'd like to handle distribution

tchaton commented 7 months ago

It seemed that image download speeds were quite similar between optimize and img2dataset, but I need to be more principled and collect the same metrics to make a more informed comparison.

But first, I need to resolve the low download speed and the low success rate.

The StreamingDataset is faster than WebDataset, though. Actually, you can try it yourself by duplicating my Studio: lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries. It contains everything: python deps, code, data, etc.

I am happy to get on call to chat more about design and optimizations if you are interested.

tchaton commented 7 months ago

The distribution is already fully handled by the optimize and map operators. Check this example: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset?view=public&section=data+processing

Example to tokenize SlimPajama.

import json
from pathlib import Path
import zstandard as zstd
from lightning.data import optimize
from tokenizer import Tokenizer
from functools import partial
from lightning_sdk import Machine

# 1. Function to tokenize the text contained within the Slimpajama files
def tokenize_fn(filepath, tokenizer=None):
    with zstd.open(open(filepath, "rb"), "rt", encoding="utf-8") as f:
        for row in f:
            text = json.loads(row)["text"]
            if json.loads(row)["meta"]["redpajama_set_name"] == "RedPajamaGithub":
                continue  # exclude the GitHub data since it overlaps with starcoder
            text_ids = tokenizer.encode(text, bos=False, eos=True)
            yield text_ids

# 2. Generate the inputs (we are going to optimize all the compressed json files from SlimPajama dataset)
input_dir = "/teamspace/studios/SlimPajama_Dataset/data/train"
inputs = [str(file) for file in Path(input_dir).rglob("*.jsonl.zst")]

# 3. Store the optimized data wherever you want under "/teamspace/datasets" or "/teamspace/s3_connections"
outputs = optimize(
    fn=partial(tokenize_fn, tokenizer=Tokenizer("./checkpoints/Llama-2-7b-hf")), # Note: You can use HF tokenizer or any others
    inputs=inputs,
    output_dir="/teamspace/datasets/slimpajama/train/",
    chunk_size=(2049 * 8012),
    num_nodes=16,
    machine=Machine.DATA_PREP, # use 32 CPU machine
)

This remotely processes the full dataset over 16 nodes and makes it processable by the StreamingDataset.

Or this one to embed Wikipedia in 15 min: https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars

tchaton commented 7 months ago

Hey @rom1504 I am able to get 1.1k images/sec.

I think I have a version of knot resolver that works. I am also using HTTP/2 through the httpx client, and I sorted the parquet files by URL to hopefully help DNS resolution slightly.

But the ratio of success is around 60%, so quite far from yours though. I will try img2dataset again. There is possibly something in docker that is not configured well.

Best, T.C

rom1504 commented 7 months ago

Be careful with sorting the URLs, as you risk DoSing the hosts. I randomly shuffled them in the laion datasets to mitigate this.
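To illustrate why shuffling helps, here is a small sketch that compares the longest streak of consecutive requests to one host before and after a shuffle. The helper names, the seed, and the `*.example` URLs are made up for the illustration; this is not how the laion metadata was actually produced.

```python
import random
from urllib.parse import urlparse


def shuffle_urls(urls, seed=0):
    """Return a shuffled copy so requests to one host are spread over the run."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def longest_same_host_run(urls):
    """Length of the longest streak of consecutive URLs that share a host."""
    longest = current = 0
    previous = None
    for url in urls:
        host = urlparse(url).netloc
        current = current + 1 if host == previous else 1
        longest = max(longest, current)
        previous = host
    return longest


# Hypothetical URLs grouped by host, as they would be after sorting by URL.
grouped = [f"http://{host}.example/img{i}.jpg" for host in "abc" for i in range(100)]
print("grouped :", longest_same_host_run(grouped))
print("shuffled:", longest_same_host_run(shuffle_urls(grouped)))
```

With the grouped list, every host receives all of its requests back to back; after shuffling, the streaks are much shorter, so the load on any single host is spread over the whole run.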

Some people recently have had some success by calling knot with all unique domains to get its cache ready.
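A minimal sketch of that idea, assuming the machine's system resolver already points at the local knot instance: extract the unique hostnames and resolve each one once before the download starts. The function names and the thread count are illustrative assumptions.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def unique_domains(urls):
    """Extract the distinct hostnames from an iterable of URLs."""
    hosts = (urlparse(url).hostname for url in urls)
    return sorted({host for host in hosts if host})


def resolve(domain):
    """Resolve one domain through the system resolver; return (domain, ok)."""
    try:
        socket.getaddrinfo(domain, 443)
        return domain, True
    except OSError:  # socket.gaierror is a subclass of OSError
        return domain, False


def warm_dns_cache(urls, max_threads=64):
    """Resolve every unique domain once so a local caching resolver is primed."""
    domains = unique_domains(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(resolve, domains))


# Usage (performs real DNS lookups):
# results = warm_dns_cache(all_urls)
# print(sum(results.values()), "of", len(results), "domains resolved")
```

The returned dictionary also shows up front which domains no longer resolve, which separates dead hosts from resolver problems.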

Usually I didn't hit DNS issues when using knot, though. Issues only happen in some environments with a restricted DNS setup.
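
A rough sketch of that cache-warming idea, assuming the machine's system resolver already points at a caching resolver such as knot (that wiring is not shown here). The function names `unique_hosts`, `resolve`, and `warm_dns_cache` are hypothetical and not part of img2dataset; only standard-library calls are used.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Collect distinct hostnames, skipping entries that have none."""
    return sorted({h for h in (urlsplit(u).hostname for u in urls) if h})


def resolve(host):
    """Resolve one host through the system resolver; True on success."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def warm_dns_cache(urls, max_workers=64, resolver=resolve):
    """Resolve every unique host once so later lookups can be served from cache."""
    hosts = unique_hosts(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(resolver, hosts))
    return dict(zip(hosts, results))
```

Calling `warm_dns_cache(all_urls)` before the download would return a host-to-success mapping, which also gives an early list of domains that no longer resolve.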

> But the success rate is around 60%, still quite far from yours.

You can log the errors to try and understand what the cause is. In img2dataset there is a wandb table for it.

> Hey @rom1504 I am able to get 1.1k images/sec.

Nice! How many cores are you using?

tchaton commented 7 months ago

Hey @rom1504,

> Be careful with sorting the URLs, as you risk DoSing the hosts. I randomly shuffled them in the LAION datasets to mitigate this.

Interesting. Yes, I didn't think of that. Good call!

> Some people have recently had success by querying knot for all unique domains up front to warm its cache.

This is a good idea. I will see if there is a simple way to add support for this.

> Issues only happen in some environments with a restricted DNS setup.

I am capturing the errors and printing them. I will share what I am getting in a couple of hours.

> Nice! How many cores are you using?

I am using a 32 CPU machine, so slightly lower than what you told me to expect. I will try img2dataset again to get numbers.

tchaton commented 7 months ago

# main ones
- [Errno 101] Network is unreachable
- [Errno 99] Cannot assign requested address
- [Errno -2] Name or service not known

# the rest
- [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'site.aimbulance.com'. (_ssl.c:997)

tchaton commented 7 months ago

Interestingly, the success rate with img2dataset is much lower:

worker  - success: 0.105 - failed to download: 0.894 - failed to resize: 0.001 - images per sec: 308 - count: 10000
total   - success: 0.082 - failed to download: 0.918 - failed to resize: 0.000 - images per sec: 2979 - count: 96608
worker  - success: 0.146 - failed to download: 0.854 - failed to resize: 0.000 - images per sec: 313 - count: 10000
total   - success: 0.088 - failed to download: 0.912 - failed to resize: 0.000 - images per sec: 3288 - count: 106608
worker  - success: 0.128 - failed to download: 0.872 - failed to resize: 0.000 - images per sec: 300 - count: 10000
total   - success: 0.091 - failed to download: 0.909 - failed to resize: 0.000 - images per sec: 3497 - count: 116608
worker  - success: 0.124 - failed to download: 0.876 - failed to resize: 0.000 - images per sec: 311 - count: 10000
total   - success: 0.094 - failed to download: 0.906 - failed to resize: 0.000 - images per sec: 3797 - count: 126608
worker  - success: 0.174 - failed to download: 0.825 - failed to resize: 0.001 - images per sec: 343 - count: 10000
total   - success: 0.100 - failed to download: 0.900 - failed to resize: 0.000 - images per sec: 4097 - count: 136608
worker  - success: 0.159 - failed to download: 0.840 - failed to resize: 0.001 - images per sec: 178 - count: 5536
total   - success: 0.102 - failed to download: 0.898 - failed to resize: 0.000 - images per sec: 4263 - count: 142144
worker  - success: 0.090 - failed to download: 0.909 - failed to resize: 0.001 - images per sec: 317 - count: 10000
total   - success: 0.101 - failed to download: 0.898 - failed to resize: 0.000 - images per sec: 4563 - count: 152144
worker  - success: 0.149 - failed to download: 0.851 - failed to resize: 0.000 - images per sec: 313 - count: 10000
total   - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 4863 - count: 162144
worker  - success: 0.082 - failed to download: 0.918 - failed to resize: 0.000 - images per sec: 305 - count: 10000
total   - success: 0.103 - failed to download: 0.897 - failed to resize: 0.000 - images per sec: 5163 - count: 172144
worker  - success: 0.120 - failed to download: 0.880 - failed to resize: 0.000 - images per sec: 304 - count: 10000
total   - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 5463 - count: 182144
worker  - success: 0.102 - failed to download: 0.897 - failed to resize: 0.001 - images per sec: 316 - count: 10000
total   - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 5763 - count: 192144
worker  - success: 0.099 - failed to download: 0.901 - failed to resize: 0.000 - images per sec: 305 - count: 10000
total   - success: 0.104 - failed to download: 0.896 - failed to resize: 0.000 - images per sec: 6063 - count: 202144
worker  - success: 0.194 - failed to download: 0.806 - failed to resize: 0.000 - images per sec: 318 - count: 10000
total   - success: 0.108 - failed to download: 0.892 - failed to resize: 0.000 - images per sec: 6363 - count: 212144
worker  - success: 0.152 - failed to download: 0.848 - failed to resize: 0.000 - images per sec: 308 - count: 10000
total   - success: 0.110 - failed to download: 0.890 - failed to resize: 0.000 - images per sec: 6644 - count: 222144

{
    "count": 10000,
    "successes": 900,
    "failed_to_download": 9093,
    "failed_to_resize": 7,
    "duration": 31.51988196372986,
    "start_time": 1707166867.7824914,
    "end_time": 1707166899.3023734,
    "status_dict": {
        "<urlopen error [Errno -2] Name or service not known>": 996,
        "<urlopen error [Errno -3] Temporary failure in name resolution>": 31,
        "success": 900,
        "Image decoding error": 7,
        "HTTP Error 404: Not Found": 105,
        "timed out": 1,
        "HTTP Error 403: Forbidden": 23,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for '0.realtorpage.io'. (_ssl.c:997)>": 1,
        "<urlopen error [Errno 99] Cannot assign requested address>": 7936
    }
}

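
To see which failure dominates, the counts in a stats report like the one above can be turned into shares. This is a small sketch, not an img2dataset utility: `error_shares` is a hypothetical helper, and the status keys are shortened by hand from the report for readability.

```python
def error_shares(stats):
    """Return (status, count, share) tuples sorted by count, largest first."""
    total = stats["count"]
    ranked = sorted(stats["status_dict"].items(), key=lambda kv: kv[1], reverse=True)
    return [(status, n, n / total) for status, n in ranked]


# Counts copied from the report above; long messages are abbreviated.
stats = {
    "count": 10000,
    "status_dict": {
        "[Errno -2] Name or service not known": 996,
        "[Errno -3] Temporary failure in name resolution": 31,
        "success": 900,
        "Image decoding error": 7,
        "HTTP Error 404: Not Found": 105,
        "timed out": 1,
        "HTTP Error 403: Forbidden": 23,
        "[SSL: CERTIFICATE_VERIFY_FAILED] hostname mismatch": 1,
        "[Errno 99] Cannot assign requested address": 7936,
    },
}

for status, n, share in error_shares(stats):
    print(f"{share:6.1%}  {n:5d}  {status}")
```

Ranking the statuses this way puts the largest failure bucket on the first line, so it is clear which error to investigate first.
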
tchaton commented 7 months ago

Hey @rom1504 I found this interesting issue: https://github.com/pola-rs/polars/issues/14358. I need to add profiling. But it seems you got around this by creating shards from the parquet files to optimize the distribution: https://github.com/rom1504/img2dataset/blob/main/img2dataset/reader.py#L189.
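
As a generic illustration of the sharding idea (this is not the logic of the linked `reader.py`, just a sketch under that name-free assumption), rows can be cut into fixed-size shards so each worker receives a similar amount of work. `make_shards` and the shard size are hypothetical.

```python
from itertools import islice


def make_shards(rows, shard_size):
    """Yield (shard_id, rows) pairs, each holding at most shard_size rows."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    iterator = iter(rows)
    shard_id = 0
    while True:
        chunk = list(islice(iterator, shard_size))
        if not chunk:
            return
        yield shard_id, chunk
        shard_id += 1


# 25 hypothetical rows split into shards of 10 give shard sizes 10, 10 and 5.
for shard_id, chunk in make_shards(range(25), 10):
    print(shard_id, len(chunk))
```

Because the input is consumed lazily, only one shard needs to be held in memory at a time.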

This is a great idea. I am going to try this out.

tchaton commented 7 months ago

Hey @rom1504 I started a distributed Job on 32 nodes to download the dataset. This is my first test run. I will keep you updated.

[Screenshot 2024-02-09 at 15 57 50]

SomnusQue commented 7 months ago

Sorry to bother you. Could you tell me how to download the laion400M dataset? I tried to download it with this command:

img2dataset --url_list laion400m-meta --input_format "parquet" --url_col "URL" --caption_col "TEXT" --output_format webdataset --output_folder laion400m-data --processes_count 16 --thread_count 128 --image_size 256 --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True

but something went wrong.

tchaton commented 7 months ago

Hey @SomnusQue, here is the full blog post explaining how to download the dataset: lightning.ai/lightning-ai/studios/download-stream-400m-images-text~01hg0zg8fyybp7p1sma6g9dkzm.

@rom1504 I would appreciate it if you could have a read and give me your thoughts.