rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
MIT License
3.75k stars 341 forks

Process hanging forever before the end #289

Open HugoLaurencon opened 1 year ago

HugoLaurencon commented 1 year ago

Hi. When downloading many images, I often notice that the job seems to stop making progress near the end. Fewer than 1% of the images may remain to be downloaded, but nothing is written to the logs for hours and the job never finishes, so I have to kill it manually. Is there an option to automatically finish the job if I don't mind skipping the last few images that cause the process to hang? Thanks
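In the meantime, one workaround sketch (not a built-in img2dataset option) is to run the download in a child process and give up after a wall-clock deadline. The deadline, paths and column names below are hypothetical, and terminating this way can leave the last shard partial and img2dataset's own worker processes behind, so it only automates the manual kill described above:

import multiprocessing

import img2dataset

def run_download():
    # Same style of call as shown later in this thread; paths and columns are hypothetical.
    img2dataset.download(
        url_list="metadata/",
        output_folder="shards/",
        input_format="parquet",
        url_col="url",
        caption_col="text",
        output_format="webdataset",
    )

if __name__ == "__main__":
    downloader = multiprocessing.Process(target=run_download)
    downloader.start()
    downloader.join(timeout=6 * 3600)  # hypothetical 6-hour deadline
    if downloader.is_alive():
        # Give up on the stragglers; this can leave a partial last shard and
        # orphaned worker processes that still need a manual kill.
        downloader.terminate()
        downloader.join()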

zwsjink commented 1 year ago

Same here. I tried running the img2dataset command line, and near the end (monitoring the network shows almost nothing being received) the process just would not exit:

USER PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME COMMAND
root 1     0.0  0.0  27344    23008   ?   Ss   Aug02 0:02 python3 /home/admin/img2dataset_wordir/oss_uploader.py
root 6640  0.0  0.0  0        0       ?   Z    10:37 0:00 [python3]
root 7519  1.3  0.2  5987436  381776  ?   Sl   10:38 0:21 /usr/bin/python3 /usr/local/bin/img2dataset /tmp/0174f175a8d04872
root 7586  0.0  0.0  14876    11300   ?   S    10:38 0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import
root 7587  0.6  0.1  2859996  200612  ?   Sl   10:38 0:10 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7588  19.3 0.8  19582796 1162308 ?   Sl   10:38 5:01 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7589  19.8 0.8  19576436 1117928 ?   Sl   10:38 5:09 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7590  10.4 0.8  19588684 1147524 ?   Sl   10:38 2:43 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7591  20.5 0.9  19582968 1312364 ?   Sl   10:38 5:20 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7592  17.2 0.9  19582052 1287668 ?   Sl   10:38 4:29 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7593  13.9 0.8  19582236 1163680 ?   Sl   10:38 3:37 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7594  20.1 0.8  19583208 1123748 ?   Sl   10:38 5:14 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7595  15.4 0.9  19577776 1242608 ?   Sl   10:38 4:01 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 21257 0.0  0.0  0        0       ?   Z    10:42 0:00 [python3]
root 25042 6.0  0.5  19579944 678640  ?   Sl   10:42 1:17 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 28048 0.4  0.1  2858944  200792  ?   Sl   10:43 0:05 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 35885 0.3  0.1  2858968  200660  ?   Sl   10:44 0:04 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 36838 0.0  0.0  0        0       ?   Z    10:45 0:00 [python3]
root 36908 0.0  0.0  0        0       ?   Z    10:45 0:00 [python3]
root 36944 0.0  0.0  0        0       ?   Z    10:45 0:00 [python3]
root 37049 0.0  0.0  0        0       ?   Z    10:45 0:00 [python3]
root 91781 580  0.0  2602164  101144  ?   Rl   11:04 0:05 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;

joon612 commented 1 year ago

Same issue here

rom1504 commented 1 year ago

Any information on what is different about your environment that could be causing this?


joon612 commented 1 year ago


I am downloading datacomp_1b on Azure Batch nodes, Ubuntu 20.04 LTS, size Standard_D4a_v4.


img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,  # 512
    output_folder=str(shard_dir),
    processes_count=args.processes_count,  # 4
    thread_count=args.thread_count,  # 64
    resize_mode=args.resize_mode,  # keep_ratio_largest
    resize_only_if_bigger=not args.no_resize_only_if_bigger,  # not False
    encode_format=args.encode_format,  # jpg
    output_format=args.output_format,  # webdataset
    input_format="parquet",
    url_col="url",
    caption_col="text",
    bbox_col=bbox_col,  # face_bboxes
    save_additional_columns=["uid"],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=args.retries,  # 2
    enable_wandb=args.enable_wandb,  # False
    wandb_project=args.wandb_project,  # datacomp
)
joon612 commented 1 year ago

When I download it, I can reproduce the hang about 15% of the time.

joon612 commented 1 year ago

Found a possible reason: a parquet file in the shards was broken (it could not be read).
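For anyone checking the same hypothesis, a small hedged sketch (assuming pyarrow is installed; the metadata/ directory is hypothetical) that tries to fully read every input parquet shard and reports the ones that fail:

import glob

import pyarrow.parquet as pq

# Fully read each shard so truncated or corrupt parquet files surface here.
for path in sorted(glob.glob("metadata/*.parquet")):  # hypothetical input dir
    try:
        pq.ParquetFile(path).read()
    except Exception as exc:
        print(f"broken parquet shard: {path}: {exc}")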

coleridge72 commented 8 months ago

Same issue here, and if I interrupt the hanging process the resulting data is unusable; feeding it to a torch dataloader raises the following error:

  File "/usr/lib/python3.8/tarfile.py", line 686, in read
    raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

Does anyone know of any combination of parameters to prevent this hanging?

https://github.com/rom1504/img2dataset/issues/402
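One hedged way to surface that ReadError before training rather than inside the dataloader is to walk every member of each output tar with the standard library (the small_10k folder name is taken from the listing further down this thread):

import glob
import tarfile

# Read every member of each shard; truncated tars fail here with the same
# tarfile.ReadError("unexpected end of data") seen in the dataloader.
for path in sorted(glob.glob("small_10k/*.tar")):
    try:
        with tarfile.open(path) as tar:
            for member in tar:
                if member.isfile():
                    tar.extractfile(member).read()
    except tarfile.ReadError as exc:
        print(f"truncated or corrupt tar: {path}: {exc}")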

rom1504 commented 8 months ago

You can delete any partial tar by checking whether it has a .json file next to it.
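A minimal sketch of that cleanup, assuming img2dataset writes a NNNNN_stats.json next to each completed NNNNN.tar (adjust the suffix if your version names the json differently; the output folder is hypothetical):

import glob
import os

output_folder = "shards/"  # hypothetical output folder

# A tar without its companion stats json is treated as partial and removed.
for tar_path in glob.glob(os.path.join(output_folder, "*.tar")):
    shard_id = os.path.splitext(os.path.basename(tar_path))[0]
    stats_path = os.path.join(output_folder, shard_id + "_stats.json")
    if not os.path.exists(stats_path):
        print("removing partial shard:", tar_path)
        os.remove(tar_path)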


coleridge72 commented 8 months ago

Hmm, I can't see any json files, only:

> du -sh small_10k/*
2.9M    small_10k/00000.parquet
427M    small_10k/00000.tar

fwiw: this is the error I get on killing the script. I tried running with processes_count=1 but still no luck:

  UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown

  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/lib/python3.11/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
rom1504 commented 8 months ago

Sounds like your issue is completely different from the current one, which is about successful runs that get stuck at the end.

I advise you to open a new issue with more information about your environment and the command you are running.

krishnansr commented 8 months ago

Same here. Observed hanging at the end while trying to download laion400m.

rom1504 commented 8 months ago

Please provide any information you have to help figure this out. One more person reporting the same thing does not help much.


krishnansr commented 8 months ago

Of course. I have a clone with custom changes that runs the download distributed across a cluster of nodes. The output_format was 'files' for my use case, so I didn't really have any tars.

Interestingly, I had another job that downloaded coyo-700m and it completed successfully without hanging.

I'm planning to get to the bottom of it with a small parquet file later this week.
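For reference, a quick hedged way to cut such a small reproduction file out of a larger metadata parquet, assuming pyarrow (the file names and row count are hypothetical):

import pyarrow.parquet as pq

# Keep only the first 10,000 rows of a larger metadata shard for a fast repro run.
table = pq.read_table("laion400m-part-00000.parquet")  # hypothetical input
pq.write_table(table.slice(0, 10_000), "repro_10k.parquet")

Pointing img2dataset at repro_10k.parquet with the same flags should make the hang, if it reproduces at all, much faster to iterate on.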