HugoLaurencon opened this issue 1 year ago
Same here. I ran the img2dataset command line and, near the end (monitoring the network showed almost no incoming traffic), the process just would not exit:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 27344 23008 ? Ss Aug02 0:02 python3 /home/admin/img2dataset_wordir/oss_uploader.py
root 6640 0.0 0.0 0 0 ? Z 10:37 0:00 [python3]
Same issue here
Any information on what is different about your environment and causing this?
I am downloading datacomp_1b on Azure Batch nodes, Ubuntu 20 LTS, size Standard_D4a_v4.
img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,  # 512
    output_folder=str(shard_dir),
    processes_count=args.processes_count,  # 4
    thread_count=args.thread_count,  # 64
    resize_mode=args.resize_mode,  # keep_ratio_largest
    resize_only_if_bigger=not args.no_resize_only_if_bigger,  # not False
    encode_format=args.encode_format,  # jpg
    output_format=args.output_format,  # webdataset
    input_format="parquet",
    url_col="url",
    caption_col="text",
    bbox_col=bbox_col,  # face_bboxes
    save_additional_columns=["uid"],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=args.retries,  # 2
    enable_wandb=args.enable_wandb,  # False
    wandb_project=args.wandb_project,  # datacomp
)
When downloading it, I can reproduce the hang about 15% of the time.
Found a possible reason: a parquet file in the shards was broken (it could not be read).
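To confirm that, a quick way to find unreadable shard parquet files is to try opening each one with pyarrow. A minimal sketch, assuming pyarrow is installed; the shards_dir path is a placeholder for whichever folder holds the shard parquet files:

# Minimal sketch: report shard parquet files that cannot be read.
from pathlib import Path
import pyarrow.parquet as pq

shards_dir = Path("small_10k")  # placeholder: your shards folder

for parquet_path in sorted(shards_dir.glob("*.parquet")):
    try:
        pq.read_table(parquet_path)  # raises if the file is truncated or corrupted
    except Exception as exc:
        print(f"broken shard parquet: {parquet_path} ({exc})")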
Same issue here, and if I interrupt the hanging process the resulting data is unusable; feeding it to a torch dataloader gives the following error:
File "/usr/lib/python3.8/tarfile.py", line 686, in read
raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data
Does anyone know of any combination of parameters to prevent this hanging?
You can delete any partial tar by checking whether it has a .json file next to it.
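A minimal sketch of that cleanup; the exact json name and the output_dir path are assumptions, adjust them to your output layout:

# Minimal sketch: delete tars that have no json written next to them.
from pathlib import Path

output_dir = Path("small_10k")  # placeholder: your output folder

for tar_path in sorted(output_dir.glob("*.tar")):
    candidates = [tar_path.with_suffix(".json"),
                  tar_path.with_name(tar_path.stem + "_stats.json")]
    if not any(c.exists() for c in candidates):
        print(f"removing partial shard: {tar_path}")
        tar_path.unlink()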
hmm I can't see any json files, only:
> du -sh small_10k/*
2.9M small_10k/00000.parquet
427M small_10k/00000.tar
FWIW, this is the error I get on killing the script. I tried running with processes_count=1 but still no luck:
UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/multiprocessing/pool.py", line 114, in worker
task = get()
^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 364, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
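Since no json shows up next to the tar here, another way to spot a truncated shard is to read the tar all the way through and catch the ReadError mentioned earlier in the thread. A minimal sketch; the output_dir path is a placeholder:

# Minimal sketch: detect truncated webdataset tars by forcing every member to be read.
import tarfile
from pathlib import Path

output_dir = Path("small_10k")  # placeholder: your output folder

for tar_path in sorted(output_dir.glob("*.tar")):
    try:
        with tarfile.open(tar_path) as tf:
            for member in tf:
                if member.isfile():
                    tf.extractfile(member).read()  # truncated data raises tarfile.ReadError
    except tarfile.ReadError as exc:
        print(f"truncated tar: {tar_path} ({exc})")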
Sounds like your issue is completely different from the current one, which is about successful runs that get stuck at the end.
I advise you to open a new issue with more information about your environment and the command you are running.
Same here. Observed hanging at the end while trying to download laion400m.
Please provide any information you have to help figure that out. One more person reporting the same thing does not help much.
Of course. I have a clone with custom changes that runs the download distributed across a cluster of nodes. The output_format was 'files' for my use case, so I didn't really have any tars.
Interestingly, I had another job that downloaded coyo 700m and completed successfully without hanging.
I'm planning to get to the bottom of it with a small parquet file later this week.
Hi. When trying to download many images, I often notice that the job seems to stop making progress near the end. There can be less than 1% of the images left to download, but nothing is written to the logs for hours and the job just doesn't finish, so I have to kill it manually. Is there an option to automatically finish the job if I don't mind not downloading these last images that cause the process to hang? Thanks
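No built-in option for that is mentioned in this thread. As a workaround, one can run the download in a child process and give up after a deadline instead of letting it hang forever. A minimal sketch; the timeout value and all download arguments are placeholders, and worker processes spawned inside may need to be killed separately:

# Minimal workaround sketch, not a built-in img2dataset option.
import multiprocessing
import img2dataset

def run_download():
    img2dataset.download(
        url_list="metadata/",       # placeholder
        output_folder="shards/",    # placeholder
        input_format="parquet",
        url_col="url",
        caption_col="text",
        output_format="webdataset",
    )

if __name__ == "__main__":
    proc = multiprocessing.Process(target=run_download)
    proc.start()
    proc.join(timeout=6 * 3600)  # placeholder deadline: 6 hours
    if proc.is_alive():
        print("still hanging after the deadline, terminating")
        proc.terminate()  # note: pool workers spawned by the download may outlive this
        proc.join()

Partial shards left behind can then be cleaned up with the json/tar checks sketched earlier in the thread.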