rom1504 closed this issue 1 year ago
I'm having this issue too across multiple environments
https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference/logger.py#L41 can happen if this branch runs before any call to the writer logger is done
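The failure mode in that branch can be reproduced in isolation. This is a minimal sketch inferred from the traceback; `reader_min_start_time` and the `stats` layout are assumptions, not the actual logger.py code:

```python
import math

def reader_min_start_time(stats):
    """Mimic the logger's reader loop: take the minimum start_time across
    worker stats. Raises KeyError if any worker entry never logged one,
    which is exactly the reported failure."""
    start_time_no_initial_load = math.inf
    for v in stats.values():
        start_time_no_initial_load = min(start_time_no_initial_load, v["start_time"])
    return start_time_no_initial_load
```

Calling this with a stats entry that has no `start_time` key (a writer that never ran) raises `KeyError: 'start_time'`, matching the reported log line.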
Action item:
https://github.com/rom1504/clip-retrieval/commit/c978a03ec54819a604c2d06b62db1f27ea71a217 probably caused by this commit
https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference/runner.py#L34 could happen if there is no shard to process for this runner
So probably we are generating empty partitions (cc @nousr )
Need to print the output of calculate_partition_count to figure out what's going on
Could also add more checks in the code to get better errors
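One way to surface the suspected empty partitions early: log each partition's sample count and fail fast. This is a hedged sketch, not the actual clip-retrieval code; `check_partitions` and its input are hypothetical, with the real entry point being `calculate_partition_count` in the source:

```python
def check_partitions(partition_sizes):
    """Print each partition's sample count and fail fast on empty ones,
    so the problem surfaces here instead of as a KeyError in the logger."""
    for i, size in enumerate(partition_sizes):
        print(f"partition {i}: {size} samples")
        if size == 0:
            raise ValueError(
                f"partition {i} is empty; the sample count may be smaller "
                "than the number of workers/partitions"
            )
```

With 3 estimated samples split across more workers than samples (as in the pasted log), a check like this would point directly at the empty partition.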
So probably we are generating empty partitions (cc @nousr )
anecdotal evidence:
I've checked, and all the partitions we generated for 2b-en were ~1.89 GB each.
Could an empty partition still end up >0 GB on disk?
No, I think in your case (laion2b on slurm) it's all working as expected, but something seems off in the small-scale case with the local distributor.
Need more testing/printing to know exactly what; the example notebook reproduces the problem.
Hello! I'm also encountering the error. I wondered if the CPU/GPU type of environment could be the cause, but it's the same either way. https://github.com/rom1504/clip-retrieval/blob/main/notebook/clip-client-query-api.ipynb
!clip-retrieval inference --input_dataset image_folder --output_folder embedding_folder
The number of samples has been estimated to be 3
Starting the worker
dataset is 12
Starting work on task 0
100%|████████████████████████████████████████| 354M/354M [00:01<00:00, 190MiB/s]
warming up with batch size 256 on cuda
/usr/local/lib/python3.7/dist-packages/clip_retrieval/load_clip.py:90: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:594.)
model.encode_image(image_tensor)
done warming up in 11.390223741531372s
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference/logger.py", line 137, in reader
start_time_no_initial_load = min(start_time_no_initial_load, v["start_time"])
KeyError: 'start_time'
logger error 'start_time
Yeah, this is a known bug. It will be fixed soon.
You can use the previous version in the meantime.
Fix released as 2.35.1
still happens to me :(
perhaps we can change it to:

```python
for v in stats.values():
    if "start_time" in v:
        start_time_no_initial_load = min(start_time_no_initial_load, v["start_time"])
```
the bug was already fixed
you are probably hitting a different issue that causes the same error
we could improve the error message though, yes
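An improved error could name the offending worker entries instead of failing with a bare KeyError. This is a hypothetical sketch of such a check, not the shipped logger code:

```python
def validate_stats(stats):
    """Report which worker entries are missing 'start_time' instead of
    letting min(...) fail with a bare KeyError: 'start_time'."""
    missing = [name for name, v in stats.items() if "start_time" not in v]
    if missing:
        raise RuntimeError(
            f"logger stats missing 'start_time' for workers {missing}; "
            "this usually means a worker produced no output (empty partition?)"
        )
```

Running such a validation before the reader's `min(...)` loop would turn the cryptic traceback into a message that points at the empty-partition hypothesis.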
It can be reproduced with the Colab notebook.
Not sure how that can happen given the current code, but there must be some edge case.