rom1504 closed this issue 1 year ago
I'm having this issue too across multiple environments
https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference/logger.py#L41 can happen if this branch runs before any call to the writer logger is done
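The failure mode in that branch can be reproduced in isolation. This is a minimal sketch inferred from the traceback; `reader_min_start_time` and the `stats` layout are assumptions, not the actual logger.py code:

```python
import math

def reader_min_start_time(stats):
    """Mimic the logger's reader loop: take the minimum start_time across
    worker stats. Raises KeyError if any worker entry never logged one,
    which is exactly the reported failure."""
    start_time_no_initial_load = math.inf
    for v in stats.values():
        start_time_no_initial_load = min(start_time_no_initial_load, v["start_time"])
    return start_time_no_initial_load
```

Calling this with a stats entry that has no `start_time` key (a writer that never ran) raises `KeyError: 'start_time'`, matching the reported log line.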
Action item:
https://github.com/rom1504/clip-retrieval/commit/c978a03ec54819a604c2d06b62db1f27ea71a217 probably caused by this commit
https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference/runner.py#L34 could happen if there is no shard to process for this runner
So probably we are generating empty partitions (cc @nousr )
Need to print the output of calculate_partition_count to figure out what's going on
Could also add more checks in the code to get better errors
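One way to surface the suspected empty partitions early: log each partition's sample count and fail fast. This is a hedged sketch, not the actual clip-retrieval code; `check_partitions` and its input are hypothetical, with the real entry point being `calculate_partition_count` in the source:

```python
def check_partitions(partition_sizes):
    """Print each partition's sample count and fail fast on empty ones,
    so the problem surfaces here instead of as a KeyError in the logger."""
    for i, size in enumerate(partition_sizes):
        print(f"partition {i}: {size} samples")
        if size == 0:
            raise ValueError(
                f"partition {i} is empty; the sample count may be smaller "
                "than the number of workers/partitions"
            )
```

With 3 estimated samples split across more workers than samples (as in the pasted log), a check like this would point directly at the empty partition.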
So probably we are generating empty partitions (cc @nousr )
anecdotal evidence:
I've checked, and all the partitions we generated for 2b-en were ~1.89 GB each.
Could an empty partition still end up >0 GB on disk?
No, I think in your case (laion2b on slurm) it's all working as expected, but something seems off in the small-scale case with the local distributor.
Need more testing/printing to know exactly what; the example notebook reproduces the problem.
Hello! I'm also encountering the error. I wondered if the CPU/GPU type of environment could be the cause, but it's the same either way. https://github.com/rom1504/clip-retrieval/blob/main/notebook/clip-client-query-api.ipynb
!clip-retrieval inference --input_dataset image_folder --output_folder embedding_folder
The number of samples has been estimated to be 3
Starting the worker
dataset is 12
Starting work on task 0
100%|████████████████████████████████████████| 354M/354M [00:01<00:00, 190MiB/s]
warming up with batch size 256 on cuda
/usr/local/lib/python3.7/dist-packages/clip_retrieval/load_clip.py:90: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:594.)
model.encode_image(image_tensor)
done warming up in 11.390223741531372s
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference/logger.py", line 137, in reader
start_time_no_initial_load = min(start_time_no_initial_load, v["start_time"])
KeyError: 'start_time'
logger error 'start_time
Yeah, this is a known bug. It will be fixed soon.
You can use the previous version in the meantime.
Fix released as 2.35.1
still happens to me :(
perhaps we can change it to:

```python
for v in stats.values():
    if "start_time" in v:
        start_time_no_initial_load = min(start_time_no_initial_load, v["start_time"])
```
the bug was already fixed
you are probably hitting a different issue that causes the same error
we could improve the error message though, yes
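An improved error could name the offending worker entries instead of failing with a bare KeyError. This is a hypothetical sketch of such a check, not the shipped logger code:

```python
def validate_stats(stats):
    """Report which worker entries are missing 'start_time' instead of
    letting min(...) fail with a bare KeyError: 'start_time'."""
    missing = [name for name, v in stats.items() if "start_time" not in v]
    if missing:
        raise RuntimeError(
            f"logger stats missing 'start_time' for workers {missing}; "
            "this usually means a worker produced no output (empty partition?)"
        )
```

Running such a validation before the reader's `min(...)` loop would turn the cryptic traceback into a message that points at the empty-partition hypothesis.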
It can be reproduced with the Colab notebook.
Not sure how that can happen given the current code, but there must be some edge case.