Open 18445864529 opened 11 months ago
This can be temporarily worked around either by disabling gradient checkpointing (though the memory requirement increases dramatically) or by using single-card training. But I got another error during single-card training:
RuntimeError: Function 'SoftmaxBackward0' returned nan values in its 0th output.
which is raised by the line attn = attn.softmax(dim=-1) in the forward function of eva_vit.py.
It always happens after a certain number of iterations (e.g., at step 450/17710 of the first epoch).
I simply used the provided code and script for COCO fine-tuning, and I don't understand why I get all these errors. Could someone please help? @LiJunnan1992
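A common source of NaNs in SoftmaxBackward0 is fp16 overflow in the attention logits feeding the softmax. The sketch below is illustrative only (it is NumPy, not the eva_vit.py code): it shows how a naive softmax blows up in float16 once a logit exceeds the exp() range, and how the standard max-subtraction trick avoids it. Another common mitigation is to cast the attention logits to float32 before the softmax.

```python
import numpy as np

def naive_softmax(x):
    # exp() overflows float16 for logits above ~11, giving inf/inf = nan
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    # subtracting the row max keeps exp() within the representable range
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[20.0, 5.0, 1.0]], dtype=np.float16)
print(naive_softmax(logits))   # contains nan: fp16 exp(20) overflows to inf
print(stable_softmax(logits))  # finite probabilities summing to ~1
```

PyTorch's own softmax already does the max subtraction, so when NaNs still appear in mixed precision the overflow is usually in the matmul producing the logits; scaling or upcasting that step is the usual fix.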
self._wrapped_model._set_static_graph()
solved it for me.
When trying to finetune BLIP2 with caption_coco_ft.yaml, I got the following error:
And after setting find_unused_parameters=True and TORCH_DISTRIBUTED_DEBUG=DETAIL, I got this traceback message:
Could someone please offer some idea of how I can solve this?
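For anyone hitting the same error, a minimal sketch of how those two settings are applied (illustrative names, not the LAVIS runner code). The environment variable makes DDP name the exact parameters it considers unused, and find_unused_parameters=True makes DDP tolerate parameters that receive no gradient in an iteration, at some runtime cost:

```python
import os

# Must be exported before the process group is initialized; DETAIL makes
# DDP report which specific parameters it thinks went unused.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_debugging(model: torch.nn.Module) -> DDP:
    # Tolerate parameters that get no gradient in a given iteration,
    # instead of raising the "expected to have finished reduction" error.
    return DDP(model, find_unused_parameters=True)
```

Note that find_unused_parameters=True only masks the symptom when the real conflict is gradient checkpointing under DDP; the static-graph fix mentioned elsewhere in this thread addresses that case directly.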