[12:53:11.383037] NaN loss encountered. Skipping this batch.
Traceback (most recent call last):
  File "train.py", line 262, in <module>
    main(args)
  File "train.py", line 230, in main
    train_stats = train_one_epoch(
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/engine.py", line 36, in train_one_epoch
    for data_iter_step, (examples, labels, example_mask, images, indicators) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 154, in log_every
    meters=str(self),
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 112, in __str__
    "{}: {}".format(name, str(meter))
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 81, in __str__
    global_avg=self.global_avg,
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 67, in global_avg
    return self.total / self.count
ZeroDivisionError: float division by zero
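The traceback shows a secondary symptom: when every batch is skipped because of NaN loss, the loss meter never receives a value, so its `count` stays 0 and `global_avg` divides by zero when the logger formats it. A minimal sketch of the failure and a possible guard (the class here is a simplified stand-in, not the project's actual `util/misc.py` code):

```python
import math

class SmoothedValue:
    """Simplified meter: tracks a running total and count of updates."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.total += value * n
        self.count += n

    @property
    def global_avg(self):
        # Guard the empty-meter case: if every batch was skipped
        # (NaN loss), count is still 0 and the bare division would
        # raise ZeroDivisionError inside the logging path.
        if self.count == 0:
            return float("nan")
        return self.total / self.count

meter = SmoothedValue()
print(math.isnan(meter.global_avg))  # empty meter no longer raises
meter.update(2.0)
print(meter.global_avg)
```

Returning NaN keeps the empty state visible in logs without crashing; the real fix, as noted below, is to stop the NaN losses at the source.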
Hello, it seems you are using the Vicuna model as the pre-trained LLM. It's possible that you incorrectly loaded the vicuna7B model instead of the corresponding delta model, which results in a NaN loss the whole time.