Get error on Train and validate MART on COOT embeddings

DesaleF commented 3 years ago

First of all, thank you for this amazing work. I am trying to run inference on my own dataset, but first I want to check that the code works for me. I am now trying to run validation of video captioning with the MART model. Everything works fine except this command: python train_caption.py -c config/caption/paper2020/yc2_100m_coot_vidclip_mart.yaml --validate --load_model provided_models_caption/yc2_100m_coot_vidclip_mart.pth

I am getting the following error. I tried to fix the problem but was not able to. Is there anything you would suggest? By the way, I also tried your suggestions for troubleshooting and dealing with known bugs, but they did not solve this issue.

  Dataset youcook2 #457 val input coot_emb
  Model: RecursiveTransformer
  Parameters total: 24045754, frozen: 0
  Parameters total: 297600, frozen: 0
  Logger: 'trainlog' to experiments/caption/default/yc2_100m_coot_vidclip_mart_run1/logs/run_2021_03_18_10_54_50.log
   INFO Running on cuda: True, multi-gpu: False, gpus found: 8, fp16 amp: False.
   INFO Random seed: 27996
   INFO Loading model from checkpoint file provided_models_caption/yc2_100m_coot_vidclip_mart.pth
  Traceback (most recent call last):
    File "train_caption.py", line 99, in <module>
      main()
    File "train_caption.py", line 87, in main
      trainer.validate_epoch(val_loader)
    File "/root/miniconda3/envs/coot/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
      return func(*args, **kwargs)
    File "/home/coot-videotext/mart/trainer_caption.py", line 428, in validate_epoch
      self.ema.assign(self.model)
    File "/home/coot-videotext/mart/optimization.py", line 229, in assign
      assert name in self.shadow
  AssertionError

simon-ging commented 3 years ago

Hard to see what's happening here without any code. The first line says you are still validating on the youcook2 dataset with 457 datapoints. I am not sure this is what you want, since you said you are using a new dataset.

You can try disabling EMA in the config with ema_decay: 0, or deleting the ema.assign line. It's not necessary to use an exponential moving average of the weights; you can just use the regular weights.
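
For readers unfamiliar with the idea, here is a minimal sketch of an exponential moving average over named parameters. It is not the code in mart/optimization.py: the class name SimpleEMA and its register/update methods are assumptions made for illustration, and plain floats stand in for tensors. It shows one way that an assertion like the one in the traceback can fail, namely when assign is called on an EMA object whose shadow dict was never filled.

```python
class SimpleEMA:
    """Toy exponential moving average over a dict of named float parameters.

    Hypothetical stand-in for illustration; not the repository's EMA class.
    """

    def __init__(self, decay):
        self.decay = decay
        self.shadow = {}  # parameter name -> averaged value

    def register(self, params):
        # Start each average at the current parameter value.
        for name, value in params.items():
            self.shadow[name] = value

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * param
        for name, value in params.items():
            assert name in self.shadow
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )

    def assign(self, params):
        # Overwrite the live parameters with their averages for evaluation.
        for name in params:
            assert name in self.shadow  # fails if nothing was registered
            params[name] = self.shadow[name]


# Normal use: register, change the weight, update the average.
params = {"w": 1.0}
ema = SimpleEMA(decay=0.9)
ema.register(params)
params["w"] = 2.0
ema.update(params)  # shadow["w"] = 0.9 * 1.0 + 0.1 * 2.0

# Failure mode: assign() on an EMA that never saw the parameters.
unregistered = SimpleEMA(decay=0.9)
try:
    unregistered.assign({"w": 1.0})
    assign_failed = False
except AssertionError:
    assign_failed = True
```

In this sketch, assign only works after register has seen the same parameter names, which is why skipping the averaged weights is a reasonable workaround when none are available.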

DesaleF commented 3 years ago

Hello @gingsi! I actually did not change anything in the code; I followed the video captioning section of the README. I am running on the youcook2 dataset just to check that everything works there, and will then switch to my own dataset using a similar pipeline. I also disabled EMA, but I still get the same problem.

simon-ging commented 3 years ago

Actually, this was a bug in the code. The problem was that when loading fixed model weights with --load_model, there is no EMA state that can be loaded, so EMA needs to be disabled in those cases.

I just pushed a fix and now the command works on my end. Thanks for notifying. Please check if it works on your end and let me know if you have further problems.

DesaleF commented 3 years ago

Now it works properly. Thank you for the quick fix!

DesaleF commented 3 years ago

Hello @gingsi, I was looking into the code a bit and found a bug related to EMA during training for captioning.

When I run python train_caption.py -c config/caption/paper2020/yc2_100m_coot_vidclip_mart.yaml

I get this error:

  Traceback (most recent call last):
    File "train_caption.py", line 99, in <module>
      main()
    File "train_caption.py", line 90, in main
      trainer.train_model(train_loader, val_loader)
    File "/home/dfentaw/video_captioning/coot-videotext/mart/trainer_caption.py", line 393, in train_model
      th.save(self.ema.state_dict(), str(ema_file))
  AttributeError: 'NoneType' object has no attribute 'state_dict'

I am not sure if it is the right fix, but I made the following change in mart/trainer_caption.py and it is working fine for now.

  if self.cfg.ema_decay:
      th.save(self.ema.state_dict(), str(ema_file))
  else:
      th.save(self.model.state_dict(), str(ema_file))
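
Since the AttributeError above comes from the EMA object being None, an alternative is to test the object itself rather than the config value. The following is only a sketch under that assumption and is not taken from the repository: pick_state_to_save and FakeEMA are hypothetical names, and plain dicts stand in for real state dicts.

```python
class FakeEMA:
    """Hypothetical stand-in for an EMA object that exposes state_dict()."""

    def state_dict(self):
        return {"w": 0.5}


def pick_state_to_save(model_state, ema):
    """Choose which weights to write to disk.

    Returns (state, is_ema): the averaged weights if an EMA object
    exists, otherwise the regular model weights.
    """
    if ema is not None:
        return ema.state_dict(), True
    return model_state, False


# EMA disabled: fall back to the regular weights instead of crashing.
state, is_ema = pick_state_to_save({"w": 1.0}, None)

# EMA enabled: save the averaged weights.
ema_state, ema_used = pick_state_to_save({"w": 1.0}, FakeEMA())
```

Returning a flag alongside the state lets the caller decide whether to write an EMA file at all, instead of storing regular weights under an EMA filename.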

simon-ging / coot-videotext

Get error on Train and validate MART on COOT embeddings #23