mlfoundations / open_clip

An open source implementation of CLIP.

CoCa RoBERTa Attention Map Size Issue #864

Closed sandeepmukh closed 1 week ago

sandeepmukh commented 2 months ago

Hi! I'm trying to train CoCa using the pretrained RoBERTa weights (has the causal masking issue #445 been addressed?), but I am running into an error with the attention mask sizes. Any help would be greatly appreciated :).

Below is the command I'm running:

torchrun --nproc_per_node 4 -m training.main \
         --train-data="$COYO_PATH/train" \
         --train-num-samples 3000000 \
         --val-data="$COYO_PATH/val" \
         --val-num-samples 10000 \
         --dataset-type webdataset \
         --batch-size 128 \
         --warmup 2000 \
         --epochs 100 \
         --lr 5e-4 \
         --precision amp \
         --workers 6 \
         --model "coca_roberta-ViT-B-32" \
         --name "coca_coyo" \
         --report-to "wandb" \
         --wandb-project-name "open-clip-baseline" \
         --imagenet-val "$IMAGENET_HOME/validation" \
         --gather-with-grad \
         --local-loss

However, this fails with the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "src/training/main.py", line 508, in <module>
    main(sys.argv[1:])
  File "src/training/main.py", line 436, in main
    train_one_epoch(model, data, loss, epoch, optimizer, scaler, scheduler, dist_model, args, tb_writer=writer)
  File "src/training/train.py", line 101, in train_one_epoch
    model_out = model(images, texts)
 ... (omitted for brevity)
  File ".venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1241, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File ".venv/lib/python3.10/site-packages/torch/nn/functional.py", line 5354, in multi_head_attention_forward
    raise RuntimeError(f"The shape of the 2D attn_mask is {attn_mask.shape}, but should be {correct_2d_size}.")
RuntimeError: The shape of the 2D attn_mask is torch.Size([76, 76]), but should be (77, 77).

Inspecting the error, I tried changing the multimodal context length to 77, which yields the following error instead:

../aten/src/ATen/native/cuda/NLLLoss2d.cu:104: nll_loss2d_forward_kernel: block: [38,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
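
For what it's worth, the failure also reproduces outside the training loop with a single forward pass. A minimal sketch (assuming the standard open_clip factory helpers; not verified against this exact setup):

import torch
import open_clip

# Build the model and its matching tokenizer, then run one forward pass;
# the attn_mask shape mismatch comes from the multimodal text decoder.
model, _, _ = open_clip.create_model_and_transforms("coca_roberta-ViT-B-32")
tokenizer = open_clip.get_tokenizer("coca_roberta-ViT-B-32")

images = torch.randn(2, 3, 224, 224)
texts = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    out = model(images, texts)  # RuntimeError: attn_mask is [76, 76] but should be (77, 77)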
rwightman commented 2 months ago

@sandeepmukh I think a few things are wrong here... first, update to the main branch.

Then, I think something like this is needed in the CoCa model to replace the current vocab_size logic between the text and multimodal text towers:

        # HF text towers (e.g. RoBERTa) expose their own vocab_size, which can
        # differ from the default in text_cfg
        if getattr(text_cfg, "hf_model_name", None) is not None:
            vocab_size = getattr(self.text, "vocab_size", text_cfg.vocab_size)
        else:
            vocab_size = text_cfg.vocab_size
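
The second error (the nll_loss assert) is consistent with that vocab_size mixup: roberta-base's tokenizer has a vocab of 50265 ids, while text_cfg.vocab_size falls back to the CLIP BPE size of 49408, so the caption loss can see targets outside [0, n_classes). A standalone toy illustration of that failure mode (not repo code):

import torch
import torch.nn.functional as F

config_vocab_size = 49408    # CLIP BPE vocab size, the text_cfg default
roberta_vocab_size = 50265   # roberta-base tokenizer vocab size

logits = torch.randn(4, config_vocab_size)             # caption head built too small
targets = torch.randint(0, roberta_vocab_size, (4,))   # RoBERTa token ids can exceed 49407

# On CUDA this shows up as the device-side `t >= 0 && t < n_classes` assert;
# on CPU it raises an out-of-bounds target error instead.
loss = F.cross_entropy(logits, targets)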

Also, the context_len used by the tokenizer is sourced from text_cfg by default, so text_cfg and multimodal_cfg should have the same context_len values in the config (I think) for this to work best, but I'm not 100% sure there.
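
One quick way to check whether those lengths line up for a given config (a sketch assuming get_model_config and get_tokenizer as on current main):

import open_clip

model_name = "coca_roberta-ViT-B-32"
cfg = open_clip.get_model_config(model_name)
tokenizer = open_clip.get_tokenizer(model_name)

tokens = tokenizer(["a photo of a cat"])
print("tokenized length:", tokens.shape[-1])                                       # 77, the HF tokenizer default here
print("multimodal context_length:", cfg["multimodal_cfg"].get("context_length"))   # 76 in this config
# If the two differ, the decoder's causal attn_mask won't match the token
# sequence, which is exactly the 76 vs 77 shape error above.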