mt-upc / iwslt-2021

Systems submitted to IWSLT 2021 by the MT-UPC group.
MIT License

RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 1, 10], but got 4-dimensional input of size [1, 1, 72, 1011] instead #2

Open laleye opened 2 years ago

laleye commented 2 years ago

I'm trying to reuse your interesting code for speech translation on my own data. I get the following size error with the lna_ed configuration:

Traceback (most recent call last):                                                                    
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq_cli/train.py", line 169, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq_cli/train.py", line 279, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/trainer.py", line 694, in train_step
    raise e
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/trainer.py", line 662, in train_step
    loss, sample_size_i, logging_output = self.task.train_step(
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/tasks/fairseq_task.py", line 475, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
    net_output = model(**sample["net_input"])
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/frejus/Projects/tafsiri-st/iwslt-2021/fairseq_modules/models/wav2vec_s2t.py", line 150, in forward
    encoder_out = self.encoder(
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/frejus/Projects/tafsiri-st/iwslt-2021/fairseq_modules/models/wav2vec_s2t.py", line 218, in forward
    encoder_out = super().forward(
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/models/wav2vec/wav2vec2_asr.py", line 372, in forward
    x, padding_mask = self.w2v_model.extract_features(**w2v_args)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/models/wav2vec/wav2vec2.py", line 631, in extract_features
    res = self.forward(source, padding_mask, mask=mask, features_only=True)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/models/wav2vec/wav2vec2.py", line 486, in forward
    features = self.feature_extractor(source)
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "lib/python3.8/site-packages/fairseq-1.0.0a0+88dba0a-py3.8-linux-x86_64.egg/fairseq/models/wav2vec/wav2vec2.py", line 741, in forward
    x = conv(x)
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "lib/python3.8/site-packages/torch/nn/modules/conv.py", line 301, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "lib/python3.8/site-packages/torch/nn/modules/conv.py", line 297, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 1, 10], but got 4-dimensional input of size [1, 1, 72, 1011] instead

Do you know what I'm doing wrong?
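
For reference, the failing call is torch's F.conv1d inside the wav2vec 2.0 feature extractor, whose first layer has a 3-dimensional weight [512, 1, 10] and therefore expects a raw waveform batch of shape (batch, 1, samples). Here is a minimal sketch that reproduces the shape mismatch; the tensor sizes come from the traceback above, everything else is illustrative:

import torch
import torch.nn.functional as F

# First conv layer of the wav2vec 2.0 feature extractor:
# weight [out_channels=512, in_channels=1, kernel_size=10].
weight = torch.randn(512, 1, 10)

# Raw waveform batch (batch, channels=1, samples): this is what conv1d expects.
waveform = torch.randn(1, 1, 1011)
print(F.conv1d(waveform, weight, stride=5).shape)

# A 4-dimensional input, e.g. pre-computed 2-D features with an extra
# channel dimension, triggers the RuntimeError shown in the traceback.
features = torch.randn(1, 1, 72, 1011)
try:
    F.conv1d(features, weight, stride=5)
except RuntimeError as e:
    print(e)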

johntsi commented 2 years ago

Hi, maybe your data are not in the correct format?

The input to the model has to be single-channel and sampled at 16kHz. You can convert them with the following command:

ls ${path_to_wavs}/*.* | parallel -j 4 ffmpeg -i {} -ac 1 -ar 16000 -hide_banner -loglevel error {.}.wav
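After converting, you can also double-check the files programmatically. A small sketch using the soundfile library (the path is a placeholder):

import soundfile as sf

info = sf.info("/path/to/clip.wav")      # placeholder path
print(info.samplerate, info.channels)    # expected: 16000 1

wav, sr = sf.read("/path/to/clip.wav")
print(wav.ndim)                          # mono waveform -> 1 dimension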

laleye commented 2 years ago

Thanks for your reply. All of the data was already in this format, but I converted it again anyway, without success. I still get the same error.

johntsi commented 2 years ago

Could you maybe try with a standard dataset like MuST-C, to see whether the problem is in your data?

laleye commented 2 years ago

@johntsi I will try it and let you know.