wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Size of sample is invalid since max_positions=(1024, 1024) #14

Closed saichandrapandraju closed 3 years ago

saichandrapandraju commented 3 years ago

Hi @wasiahmad , I trained PLBART for JAVA -> PYTHON translation, but while testing I got the error below -

2021-07-21 05:31:11 | INFO | train | {"epoch": 30, "train_loss": "2.69", "train_nll_loss": "0.723", "train_ppl": "1.65", "train_wps": "7795.4", "train_ups": "0.35", "train_wpb": "22402", "train_bsz": "58.2", "train_num_updates": "240", "train_lr": "1.2e-05", "train_gnorm": "0.607", "train_train_wall": "5", "train_wall": "638"}
2021-07-21 05:31:11 | INFO | fairseq_cli.train | done training in 637.0 seconds
Traceback (most recent call last):
  File "/home/jovyan/.local/bin/fairseq-generate", line 8, in <module>
    sys.exit(cli_main())
  File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 379, in cli_main
    main(args)
  File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 41, in main
    return _main(args, sys.stdout)
  File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 132, in _main
    itr = task.get_batch_iterator(
  File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 227, in get_batch_iterator
    indices = self.filter_indices_by_size(
  File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 137, in filter_indices_by_size
    raise Exception(
Exception: Size of sample #81 is invalid (=(1024, 1045)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test

I don't understand what (1024, 1045) and (1024, 1024) mean. I'm using the default MAX_LEN of 510 for training and 9999 for testing, as below -

if [[ $SPLIT == 'test' ]]; then
        MAX_LEN=9999 # we do not truncate test sequences
    else
        MAX_LEN=510
    fi

Could you please suggest how to proceed?

wasiahmad commented 3 years ago

Which fairseq version are you using? I am not sure, but it seems like fairseq is complaining about the target length. My guess is that (=(1024, 1045)) means the source length is 1024 and the target length is 1045. Can you try truncating the target? As you can see below,

fairseq-generate $PATH_2_DATA/data-bin \
    --user-dir $USER_DIR \
    --path $model \
    --truncate-source \
    --task translation_without_lang_token \
    --gen-subset test \
    -t $TARGET -s $SOURCE \
    --sacrebleu \
    --remove-bpe 'sentencepiece' \
    --max-len-b 200 \
    --beam 5 \
    --batch-size 4 \
    --langs $langs > $FILE_PREF

We use --truncate-source, which handles lengthy inputs. But I am not sure because, during preprocessing, we truncate the source to fit within 512 tokens, so fairseq should never see a source of length 1024. Did you make any changes to our script?
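For what it's worth, the size check fairseq applies can be sketched roughly as follows (a simplified reimplementation for illustration, not the actual fairseq source): each sample's size is a (source_len, target_len) tuple, compared element-wise against max_positions = (max_source_positions, max_target_positions).

```python
def is_valid(size, max_positions):
    """Return True if every dimension of the sample fits within the
    corresponding max_positions limit (source and target checked separately)."""
    return all(s <= m for s, m in zip(size, max_positions))

# Sample #81 from the traceback: the source fits exactly, the target does not,
# which is why fairseq raises the "Size of sample is invalid" exception.
print(is_valid((1024, 1045), (1024, 1024)))  # False
print(is_valid((1024, 1024), (1024, 1024)))  # True
```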

saichandrapandraju commented 3 years ago

Hi @wasiahmad ,

Which fairseq version are you using?

I'm using 0.10.2

I think it has something to do with max_positions of BART. This suggests that a maximum of 1024 positions can be used for inference. For longer sequences (>1024), we would have to modify embed_positions and fine-tune to adjust the weights. I also saw the info below in the logs when I simply cloned and ran the PLBART scripts without any changes -

1) 2021-07-23 06:47:29 | INFO | fairseq_cli.train | Namespace(activation_fn='gelu', adam_betas='(0.9, 0.98)'........ max_epoch=30, max_source_positions=1024, max_target_positions=1024,........

and 2)

2021-07-23 06:47:32 | INFO | fairseq_cli.train | BARTModel(
  (encoder): TransformerEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(50005, 768, padding_idx=1)
    (embed_positions): LearnedPositionalEmbedding(1026, 768, padding_idx=1)

After these observations, the changes below worked for the 'test' split -

if [[ $SPLIT == 'test' ]]; then
        MAX_LEN=510

and

if [[ $SPLIT == 'test' ]]; then
        MAX_LEN=1022

So I think the max cannot be more than 1024 for 'test', but I'm not sure about this. Maybe you can share your thoughts and confirm.
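One plausible accounting for why MAX_LEN=1022 works, sketched below under two assumptions: that fairseq's learned positional embedding reserves padding_idx + 1 extra slots (so 1026 embeddings give 1024 usable positions), and that PLBART appends two special tokens per sequence (matching the 510-vs-512 gap used at training time):

```python
# From the log above: LearnedPositionalEmbedding(1026, 768, padding_idx=1)
num_embeddings = 1026
padding_idx = 1

# Assumption: fairseq reserves padding_idx + 1 offset slots,
# so the usable positions are num_embeddings - padding_idx - 1.
max_positions = num_embeddings - padding_idx - 1  # 1024

# Assumption: two special tokens (e.g. EOS and a language tag) are
# appended to every sequence, so the raw tokenized text must fit in:
max_raw_len = max_positions - 2  # 1022, matching the MAX_LEN that worked

print(max_positions, max_raw_len)
```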

wasiahmad commented 3 years ago

One thing we can do is pass --only-source during preprocessing, so that only the source file is processed. In that case, fairseq won't load target data during generation. We actually do not require processed test targets.
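A minimal sketch of such a preprocessing call, assuming the same environment variables as the generation command above ($SOURCE, $PATH_2_DATA) and a hypothetical $DICT_FILE pointing to the PLBART dictionary:

```bash
# Binarize only the source side of the test split; no target data is produced,
# so fairseq-generate cannot trip over target lengths at test time.
fairseq-preprocess \
    --only-source \
    --source-lang $SOURCE \
    --testpref $PATH_2_DATA/test.spm \
    --destdir $PATH_2_DATA/data-bin \
    --srcdict $DICT_FILE \
    --workers 4
```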

wasiahmad commented 3 years ago

Closing this, feel free to open it if it is not solved yet.