IndexError while recreating the MustC experiment (tgt_sizes has shape (0,))

balag59 commented 4 years ago

Hi, I'm trying to recreate the EN-IT experiment on the MustC corpus and ran into this issue while training:

    Traceback (most recent call last):
      File "train.py", line 367, in <module>
        main(args)
      File "train.py", line 73, in main
        shard_id=args.distributed_rank,
      File "FBK-Fairseq-ST/fairseq/tasks/fairseq_task.py", line 96, in get_batch_iterator
        indices = dataset.ordered_indices()
      File "FBK-Fairseq-ST/fairseq/data/language_pair_dataset.py", line 250, in ordered_indices
        indices = indices[np.argsort(self.tgt_sizes[indices], kind='mergesort')]
    IndexError: index 216490 is out of bounds for axis 0 with size 0

It seems that the tgt_sizes np array has shape (0,), i.e. it is empty, which is what causes the error. Could you please guide me on resolving this issue? Thanks!
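For anyone hitting the same error, here is a minimal sketch of how an empty sizes array produces it. The helper name `order_by_target_length` is made up for illustration; only the indexing expression is taken from the traceback, and the real `ordered_indices` in the repository may do more than this.

```python
import numpy as np


def order_by_target_length(tgt_sizes, num_examples):
    """Hypothetical stand-in for the failing line in ordered_indices()."""
    indices = np.arange(num_examples)
    # Same expression as in the traceback: look up target sizes by example index.
    return indices[np.argsort(tgt_sizes[indices], kind='mergesort')]


# Normal case: one target length per example, sorted shortest first.
ok_order = order_by_target_length(np.array([5, 2, 9]), 3)

# Failing case: the target side loaded no sentences, so tgt_sizes is empty,
# while the source side still reports 3 examples.
error_message = None
try:
    order_by_target_length(np.array([], dtype=np.int64), 3)
except IndexError as err:
    error_message = str(err)

print(ok_order)
print(error_message)
```

So the error usually means the source and target sides disagree on the number of examples, rather than pointing to a problem in the sorting itself.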

balag59 commented 4 years ago

Update: I've looked into the .it files and they are empty, which would explain the error above. I'm not sure why they are empty, but I'll try the tokenization again.
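A quick way to catch this before training is to scan the preprocessed text files for empty ones. This is only a sketch: the function name and the directory in the usage comment are assumptions, not part of the repository, and it assumes the tokenized files are UTF-8 text.

```python
from pathlib import Path


def find_empty_text_files(directory, suffix):
    """Return files ending in `suffix` directly under `directory`
    that contain no non-whitespace text."""
    empty = []
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        with path.open(encoding="utf-8") as handle:
            # A file counts as empty if every line is blank.
            if not any(line.strip() for line in handle):
                empty.append(path)
    return empty


# Example usage (the directory name is hypothetical):
# for path in find_empty_text_files("data/tokenized", ".it"):
#     print(f"empty target file: {path}")
```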

balag59 commented 4 years ago

It was a mistake in the tokenization so everything is fine now.

mattiadg / FBK-Fairseq-ST

IndexError while recreating the MustC experiment (tgt_sizes has shape (0,)) #7