Train failed: inconsistent sequence length

linkAmy commented 3 years ago

Hi, I am trying to run this work on my own dataset. After preprocessing the dataset, I got train.1.pt/valid.1.pt/vocab.pt. I notice that I do not have the train-roto-ptr.txt so I followed the isssue#26 and commented out all the related codes that used the supervision through pointers. And then, I met the same problem in this issue#28. So I reset the TGT_VOCAB_SIZE and finally, I got this error message:

Traceback (most recent call last): File "train.py", line 454, in main() File "train.py", line 446, in main train_model(model1, model2, fields, optim1, optim2, data_type, model_opt) File "train.py", line 256, in train_model train_stats, train_stats2 = trainer.train(train_iter, epoch, report_func) File "/data2text-plan-py-master/onmt/Trainer.py", line 164, in train for i, batch in enumerate(train_iter): File "train.py", line 137, in iter for batch in self.cur_iter: File "/anaconda2/lib/python2.7/site-packages/torchtext/data/iterator.py", line 151, in iter self.train) File "/anaconda2/lib/python2.7/site-packages/torchtext/data/batch.py", line 27, in init setattr(self, name, field.process(batch, device=device, train=train)) File "/data2text-plan-py-master/onmt/io/BoxField.py", line 134, in process tensor = self.numericalize(padded, device=device, train=train) File "/data2text-plan-py-master/onmt/io/BoxField.py", line 253, in numericalize arr = self.tensor_type(arr) RuntimeError: inconsistent sequence length at index (3, 20) - expected 21 but got 20

I am trying to figure it out, probably it has something to do with the TextDataset method get_fields as it is related to BoxField:

@staticmethod
    def get_fields(n_src_features, n_tgt_features):
        """
        Args:
            n_src_features (int): the number of source features to
                create `torchtext.data.Field` for.
            n_tgt_features (int): the number of target features to
                create `torchtext.data.Field` for.

        Returns:
            A dictionary whose keys are strings and whose values
            are the corresponding Field objects.
        """
        fields = {}

        fields["src1"] = BoxField(
            sequential=False,
            init_token=BOS_WORD,
            eos_token=EOS_WORD,
            pad_token=PAD_WORD)

But I have no idea how to fix it.

ratishsp commented 3 years ago

Hi @linkAmy I am not sure why you get this error. I assume you have used the correct torchtext version (torchtext==0.2.3) Does printing the values of arr in https://github.com/ratishsp/data2text-plan-py/blob/4b7453530f570aefe036292f1219bbcf8851ad9f/onmt/io/BoxField.py#L251 help in debugging the issue?

linkAmy commented 3 years ago

Thanks a lot for your nice reply! @ratishsp This error seems most likely due to my dataset as the code works well with rotowire dataset. I will double check my dataset. Thanks again!

wingedRuslan commented 3 years ago

Hey @linkAmy,

Have you been able to eventually run the code on your own dataset?

I have the same goal as you, and since you referenced my two issues, I suppose you have same problems as I have in running the code on the dataset other than RotoWire.

TongLi3701 commented 3 years ago

Hi @wingedRuslan ,

I have successfully run the code on my dataset. The reason is because the length of the input should be the same for both training and testing, so probably you need to add some padding for your dataset.

wingedRuslan commented 3 years ago

Hi @TongLi3701,

that's great news! Thanks a lot for posting it here :+1: