ratishsp / data2text-plan-py

Code for AAAI 2019 paper on Data-to-Text Generation with Content Selection and Planning

Train failed: inconsistent sequence length #29

Closed linkAmy closed 3 years ago

linkAmy commented 3 years ago

Hi, I am trying to run this work on my own dataset. After preprocessing the dataset, I got train.1.pt/valid.1.pt/vocab.pt. I noticed that I do not have train-roto-ptr.txt, so I followed issue #26 and commented out all the code related to supervision through pointers. Then I hit the same problem as in issue #28, so I reset TGT_VOCAB_SIZE, and finally I got this error message:

Traceback (most recent call last):
  File "train.py", line 454, in <module>
    main()
  File "train.py", line 446, in main
    train_model(model1, model2, fields, optim1, optim2, data_type, model_opt)
  File "train.py", line 256, in train_model
    train_stats, train_stats2 = trainer.train(train_iter, epoch, report_func)
  File "/data2text-plan-py-master/onmt/Trainer.py", line 164, in train
    for i, batch in enumerate(train_iter):
  File "train.py", line 137, in __iter__
    for batch in self.cur_iter:
  File "/anaconda2/lib/python2.7/site-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/anaconda2/lib/python2.7/site-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/data2text-plan-py-master/onmt/io/BoxField.py", line 134, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/data2text-plan-py-master/onmt/io/BoxField.py", line 253, in numericalize
    arr = self.tensor_type(arr)
RuntimeError: inconsistent sequence length at index (3, 20) - expected 21 but got 20

I am trying to figure it out; it probably has something to do with the TextDataset method get_fields, as it is related to BoxField:

@staticmethod
    def get_fields(n_src_features, n_tgt_features):
        """
        Args:
            n_src_features (int): the number of source features to
                create `torchtext.data.Field` for.
            n_tgt_features (int): the number of target features to
                create `torchtext.data.Field` for.

        Returns:
            A dictionary whose keys are strings and whose values
            are the corresponding Field objects.
        """
        fields = {}

        fields["src1"] = BoxField(
            sequential=False,
            init_token=BOS_WORD,
            eos_token=EOS_WORD,
            pad_token=PAD_WORD)
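        # ... (remaining field definitions omitted)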

But I have no idea how to fix it.
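For context, torch raises this kind of error whenever it has to build a tensor from rows of unequal length. A minimal standalone reproduction (independent of the repo; the exact exception type and message vary across torch versions):

import torch

# The last row is one element short, so the nested list is ragged and
# tensor construction fails.
rows = [[1, 2, 3], [4, 5, 6], [7, 8]]
try:
    torch.LongTensor(rows)
except (RuntimeError, ValueError) as e:
    print(e)  # e.g. "inconsistent sequence length ..."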

ratishsp commented 3 years ago

Hi @linkAmy, I am not sure why you get this error. I assume you have used the correct torchtext version (torchtext==0.2.3). Does printing the values of arr in https://github.com/ratishsp/data2text-plan-py/blob/4b7453530f570aefe036292f1219bbcf8851ad9f/onmt/io/BoxField.py#L251 help in debugging the issue?
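Something like the following, inserted just above that line (assuming arr is a list of padded token-id lists at that point), might show which row has the unexpected length:

# Hypothetical debugging aid for BoxField.numericalize: flag any row
# whose length differs from the first row's.
expected = len(arr[0])
for i, seq in enumerate(arr):
    if len(seq) != expected:
        print('row %d has length %d, expected %d' % (i, len(seq), expected))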

linkAmy commented 3 years ago

Thanks a lot for your nice reply, @ratishsp! This error most likely stems from my dataset, as the code works well with the RotoWire dataset. I will double-check my dataset. Thanks again!

wingedRuslan commented 3 years ago

Hey @linkAmy,

Have you been able to eventually run the code on your own dataset?

I have the same goal as you, and since you referenced my two issues, I suppose you have the same problems as I do in running the code on a dataset other than RotoWire.

TongLi3701 commented 3 years ago

Hi @wingedRuslan ,

I have successfully run the code on my dataset. The reason is that the input length must be the same for both training and testing, so you probably need to add some padding to your dataset.
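As a rough sketch of what I mean (pad_records and PAD_WORD here are illustrative names, not from the repo), pad every example to the length of the longest one before preprocessing:

PAD_WORD = '<blank>'  # assumed pad token; match whatever your fields use

def pad_records(records):
    # Pad each token list to the length of the longest record.
    max_len = max(len(r) for r in records)
    return [r + [PAD_WORD] * (max_len - len(r)) for r in records]

print(pad_records([['a', 'b', 'c'], ['d', 'e']]))
# [['a', 'b', 'c'], ['d', 'e', '<blank>']]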

wingedRuslan commented 3 years ago

Hi @TongLi3701,

that's great news! Thanks a lot for posting it here :+1:

happycjksh commented 3 years ago

Hi @wingedRuslan, could you tell me how you solved the problem? I have the same issue.

happycjksh commented 3 years ago

Hello @TongLi3701, could you tell me how to pad the dataset so that the lengths are the same?

TongLi3701 commented 3 years ago

Hello @TongLi3701, could you tell me how to pad the dataset so that the lengths are the same?

Hi, it will depend on your tokenizer and word embeddings. If the library you use has a parameter such as "zero_padding", you can turn it on; otherwise, you can add 0 as the padding value yourself, since most word embeddings reserve 0 for padding.
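For instance, in PyTorch (just a sketch, assuming your ids are already zero-padded), padding_idx=0 keeps the pad vector at zero and excludes it from gradient updates:

import torch
import torch.nn as nn

# padding_idx=0 reserves index 0 for padding: its embedding stays zero
# and receives no gradient updates.
emb = nn.Embedding(num_embeddings=1000, embedding_dim=16, padding_idx=0)
batch = torch.tensor([[5, 9, 3], [7, 2, 0]])  # second row zero-padded
print(emb(batch).shape)  # torch.Size([2, 3, 16])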

happycjksh commented 3 years ago

Hi, thanks for your response. My main problem is that I don't know where I should add the padding code. I tried adding it at train.py, line 137, but that failed. I'm very confused about the problem; I haven't changed any code in the file. Can you help me? Thank you a lot.

TongLi3701 commented 3 years ago

I think you should not change any code in train.py. You will need to rewrite the preprocess.py file depending on your own dataset and control the padding there; see the sketch below.

If your preprocessing part works correctly, you will get some files such as train.1.pt/valid.1.pt/vocab.pt.
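For example, one way to do it is a small pre-padding pass over the source file before running preprocess.py (a sketch; the file names are placeholders, not the repo's actual inputs):

PAD_WORD = '<blank>'  # assumed pad token

# Read the raw source lines, pad every record sequence to the longest
# length, and write the padded file that preprocess.py will consume.
with open('src_train.txt') as fin:
    lines = [line.split() for line in fin]
max_len = max(len(toks) for toks in lines)
with open('src_train.padded.txt', 'w') as fout:
    for toks in lines:
        fout.write(' '.join(toks + [PAD_WORD] * (max_len - len(toks))) + '\n')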

happycjksh commented 3 years ago

I've already gotten train.1.pt/valid.1.pt/vocab.pt without rewriting preprocess.py. Do you mean that I need to rewrite preprocess.py and generate train.1.pt/valid.1.pt/vocab.pt again?

TongLi3701 commented 3 years ago

Yes, you will need to add some padding in the preprocessing code. If you are using the author's dataset, you should not need to change anything, since the data sizes are all the same; but if you are using your own dataset, you will need to modify some code to handle the sizes.

happycjksh commented 3 years ago

Thank you a lot. I'm going to try.

TongLi3701 commented 3 years ago

No worries. Good luck