Closed: linkAmy closed this issue 3 years ago
Hi @linkAmy, I am not sure why you get this error. I assume you have used the correct torchtext version (torchtext==0.2.3). Does printing the values of arr in https://github.com/ratishsp/data2text-plan-py/blob/4b7453530f570aefe036292f1219bbcf8851ad9f/onmt/io/BoxField.py#L251 help in debugging the issue?
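For what it's worth, a quick way to act on that suggestion is a temporary debug helper dropped next to the linked line. This is only a sketch: `summarize_arr` is a hypothetical name, and the exact shape of `arr` depends on what `BoxField` receives at that point.

```python
def summarize_arr(arr):
    """Print basic stats about a batch of token lists before numericalization."""
    lengths = [len(row) for row in arr]
    stats = {"rows": len(arr), "min_len": min(lengths), "max_len": max(lengths)}
    print(stats)                          # uneven min/max lengths hint at a padding problem
    print("first row sample:", arr[0][:10])
    return stats

summarize_arr([["Jordan", "25", "PTS"], ["Pippen", "12"]])
```

If `min_len` and `max_len` differ across a batch, that mismatch is a likely cause of the error.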
Thanks a lot for your nice reply, @ratishsp! This error is most likely due to my dataset, as the code works well with the RotoWire dataset. I will double-check my dataset. Thanks again!
Hey @linkAmy,
Have you been able to eventually run the code on your own dataset?
I have the same goal as you, and since you referenced my two issues, I suppose you have the same problems I had in running the code on a dataset other than RotoWire.
Hi @wingedRuslan ,
I have successfully run the code on my dataset. The reason is that the length of the input should be the same for both training and testing, so you probably need to add some padding to your dataset.
Hi @TongLi3701,
that's great news! Thanks a lot for posting it here :+1:
Hi @wingedRuslan , could you tell me how to solve the problem? I also have the same problem.
Hello @TongLi3701, could you tell me how to pad the dataset such that the length can be the same?
Hi, it will depend on your tokenizer and word embedding. If the library you use has a parameter such as "zero_padding", you can turn it on. Alternatively, you can add "0" as padding yourself; most word embeddings reserve "0" as the padding value.
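To illustrate the second option, here is a minimal sketch of padding token-id sequences with 0 so that every row in a batch has the same length (`pad_sequences` is a made-up helper, not a function from this repo or from torchtext):

```python
def pad_sequences(sequences, pad_id=0):
    """Pad every token-id sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[4, 9, 7], [12, 5], [3]]
padded = pad_sequences(batch)
# every row now has length 3: [[4, 9, 7], [12, 5, 0], [3, 0, 0]]
```

The same idea works at the string level with a dedicated pad token instead of the id 0.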
Hi, thanks for your response. My main problem is that I don't know where I should add the padding code. I tried adding it at line 137 of train.py, but it failed. I'm very confused about the problem; I haven't changed any other code in the file. Can you help me? Thank you a lot.
I think you should not be changing code in train.py. Instead, you will need to rewrite preprocess.py to suit your own dataset and handle the padding there. If the preprocessing part works correctly, you will get files such as train.1.pt/valid.1.pt/vocab.pt.
I've already gotten train.1.pt/valid.1.pt/vocab.pt without rewriting preprocess.py. You mean that I need to rewrite preprocess.py and generate train.1.pt/valid.1.pt/vocab.pt again?
Yes, you will need to add some padding in the pre-processing code. If you are using the author's dataset, you will not need to change anything, because the size of the data is all the same; but if you are using your own dataset, you will need to modify some code to handle the size.
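As a rough sketch of what that pre-processing change could look like: pad every record to the same number of fields before the data is written out to the .pt files. The `PAD_TOKEN` value and the record format here are assumptions for illustration, not the repo's actual code.

```python
PAD_TOKEN = "<blank>"  # assumed pad symbol; the repo may use a different one

def pad_records(records, target_len):
    """Pad each record (a list of field strings) to exactly target_len fields."""
    padded = []
    for rec in records:
        if len(rec) > target_len:
            raise ValueError("record longer than target_len; increase target_len")
        padded.append(rec + [PAD_TOKEN] * (target_len - len(rec)))
    return padded

rows = [["LeBron", "30", "PTS"], ["Curry", "28"]]
uniform = pad_records(rows, target_len=3)
# every record now has 3 fields, so training and inference see the same input size
```

The key point is that `target_len` must be fixed across the whole dataset, so the same value is used for the train, valid, and test splits.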
Thank you a lot. I'm going to try.
No worries. Good luck
Hi, I am trying to run this work on my own dataset. After preprocessing the dataset, I got train.1.pt/valid.1.pt/vocab.pt. I noticed that I do not have train-roto-ptr.txt, so I followed issue #26 and commented out all the related code that used supervision through pointers. Then I ran into the same problem as in issue #28, so I reset TGT_VOCAB_SIZE, and finally I got this error message:
I am trying to figure it out; it probably has something to do with the TextDataset method get_fields, as it is related to BoxField. But I have no idea how to fix it.