Closed diegopaucarv closed 2 years ago
This error might be caused by an invalid length of the input sequence, such as 0.
Since you have added some steps before feeding the sequence to the model (i.e., loading a group of texts from a directory one by one and splitting them into sentences), it would be good to check the length of every text span after your sentence splitting.
A general way to narrow this down is to run on your files one by one and separately check the file that causes the exception.
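A minimal sketch of that length check (the helper name is hypothetical, and it assumes the sentences arrive as a plain Python list of strings):

```python
# Hypothetical helper: drop empty or whitespace-only spans after sentence
# splitting, and report their position so the offending file is easy to find.
def clean_sentences(sentences):
    kept = []
    for i, s in enumerate(sentences):
        stripped = s.strip()
        if not stripped:  # a zero-length span is what typically crashes the model
            print(f"dropping empty span at index {i}")
            continue
        kept.append(stripped)
    return kept

print(clean_sentences(["Primera frase.", "   ", "", "Segunda frase."]))
```

Running this once per file before inference should point you directly at the inputs that produce zero-length sequences.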
Hey! Thank you for the fast response. Yeah, after deleting some strange characters in specific files and trying file-by-file debugging, the problem seems to lie in the presence of the "...", "¡¿", "(. . . )", " 1. ", and " 2. " characters. I don't really know why. (EDIT: It has something to do with the dots.)
I've seen that the tokenization process sometimes tokenizes half words. I don't know if that's because of the lemmas/roots, or because they use "weird" characters (such as the Spanish ó, á, í, and ñ). Would you advise me to do some data preprocessing first? Maybe I'm trusting the RoBERTa model too much.
For sentence-level segmentation, rule-based tools such as NLTK sometimes don't work very well (especially in the presence of the "...", "¡¿", "(. . . )", " 1. ", and " 2. " characters). One option is to try a model-based tool such as spaCy, though it will be slower (if you don't mind the speed).
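To see why those characters break rule-based splitting, here is a hypothetical illustration (not NLTK's actual implementation): a naive splitter that cuts after sentence-final punctuation turns enumeration markers like " 2. " into tiny standalone spans, which can then become empty after further cleaning.

```python
import re

# Naive rule-based splitter for illustration only: split after '.', '!'
# or '?' followed by whitespace, roughly what simple rules do.
def naive_split(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

text = "El informe dice: 1. Primero. 2. Segundo... ¿Está claro?"
for span in naive_split(text):
    print(repr(span))
```

Note how "2." comes out as its own near-empty span; a model-based segmenter is generally better at keeping enumerations and ellipses attached to their sentences.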
If you want to know more about the tokenization performed by the RoBERTa model, you can refer to the docs on sub-word tokenization methods (e.g., BPE, WordPiece, and SentencePiece).
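The "half words" you observed are expected behavior for sub-word tokenizers. A toy greedy longest-match sketch (WordPiece-style; the vocabulary below is invented for illustration and is NOT RoBERTa's real vocabulary) shows how a word outside the vocabulary gets split into pieces:

```python
# Invented toy vocabulary; "##" marks a word-internal continuation piece.
VOCAB = {"cami", "##ón", "camin", "##ar", "niñ", "##o", "##a"}

def subword_tokenize(word):
    """Greedily match the longest vocabulary piece from the left."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # no vocabulary piece matches
    return pieces

print(subword_tokenize("camión"))   # split into two pieces
print(subword_tokenize("caminar"))  # split at a different boundary
```

Accented characters like ó or ñ are rarer in training data, so real tokenizers often break such words at those characters, which is the "half word" effect rather than anything lemma-related.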
Hi,
Congrats on the great work. I've been using your trained model to classify arguments in over 800 news articles in both Spanish and Portuguese. I modified the original MUL_main_Infer.py code to match my needs (it's the only file I modified): it now loads a group of texts from a directory one by one, splits them into sentences, predicts the labels for each sentence, and finally saves the output as a single csv file.
I am now predicting for 800 txt files. It works very well with just the first 2 files in the directory; however, the error comes up whenever I add all the files. According to the debugging output, the error occurs at the pred = torch.argmax(outputs, dim=1) line in module.py.
This is the error in question:
Can you please help me solve this issue? I would be extremely thankful.