seq-to-mind / DMRST_Parser

An implementation of the paper "DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing".

Dimension out of range at pred = torch.argmax(outputs, dim=1) #1

Closed. diegopaucarv closed this issue 2 years ago.

diegopaucarv commented 2 years ago

Hi,

Congrats on the great work. I've been using your trained model to classify arguments in over 800 news articles in both Spanish and Portuguese. I modified the original MUL_main_Infer.py code to match my needs (the only file I modified): it now loads a group of texts from a directory one by one, splits them into sentences, predicts the labels for each sentence, and finally saves the output as a single CSV file.
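Roughly, the driver loop I added looks like this (a simplified sketch, not my exact code; the `inference` call and its arguments follow MUL_main_Infer.py, the directory name and the splitter are just illustrative):

```python
import os
import csv
import nltk

rows = []
for filename in sorted(os.listdir("texts")):          # illustrative directory name
    if not filename.endswith(".txt"):
        continue
    with open(os.path.join("texts", filename), encoding="utf-8") as f:
        sentences = nltk.sent_tokenize(f.read())      # split each text into sentences

    # same call as in MUL_main_Infer.py
    input_sentences, all_segmentation_pred, all_tree_parsing_pred = inference(
        model, bert_tokenizer, sentences, batch_size)

    for sent, parse in zip(input_sentences, all_tree_parsing_pred):
        rows.append([filename, " ".join(sent), parse])

# collect everything into a single CSV file
with open("predictions.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```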

I am now predicting for 800 txt files. It works very well with only the first two files in the directory, but the error appears as soon as I add more files, and always when I add all of them. According to the traceback, the error comes from the pred = torch.argmax(outputs, dim=1) line in module.py.

This is the error in question:

  File "C:\Users\diego\Escritorio\DMRST\MUL_main_Infer.py", line 114, in <module>
    input_sentences, all_segmentation_pred, all_tree_parsing_pred = inference(model, bert_tokenizer, Test_InputSentences, batch_size)
  File "C:\Users\diego\Escritorio\DMRST\MUL_main_Infer.py", line 63, in inference
    _, _, SPAN_batch, _, predict_EDU_breaks = model.TestingLoss(input_sen_batch, input_EDU_breaks=None, LabelIndex=None,
  File "C:\Users\diego\Escritorio\DMRST\model_depth.py", line 184, in TestingLoss
    EncoderOutputs, Last_Hiddenstates, _, predict_edu_breaks = self.encoder(input_sentence, input_EDU_breaks, is_test=use_pred_segmentation)
  File "C:\Users\diego\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\diego\Escritorio\DMRST\module.py", line 81, in forward
    predict_edu_breaks = self.segmenter.test_segment_loss(embeddings.squeeze())
  File "C:\Users\diego\Escritorio\DMRST\module.py", line 355, in test_segment_loss
    pred = torch.argmax(outputs, dim=1).detach().cpu().numpy().tolist()
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Can you please help me solve this issue? I would be extremely thankful.

seq-to-mind commented 2 years ago

This error might be caused by an invalid input sequence length, such as a length of 0.
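For reference, the exception itself just says that `outputs` ended up 1-D instead of (tokens, classes), which is what you get when the span fed to the segmenter collapses (e.g. after `.squeeze()` on a degenerate or empty span). A minimal repro, not taken from the repo and with illustrative shapes:

```python
import torch

# A 2-D (sequence_length, num_classes) tensor is what argmax over dim=1 expects:
outputs = torch.randn(5, 2)
print(torch.argmax(outputs, dim=1))   # works: one predicted label per token

# But if the span collapses to a 1-D tensor, there is no dim=1 to reduce over:
outputs = torch.randn(3)
torch.argmax(outputs, dim=1)
# IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```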

Since you do some preprocessing before feeding the sequences (loading a group of texts from a directory one by one and splitting them into sentences), it would be good to check the length of every text span after your sentence splitting.

A general way to narrow this down is to run the model on your files one by one and inspect the specific file that raises the exception.
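Something along these lines should cover both points, i.e. dropping empty or whitespace-only spans after splitting and identifying which file fails (a sketch only; `sentence_split` is a placeholder for whatever splitter you are using, and the `inference` call follows MUL_main_Infer.py):

```python
import os
import traceback

for filename in sorted(os.listdir("texts")):
    with open(os.path.join("texts", filename), encoding="utf-8") as f:
        spans = sentence_split(f.read())   # placeholder for your own splitter

    # Drop empty / whitespace-only spans, which would reach the
    # segmenter as zero-length sequences.
    spans = [s.strip() for s in spans if s.strip()]
    if not spans:
        print(f"{filename}: nothing left after splitting, skipping")
        continue

    try:
        inference(model, bert_tokenizer, spans, batch_size)
    except IndexError:
        print(f"{filename}: triggered the argmax error")
        traceback.print_exc()
```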

diegopaucarv commented 2 years ago

Hey! Thank you for the fast response. After deleting some strange characters in specific files and debugging file by file, the problem seems to lie in the presence of "...", "¡¿", "(. . . )" and the " 1. ", " 2. " patterns. I don't really know why. (EDIT: it has something to do with the dots.)

I've also seen that the tokenization process sometimes splits words in half. I don't know whether that is because of lemmas/roots or because of "weird" characters (such as the Spanish ó, á, í and ñ). Would you advise me to do some data preprocessing first? Maybe I'm trusting the RoBERTa model too much.

seq-to-mind commented 2 years ago

For sentence-level segmentation, rule-based tools such as nltk sometimes do not work very well (especially in the presence of "...", "¡¿", "(. . . )" and the " 1. ", " 2. " patterns). One option is to try a model-based tool such as spaCy; it will be slower, if you don't mind the speed.
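For example, with a small Spanish pipeline (a sketch; you would need to run `python -m spacy download es_core_news_sm` first, and use the Portuguese pipeline pt_core_news_sm for those files):

```python
import spacy

# Statistical Spanish pipeline; its parser provides sentence boundaries.
nlp = spacy.load("es_core_news_sm")

text = "El informe concluyó lo siguiente... 1. Primero. 2. Segundo. ¡¿De verdad?!"
sentences = [sent.text.strip() for sent in nlp(text).sents]
sentences = [s for s in sentences if s]   # still drop anything empty before inference
print(sentences)
```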

If you want to know more about the tokenization done by the RoBERTa model, you can refer to the docs on sub-word tokenization methods (e.g., BPE, WordPiece, and SentencePiece).
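As a quick illustration (assuming an XLM-RoBERTa-style checkpoint here; substitute whatever bert_tokenizer is loaded from in MUL_main_Infer.py), you can print the sub-word pieces directly:

```python
from transformers import AutoTokenizer

# xlm-roberta-base is only an example checkpoint for this illustration.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

print(tokenizer.tokenize("información"))
print(tokenizer.tokenize("¡¿De verdad?!"))
# Words that are not in the vocabulary as a whole get split into
# sub-word pieces; this is expected behaviour, not a preprocessing bug.
```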