patil-suraj / question_generation

Neural question generation using transformers
MIT License

Error at question generation pipeline - non matching answers with original text #11

Open Krak91 opened 4 years ago

Krak91 commented 4 years ago

My use case is testing this using a few sentences from a few random documents of the newsgroups dataset in each call. In many cases I run into this error:

...site-packages\question_generation\pipelines.py", line 142, in _prepare_inputs_for_qg_from_answers_hl
    ans_start_idx = sent.index(answer_text)
ValueError: substring not found

It's actually line 140; I've added a couple of print statements to check what the case is.

What I can see happening there is that the function is checking whether the 'answer' (which was extracted FROM the input sentence in QGPipeline._extract_answers()) actually EXISTS in the input sentence. I believe this check shouldn't even be necessary, since the answer should be in the sentence. The real problem, then, is why the answers sometimes don't match the original text they were extracted from. Also, isn't there a better way to handle this than raising an error?

Examples of these values just before line 140 of pipelines.py runs:

  1. Sent: The doctors are not sure exactly why the pain is persisting but feel some sort of nerve damage has occured and they have employed Tylenol 3 and soon Morphine to relieve the pain.
     Answer txt: Tylenol 3 and soon Morphine

The original text has an extra space which is not in the 'answer'.

  2. Sent: Fact is SCSI-2 controler chips allow near SCSI-2 speeds through a SCSI-1 device {As shown in the Mac Quadra} which skews some of the data of SCSI-1 vs IDE or ESDI test.
     Answer txt: SCSI-2 controler chips

Same as above

patil-suraj commented 4 years ago

Hi @Krak91 , thank you for reporting, I will take a look.

I believe this check shouldn't even take place since the answer should be in the sentence.

The call ans_start_idx = sent.index(answer_text) is needed to get the index of the answer span, because we need to highlight that span.
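
For context, roughly what happens at that step (just a sketch, assuming the <hl>-style highlight tokens that the hl format uses):

sent = "42 is the answer to life, the universe and everything."
answer_text = "42"

# Find where the answer occurs in the sentence; raises ValueError if it is not found verbatim
ans_start_idx = sent.index(answer_text)

# Wrap the span in highlight tokens so the QG model knows which answer to ask about
sent_hl = f"{sent[:ans_start_idx]}<hl> {answer_text} <hl>{sent[ans_start_idx + len(answer_text):]}"
# -> "<hl> 42 <hl> is the answer to life, the universe and everything."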

Krak91 commented 4 years ago

I see. I think the problem is that the model removes the extra spaces from the answers.

patil-suraj commented 4 years ago

I see. I think the problem is that the model removes the extra spaces from the answers.

@Krak91 , yes, you are right, the fix is easy.

Krak91 commented 4 years ago

Found a new case where this occurs (not from spaces). What caused it is probably encoding. I'm not sure if this should be addressed prior to using the model, or whether the pipeline/tokenizers should attempt to eliminate errors like this, hence reporting it for good measure :)

At ans_start_idx = sent.index(answer_text):

sent: The type of respiratory support for individuals with COVID‑19 related respiratory failure is being actively studied for people in the hospital, with some evidence that intubation can be avoided with a high flow nasal cannula or bi-level positive airway pressure.
answer_text: COVID ⁇ 19

Krak91 commented 4 years ago

Also, this one is more bizarre; it capitalized the word:

sent: === Anti-cytokine storm === Cytokine release syndrome (CRS) can be a complication in the later stages of severe COVID-19.
answer_text: Cytokine storm

patil-suraj commented 4 years ago

Hi @Krak91 , thank you for reporting these issues, I had totally overlooked these cases.

From the above examples it seems that the text should be cleaned/pre-processed before tokenizing, for example converting COVID‑19 to COVID 19, removing special characters, etc.

Would it be a good idea to add a clean_fn argument where the user can supply a cleaning function which will be applied to the text before feeding it to the model?

Section titles like === Anti-cytokine storm === should also be removed. What do you think? I will reopen the issue.
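
A rough sketch of what such a user-supplied cleaning function could look like (clean_fn is only the proposed argument name, and the replacements below are illustrative, not part of the library):

import re

def clean_fn(text):
    # Hypothetical pre-processing hook applied to the text before it reaches the model
    text = text.replace("\u2011", "-")            # non-breaking hyphen (as in COVID‑19) -> plain hyphen
    text = re.sub(r"=+\s*[^=]+\s*=+", " ", text)  # drop wiki-style "=== Section title ===" headings
    text = re.sub(r"\s+", " ", text)              # collapse repeated whitespace
    return text.strip()

# Hypothetical usage: nlp = pipeline("question-generation", clean_fn=clean_fn)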

patil-suraj commented 4 years ago

Also, if you use the prepend-format model you won't face this problem, as the answer is simply prepended. I'll also train t5-base, both single-task and multitask, using the prepend format to deal with such issues.

Krak91 commented 4 years ago

Adding a cleaning function argument will also let people know that there isn't one already in place, so totally agree. One more thing that would take care of many cases is changing ans_start_idx = sent.index(answer_text) to ans_start_idx = sent.lower().index(answer_text.lower()), as in the sketch below. However, regarding the last example, it would help if we could identify why it capitalized the token following the dash, IF it is in fact using that token and not the same word from somewhere else in the sentence for some weird reason (in this case there is another instance of the same word which is capitalized).
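
Just a sketch of that case-insensitive lookup (slicing the original sent afterwards keeps the sentence's own casing):

sent = "=== Anti-cytokine storm === Cytokine release syndrome (CRS) can be a complication in the later stages of severe COVID-19."
answer_text = "Cytokine storm"

# sent.index(answer_text) raises ValueError because of the capital C,
# but a case-insensitive lookup finds the span; the slice preserves the original casing.
ans_start_idx = sent.lower().index(answer_text.lower())
answer_text = sent[ans_start_idx:ans_start_idx + len(answer_text)]  # -> "cytokine storm"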

Also, how to use the prepend format model? I'm not sure what that is.

patil-suraj commented 4 years ago

@sebastien-mcrae , @Krak91 ,

Here answer extraction is done using a text2text approach, so in some cases it might generate text which is not in the source text (this happened very rarely in my experiments). If the text is cleaned and contains some answer-like info, then these types of errors won't occur. "I won't do it." has no answer-like text, which is why the generation is weird. I will check for the exception to prevent this error.

Also, now I'm thinking of doing answer extraction using BERT as a span extraction task, which will avoid such issues.

patil-suraj commented 4 years ago

@Krak91 To use the prepend format, just provide the prepend model and specify the format using the qg_format argument. Right now there's only one prepend-format model (t5-small); I hope to add more soon. Here's how you can use it:

nlp = pipeline("question-generation", model="valhalla/t5-small-qg-prepend", qg_format="prepend")
nlp("42 is the answer to life, the universe and everything.")
JoshSilverb commented 4 years ago

I'm getting similar issues, but when I check what is causing the error thrown at ans_start_idx = sent.index(answer_text), I get an answer_text that is a substring of a previous sentence in the text fed into the pipeline, not something relevant to the sentence currently in sent.

patil-suraj commented 4 years ago

hi @JoshSilverb , thank you for reporting this! Can you post the code/example to reproduce this?

JoshSilverb commented 4 years ago

Hi, thanks for the quick response! I ran

nlp = pipeline("question-generation")

qna = nlp(summary)

where summary is the text in summary_black_death.txt, and got the substring not found error. I had pipelines.py print the answer_text and corresponding sent and got the following output (cut some because it was getting kind of long):

answer_text: 1345

sent: The most authoritative account at the time came from the medical faculty in Paris in a report to the king of France that blamed the heavens, in the form of a conjunction of three planets in 1345 that caused a "great pestilence in the air".

...

answer_text: a "great pestilence in the air"
sent: In addition to the bubonic infection, others point to additional septicemic (a type of "blood poisoning") and pneumonic (an airborne plague that attacks the lungs before the rest of the body) forms of the plague, which lengthen the duration of outbreaks throughout the seasons and help account for its high mortality rate and additional recorded symptoms.

Traceback (most recent call last):
  File "end_to_end_v1.py", line 114, in <module>
    main()
  File "end_to_end_v1.py", line 100, in main
    qna = nlp(summary)
  File "/mnt/c/Users/joshs/Desktop/Zazraak/qg/e2e/question_generation/pipelines.py", line 60, in __call__
    qg_examples = self._prepare_inputs_for_qg_from_answers_hl(sents, answers)
  File "/mnt/c/Users/joshs/Desktop/Zazraak/qg/e2e/question_generation/pipelines.py", line 145, in _prepare_inputs_for_qg_from_answers_hl
    ans_start_idx = sent.index(answer_text)
ValueError: substring not found

It looks like the answer_text for the sentence that causes the error actually comes from the first sentence.

patil-suraj commented 4 years ago

Hi @JoshSilverb , yes, this is an issue. This answer extraction method doesn't seem to be effective when sentences are long. I will try to fix this by the end of this week. For now, instead of searching for the answer in the same sentence, I will search in the whole text. This will fix it temporarily.

Let me know if you have any other ideas.
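
Roughly, that temporary workaround would look something like this inside _prepare_inputs_for_qg_from_answers_hl (a sketch only; it assumes the loop variables sent and answer_text plus the full sentence list sents from the traceback above):

if answer_text in sent:
    ans_start_idx = sent.index(answer_text)
else:
    # Fall back to whichever sentence in the whole text actually contains the answer
    matches = [s for s in sents if answer_text in s]
    if matches:
        sent = matches[0]
        ans_start_idx = sent.index(answer_text)
    else:
        ans_start_idx = None  # no verbatim match anywhere; skip this answer instead of raising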

danielmoore19 commented 4 years ago

Greetings! I have been digging into this as well and thought I'd toss my findings into the ring. This is my first time, so please bear with me :)

I am printing the sents and answers before it starts the loop. Normally it prints one answer for a sent:

['Language, also, far more dubiously, is meant to define the other--and, in this case, the other is refusing to be defined by a language that has never been able to recognize him.'] [['Language ']] [{'answer': 'Language', 'question': 'What is meant to define the other?'}]

But obviously, here in the error case, it gave us three:

['Jazz, for example, is a very specific sexual term, as in jazz me, baby, but white people purified it into the Jazz Age.'] [['Jazz ', ' white ', ' Jazz Me, baby ']]

ValueError                                Traceback (most recent call last)

in ()
----> 1 multi_w_t5_base("Jazz, for example, is a very specific sexual term, as in jazz me, baby, but white people purified it into the Jazz Age.")

2 frames

in _prepare_inputs_for_qg_from_answers_hl(self, sents, answers)
    142         answer_text = answer_text.strip()
    143
--> 144         ans_start_idx = sent.index(answer_text)
    145
    146         sent = f"{sent[:ans_start_idx]} {answer_text} {sent[ans_start_idx + len(answer_text): ]}"

ValueError: substring not found

So it seems it is not merely long sentences, since the previous sentence is longer than the latter. But maybe it is something in how it is defining answers? Which would actually be great, imo: if you can generate three answers from a sentence and then give me three questions, one for each answer, that is terrific. So it is the indexing that is throwing things off, if I understand correctly, because sent.index(answer_text) is not matching up (i.e. 1 sent and 3 answer_texts).

nitishkmr005 commented 3 years ago

Hi Suraj,

Could you let me know if this issue has been fixed?

vanya-r commented 3 years ago

I'm seeing the same issue, as far as I can tell. In my case the problem was with self.ans_tokenizer.decode(ids, skip_special_tokens=False) for ids in outs, which generates <pad> at the start of each output.

KTG1 commented 2 years ago

Hello, is this problem fixed?

casafurix commented 2 years ago

Hello... is the problem fixed?

bowenyan commented 1 year ago

I'm seeing the same issue, as far as I can tell. In my case the problem was with self.ans_tokenizer.decode(ids, skip_special_tokens=False) for ids in outs, which generates <pad> at the start of each output.

Changed "skip_special_tokens=True" works with me.

def _extract_answers(self, context):
    sents, inputs = self._prepare_inputs_for_ans_extraction(context)
    inputs = self._tokenize(inputs, padding=True, truncation=True)

    outs = self.ans_model.generate(
        input_ids=inputs['input_ids'].to(self.device),
        attention_mask=inputs['attention_mask'].to(self.device),
        max_length=32,
    )

    # skip_special_tokens=True drops special tokens such as the leading <pad> from each decoded output
    dec = [self.ans_tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]
    answers = [item.split('<sep>') for item in dec]
    answers = [i[:-1] for i in answers]

    return sents, answers