patil-suraj / question_generation

Neural question generation using transformers
MIT License
1.11k stars, 348 forks

ValueError: substring not found #22

Open vidyap-xgboost opened 4 years ago

vidyap-xgboost commented 4 years ago

I ran the following code on colab:

from pipelines import pipeline
nlp = pipeline("question-generation")

text = """FIO Labs is an independent, privately-owned company with a global reach.
With the agility of a startup and the ability of a conglomerate, we help
businesses understand and adopt Artificial Intelligence & Data Security
technologies in the right framework and help them stay aligned with their
strategic objectives."""

nlp(text)

and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-fdf5d062d391> in <module>()
----> 1 nlp(text)

1 frames
/content/question_generation/pipelines.py in _prepare_inputs_for_qg_from_answers_hl(self, sents, answers)
    140                 answer_text = answer_text.strip()
    141 
--> 142                 ans_start_idx = sent.index(answer_text)
    143 
    144                 sent = f"{sent[:ans_start_idx]} <hl> {answer_text} <hl> {sent[ans_start_idx + len(answer_text): ]}"

ValueError: substring not found

I don't understand why this error occurs when the text I passed is valid.

This also occurred with

text8 = """Prashanth Bandi is one of the highly regarded consultants in the IT world, he \
is a Technology Evangelist with 18 years of consulting experience dealing with \
diverse problems and delivering technology solutions to complex business \
challenges. His adaptive nature, perseverance and genuine passion for \
technology makes him the torch bearer of our company.
"""

patil-suraj commented 4 years ago

Hi @vidyap-xgboost, this is a known issue and I'm working on the fix, see issue #11. Sorry for the inconvenience. Will let you know when it's fixed.

vidyap-xgboost commented 4 years ago

Thanks for the heads-up. Will look forward to the fix.

danielmoore19 commented 4 years ago

For the time being:

if answer_text in sent:
    ans_start_idx = sent.index(answer_text)
else:
    continue

This will allow you to bypass the ValueError by skipping mismatched answer/sentence pairs. In a test, it kept matching answer/sentence pairs beyond the point where I had been getting the error: it had been failing after 3 matches, and with the if statement it completed the task with a total of 10 answer/question pairs. It is not a fix, but it will let you process an entire, complex text.
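For context, here is a minimal sketch of how that guard could sit inside a highlighting loop. The function name `highlight_answers`, its arguments, and its return shape are hypothetical. Only the `strip()`, `index()` and `<hl>` formatting steps come from the traceback above, and the real `pipelines.py` may be organized differently.

```python
def highlight_answers(sents, answers):
    """Wrap each answer that occurs in its sentence with <hl> markers.

    Hypothetical helper: `sents` is a list of sentences and `answers` holds
    one list of answer strings per sentence. Pairs whose answer text does not
    occur in the sentence are collected in `skipped` instead of raising.
    """
    highlighted = []
    skipped = []
    for sent, sent_answers in zip(sents, answers):
        for answer_text in sent_answers:
            answer_text = answer_text.strip()
            # Guard: skip empty or mismatched answers rather than calling index().
            if not answer_text or answer_text not in sent:
                skipped.append((sent, answer_text))
                continue
            ans_start_idx = sent.index(answer_text)
            ans_end_idx = ans_start_idx + len(answer_text)
            highlighted.append(
                f"{sent[:ans_start_idx]} <hl> {answer_text} <hl> {sent[ans_end_idx:]}"
            )
    return highlighted, skipped


# Example input: the second answer does not occur in its sentence.
sents = ["FIO Labs is an independent company.", "We help businesses adopt AI."]
answers = [["FIO Labs"], ["Data Security"]]
highlighted, skipped = highlight_answers(sents, answers)
```

Returning the skipped pairs alongside the highlighted sentences makes it easier to see how many answers were dropped, which matters because the guard hides the mismatch rather than repairing it.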

I was using James Baldwin's essay "If Black English Isn't a Language, Then Tell Me, What Is?" specifically for the purpose of stress testing. You can find it here:

(https://archive.nytimes.com/www.nytimes.com/books/98/03/29/specials/baldwin-english.html?_r=1&oref=slogin)

(apologies for the code format, i couldn't get it to break by line)

ankitkr3 commented 4 years ago

Hi @patil-suraj is this issue fixed now ?

danielmoore19 commented 4 years ago

I would suggest using the if statement as a workaround. The issue is that sometimes the answer order gets out of place, sometimes the answer span and the answer are not exact matches (one is "Capital" and the other is "capital"), and sometimes the model actually creates an answer that does not appear in the text. The first is an issue with ordering what comes out of the I/O, the second can be fixed by calling .lower() on both sent and answer_text before sent.index(), and the last is an issue inside the model itself. The if statement is therefore the only thing that will bypass all three errors. Using .lower() together with the if statement will ensure you still get answers that appear in the text but do not match in capitalization.

Added: there is a rarer case where an "s" gets added to or dropped from the answer span.
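A possible sketch of the case-insensitive lookup combined with the skip, assuming a hypothetical helper named `find_answer_span`. It uses `re.search` with `re.IGNORECASE` rather than lowercasing both strings, so the returned indices refer to the original sentence. Answers the model invented, or answers with an added or dropped "s", still return `None` and would be skipped.

```python
import re


def find_answer_span(sent, answer_text):
    """Return (start, end) of answer_text in sent, or None if it is absent.

    Hypothetical helper: tries an exact match first, then a
    case-insensitive one.
    """
    answer_text = answer_text.strip()
    if not answer_text:
        return None
    # Exact match first, so the common case avoids the regex.
    start = sent.find(answer_text)
    if start != -1:
        return start, start + len(answer_text)
    # Fall back to a case-insensitive search; re.escape treats the answer
    # as literal text even if it contains regex metacharacters.
    match = re.search(re.escape(answer_text), sent, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.start(), match.end()


span = find_answer_span("The Capital of France is Paris.", "capital")
missing = find_answer_span("The Capital of France is Paris.", "Berlin")
```

A caller can slice `sent[start:end]` to highlight the text as it is written in the sentence, and skip the pair when the result is `None`.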

ronaldgevern commented 2 years ago

When you use string_object.index(substring), Python looks for an occurrence of substring in string_object. If the substring is present, the method returns the index of its first occurrence; otherwise, it raises ValueError: substring not found.
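As a small illustration of that behavior, the snippet below contrasts `str.index` with `str.find`, which reports a missing substring by returning -1 instead of raising. The sentence and variable names are only examples.

```python
sentence = "FIO Labs is an independent, privately-owned company."

# str.index raises ValueError when the substring is absent.
try:
    position = sentence.index("conglomerate")
except ValueError:
    position = None

# str.find signals the same situation by returning -1 instead.
found_at = sentence.find("conglomerate")

# Both methods return the index of the first occurrence when it exists.
present_at = sentence.find("Labs")
```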

Using Python’s “in” operator

The simplest and fastest way to check whether a string contains a substring in Python is the “in” operator. It returns True if the string contains the substring; otherwise, it returns False.

text = "Hello, World!"
print("World" in text)  # output is True

Python's “in” operator takes two operands, one on the left and one on the right, and returns True if the left string is contained within the right string. Note that the “in” operator is case sensitive, i.e., it treats uppercase and lowercase characters as different.
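A short sketch of that case sensitivity, and of one common way around it: normalizing both strings with `str.casefold()` before the check. The variable names are illustrative.

```python
greeting = "Hello, World!"

exact = "World" in greeting        # same capitalization as in the string
wrong_case = "world" in greeting   # differs only in the case of "W"

# Fold both sides to a common case before testing membership.
ignoring_case = "world".casefold() in greeting.casefold()
```

Here `exact` and `ignoring_case` are True, while `wrong_case` is False.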