huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Question-Answering pipeline doesn't work anymore with long text #6144

Closed dipanjanS closed 3 years ago

dipanjanS commented 3 years ago

Transformers version: 3.0.2

The question-answering models no longer seem to work with long text. Any idea why this is happening? I have tried with the default model in the pipeline as well as with specific models.

Sample code:

from transformers import pipeline

nlp_qa = pipeline('question-answering') # 1st try
nlp_qa = pipeline('question-answering', model='deepset/roberta-base-squad2') # 2nd try

context = """
Coronaviruses are a large family of viruses which may cause illness in animals or humans.  
In humans, several coronaviruses are known to cause respiratory infections ranging from the 
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). 
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus. 
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. 
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness. 
Other symptoms that are less common and may affect some patients include aches 
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea, 
loss of taste or smell or a rash on skin or discoloration of fingers or toes. 
These symptoms are usually mild and begin gradually. 
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment. 
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing. 
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems, 
diabetes, or cancer, are at higher risk of developing serious illness.  
However, anyone can catch COVID-19 and become seriously ill.  
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath, 
chest pain/pressure, or loss of speech or movement should seek medical attention immediately. 
If possible, it is recommended to call the health care provider or facility first, 
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus. 
The disease spreads primarily from person to person through small droplets from the nose or mouth, 
which are expelled when a person with COVID-19 coughs, sneezes, or speaks. 
These droplets are relatively heavy, do not travel far and quickly sink to the ground. 
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.  
This is why it is important to stay at least 1 meter away from others. 
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.  
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.  
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others. 
This is especially important if you are standing by someone who is coughing or sneezing.  
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild, 
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating. 
"""

nlp_qa(context=context, question='What is a coronavirus ?')

Error Message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-ddac1f9cb68e> in <module>()
----> 1 nlp_qa(context=context, question='What is a coronavirus ?')

1 frames
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
   1314                         ),
   1315                     }
-> 1316                     for s, e, score in zip(starts, ends, scores)
   1317                 ]
   1318 

KeyError: 0

This used to work before version 3, as I recall. I would really appreciate some help on this.
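In the meantime, one possible workaround is to split the context into overlapping chunks, run the pipeline on each chunk separately, and keep the highest-scoring answer. A rough sketch (the helper name and chunk sizes are my own, not part of the library):

```python
def answer_over_chunks(qa_fn, question, context, chunk_words=150, overlap=30):
    """Run qa_fn over overlapping word-level chunks of the context and
    return the best-scoring answer. qa_fn must accept question=/context=
    keyword arguments and return a dict with a 'score' key, as the
    Hugging Face question-answering pipeline does."""
    words = context.split()
    step = max(chunk_words - overlap, 1)
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), step)]
    results = [qa_fn(question=question, context=chunk) for chunk in chunks]
    # keep the answer the model was most confident about across all chunks
    return max(results, key=lambda r: r["score"])
```

This trades some accuracy at chunk boundaries for robustness, since each chunk stays well under the model's maximum sequence length.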

dipanjanS commented 3 years ago

Also, looking back at my earlier code:

!pip install transformers==2.11.0

[screenshot: the same code producing a valid answer under transformers 2.11.0]

It still works for me with a larger context (same code as above). Any idea which default model is being used there, and whether it would still work with transformers 3.x?

dipanjanS commented 3 years ago

@LysandreJik, @sshleifer it would be great if you could look into this or assign it to the right folks.

LysandreJik commented 3 years ago

Assigned @mfuntowicz, the master of pipelines. He's on holiday right now, so I'll try to look into it in the coming days.

melaniebeck commented 3 years ago

It isn't just long contexts. I was running some QA on SQuAD 2.0 and came across an instance where I received that error for a given context and question, even though the context is not that long.

from transformers import pipeline

model_path = "twmkn9/distilbert-base-uncased-squad2"

hfreader = pipeline('question-answering', model=model_path, tokenizer=model_path, device=0)

context = """
The Norman dynasty had a major political, cultural and military impact on 
medieval Europe and even the Near East. The Normans were famed for their 
martial spirit and eventually for their Christian piety, becoming exponents of 
the Catholic orthodoxy into which they assimilated. They adopted the 
Gallo-Romance language of the Frankish land they settled, their dialect 
becoming known as Norman, Normaund or Norman French, an important literary 
language. The Duchy of Normandy, which they formed by treaty with the French 
crown, was a great fief of medieval France, and under Richard I of Normandy was 
forged into a cohesive and formidable principality in feudal tenure. The 
Normans are noted both for their culture, such as their unique Romanesque 
architecture and musical traditions, and for their significant military 
accomplishments and innovations. Norman adventurers founded the Kingdom of 
Sicily under Roger II after conquering southern Italy on the Saracens and 
Byzantines, and an expedition on behalf of their duke, William the Conqueror, 
led to the Norman conquest of England at the Battle of Hastings in 1066. Norman 
cultural and military influence spread from these new European centres to the 
Crusader states of the Near East, where their prince Bohemond I founded the 
Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, 
to Ireland, and to the coasts of north Africa and the Canary Islands.
"""

question2 = "Who assimilted the Roman language?"

hfreader(question=question2, context=context)

Error Message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-144-45135f680e80> in <module>()
----> 1 hfreader(question=question2, context=context)

1 frames
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
   1314                         ),
   1315                     }
-> 1316                     for s, e, score in zip(starts, ends, scores)
   1317                 ]
   1318 

KeyError: 0

But if I change the question and keep the same context, the pipeline completes successfully.

question1 = "Who was famed for their Christian spirit?"
hfreader(question=question1, context=context)

Output

{'answer': 'Normans', 'end': 127, 'score': 0.5337043597899815, 'start': 120}
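Until this is fixed, a defensive wrapper that catches the KeyError can at least keep a batch evaluation from crashing partway through. A minimal sketch (the wrapper is my own, not part of the library):

```python
def safe_qa(qa_fn, question, context):
    """Call a question-answering pipeline, returning None instead of
    raising when the KeyError bug in transformers 3.0.x is hit, so a
    loop over many (question, context) pairs can keep going."""
    try:
        return qa_fn(question=question, context=context)
    except KeyError:
        return None
```

Failed pairs can then be collected and retried once the upstream fix lands.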
dipanjanS commented 3 years ago

Thanks @melaniebeck, I encountered this as well just earlier today. It would definitely be great if the team could figure out how these issues can be resolved in transformers v3.x.

acul3 commented 3 years ago

I also encountered this issue (KeyError: 0).

It's not even long text (about 8-12 words).

Sometimes it occurs when I change a word in the question to an out-of-vocabulary word.

    rv = self.dispatch_request()
  File "/home/samsul/.local/lib/python3.6/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/samsul/question-answering/app.py", line 23, in search
    answer = nlp({'question': question,'context': context})
  File "/home/samsul/.local/lib/python3.6/site-packages/transformers/pipelines.py", line 1316, in __call__
    for s, e, score in zip(starts, ends, scores)
  File "/home/samsul/.local/lib/python3.6/site-packages/transformers/pipelines.py", line 1316, in <listcomp>
    for s, e, score in zip(starts, ends, scores)
KeyError: 0
LysandreJik commented 3 years ago

Hello! There have been a few fixes to the pipelines since version v3.0.2 came out. I can reproduce this issue on v3.0.1 and v3.0.2, but not on the master branch, so it has probably been fixed already.

Could you try installing from source (pip install git+https://github.com/huggingface/transformers) and let me know if that fixes your issue?

acul3 commented 3 years ago

Hi @LysandreJik,

it seems the problem still occurs, but now it's KeyError: 17.

Input:

!pip install git+https://github.com/huggingface/transformers
from transformers import pipeline

nlp = pipeline('question-answering', model='a-ware/xlmroberta-squadv2', device=0)
# Indonesian: question = "who is samsul's wife?",
# context = "my name is samsul, I am raisa's husband"
nlp({'question': "siapa istri samsul?", 'context': "nama saya samsul, saya adalah suami raisa"})

Error:

/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
   1676                         ),
   1677                     }
-> 1678                     for s, e, score in zip(starts, ends, scores)
   1679                 ]
   1680 

/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
   1676                         ),
   1677                     }
-> 1678                     for s, e, score in zip(starts, ends, scores)
   1679                 ]
   1680 

KeyError: 17

I also tried the case from @dipanjanS (the first post) and still got an error:

/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <dictcomp>(.0)
   1636                     with torch.no_grad():
   1637                         # Retrieve the score for the context tokens only (removing question tokens)
-> 1638                         fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}
   1639                         start, end = self.model(**fw_args)[:2]
   1640                         start, end = start.cpu().numpy(), end.cpu().numpy()

ValueError: expected sequence of length 384 at dim 1 (got 317)
bdalal commented 3 years ago

https://github.com/huggingface/transformers/blob/f6cb0f806efecb64df40c946dacaad0adad33d53/src/transformers/pipelines.py#L1618 is causing this issue. Padding to max_length solves this problem. Currently, if the text is long, the final span is not padded to the max_seq_len of the model.
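The shape mismatch is easy to reproduce in isolation: spans of unequal length cannot be stacked into a single rectangular tensor, which is exactly what the ValueError above shows (expected length 384, got 317 for the final span). A minimal sketch of the padding idea (my own illustration, not the actual pipeline code):

```python
def pad_spans(spans, max_len, pad_id=0):
    """Pad each token-id span to max_len so the batch is rectangular.
    Without this, the final (shorter) span of a long text breaks
    tensor construction, as seen in the ValueError above."""
    return [span + [pad_id] * (max_len - len(span)) for span in spans]

# last span of a long document comes out shorter than max_seq_len
spans = [[1] * 384, [1] * 317]
padded = pad_spans(spans, 384)
assert all(len(s) == 384 for s in padded)
```

The real fix presumably just passes the model's max length to the tokenizer's padding logic so every overflow span ends up the same size.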

dipanjanS commented 3 years ago

Yes, agreed. I think this is related to the recent code push from the PR linked earlier. It would be great if the HF team could look into this!


LysandreJik commented 3 years ago

Solved by https://github.com/huggingface/transformers/issues/6875

dipanjanS commented 3 years ago

Awesome thanks folks!