goldsmith / Wikipedia

A Pythonic wrapper for the Wikipedia API
https://wikipedia.readthedocs.org/
MIT License
2.87k stars, 519 forks

Disambiguation error gives titles that lead to the same error #244

Open Pejosonic opened 3 years ago

Pejosonic commented 3 years ago

Hi, this happened while working with the Wikipedia language set to 'es'. I'm trying to get wikipedia.page('Alfa Romeo Giulia') and I'm getting a DisambiguationError with these options:

['Alfa Romeo Giulia', 
'Alfa Romeo Giulia GT Veloce', 
'Alfa Romeo Giulia TZ', 
'Alfa Romeo Giulia']

The first and last options are identical to my original query, so they lead me to the same error. I cannot get the actual URLs from the error arguments.

For this case they would be:

https://es.wikipedia.org/wiki/Alfa_Romeo_Giulia_(1962)
https://es.wikipedia.org/wiki/Alfa_Romeo_Giulia_GT_Veloce
https://es.wikipedia.org/wiki/Alfa_Romeo_Giulia_TZ
https://es.wikipedia.org/wiki/Alfa_Romeo_Giulia_(2015)

Thanks!

SchulerSimon commented 3 years ago

I have the same problem with wikipedia.summary. This happened with lang 'de'.

import wikipedia
wikipedia.set_lang("de")

try:
    summary: str = wikipedia.summary("Schlacht von Pjöngjang")
except wikipedia.exceptions.DisambiguationError as e:
    new_query = e.options[-1]  # select the last suggestion
    summary: str = wikipedia.summary(new_query)

This yields another wikipedia.exceptions.DisambiguationError.
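As a possible workaround until the option list is fixed, the retry can skip any suggestion that repeats a title that has already failed. This is only a sketch: `first_unambiguous` and `AmbiguousTitle` are hypothetical names, not part of the wikipedia package, and the lookup function is passed in so the logic can be tried without network access.

```python
from typing import Callable, Iterable, Optional


class AmbiguousTitle(Exception):
    """Hypothetical stand-in for wikipedia.exceptions.DisambiguationError."""

    def __init__(self, options: Iterable[str]):
        super().__init__("ambiguous title")
        self.options = list(options)


def first_unambiguous(
    query: str,
    fetch: Callable[[str], str],
    ambiguous_error: type = AmbiguousTitle,
) -> Optional[str]:
    """Return fetch(title) for the first title that resolves, else None."""
    try:
        return fetch(query)
    except ambiguous_error as err:
        options = err.options

    tried = {query}
    for option in options:
        if option in tried:
            # Same text as a lookup that already failed, so skip it.
            continue
        tried.add(option)
        try:
            return fetch(option)
        except ambiguous_error:
            # This suggestion is itself ambiguous; move on to the next one.
            continue
    return None
```

With the real library the call would presumably look like first_unambiguous("Schlacht von Pjöngjang", wikipedia.summary, wikipedia.exceptions.DisambiguationError). Note that this returns the first suggestion that resolves, which is not necessarily the page you wanted, and it does not recover the pages hidden behind the duplicated titles.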

LaZoRBear commented 3 years ago

There is a quick fix that I tested with French, and it seemed to work well. The main problem is that, when handling the disambiguation, the code puts each link's visible HTML text into that list rather than its title attribute, which corresponds to the actual title of the page.

For example, searching for "Émancipation" returns "Emancipation" as the last element of the disambiguation list. That is a problem because it is effectively the same as the original search, whereas the link's title is "Emancipation (Stargate)". A search with that title yielded the correct page.

So I updated this line in my wikipedia.py file from this: may_refer_to = [li.a.get_text() for li in filtered_lis if li.a]

to this: may_refer_to = [li.a.get('title') for li in filtered_lis if li.a]

So far it has worked well in my limited testing in French. It would also be possible to return the href of each link instead of the title, so that the page can be requested directly with a GET request.
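To illustrate why the visible link text and the title attribute can differ, here is a small standalone sketch. It uses Python's standard html.parser rather than the BeautifulSoup calls in the line quoted above, and the markup in SAMPLE is made up to resemble a disambiguation list item, not copied from a real API response. LinkCollector and SAMPLE are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (visible text, title attribute) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, title) pairs
        self._title = None   # title attribute of the <a> being read
        self._text = None    # text chunks of the <a> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._title = dict(attrs).get("title")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            self.links.append(("".join(self._text), self._title))
            self._text = None


# Made-up markup in the shape of a disambiguation list item.
SAMPLE = (
    "<ul><li>"
    '<a href="/wiki/Emancipation_(Stargate)" title="Emancipation (Stargate)">'
    "Emancipation</a>, an episode"
    "</li></ul>"
)

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()

for text, title in collector.links:
    print(f"text={text!r} title={title!r}")
# prints: text='Emancipation' title='Emancipation (Stargate)'
```

The text is what the unpatched list comprehension would return, and the title is what the patched one would return, assuming the real disambiguation markup carries a title attribute on each link as it did in the French test.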