goldsmith / Wikipedia

A Pythonic wrapper for the Wikipedia API
https://wikipedia.readthedocs.org/
MIT License

Empty 'extract' in Wikipedia response causes 'TypeError: list indices must be integers, not str'. #32

Closed dmirylenka closed 10 years ago

dmirylenka commented 10 years ago
>>> import wikipedia
>>> wikipedia.page('Fully connected network', auto_suggest=False, redirect=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/wikipedia/wikipedia.py", line 211, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/wikipedia/wikipedia.py", line 224, in __init__
    self.load(redirect=redirect, preload=preload)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/wikipedia/wikipedia.py", line 276, in load
    self.__init__(title, redirect=redirect, preload=preload)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/wikipedia/wikipedia.py", line 224, in __init__
    self.load(redirect=redirect, preload=preload)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/wikipedia/wikipedia.py", line 250, in load
    pages = request['query']['pages']
TypeError: list indices must be integers, not str
dmirylenka commented 10 years ago

This seems to be the code that causes the problem:

    extract = request['query']['pages'][pageid]['extract']

    # extract should be of the form "REDIRECT <new title>"
    # ("REDIRECT" could be translated to current language)
    title = ' '.join(extract.split('\n')[0].split()[1:]).strip()

For this particular page ("Fully connected network"), the 'extract' is empty, so the title becomes empty as well. The code then requests the Wikipedia page with an empty title:

GET /w/api.php?inprop=url&format=json&ppprop=disambiguation&titles=&action=query&prop=info%7Cpageprops HTTP/1.1

This eventually causes the exception in these lines:

       request = _wiki_request(**query_params)
       pages = request['query']['pages']
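
To make the failure mode concrete, here is a minimal sketch of the parsing step quoted above, pulled out into a standalone function. The names `title_from_redirect_extract` and `checked_redirect_title` are hypothetical and not part of the library; the guard is only one possible way to fail early instead of sending a request with `titles=`.

```python
def title_from_redirect_extract(extract):
    """Same expression as the quoted code: drop the first word of the first line."""
    return ' '.join(extract.split('\n')[0].split()[1:]).strip()


def checked_redirect_title(extract):
    """Hypothetical guard: refuse to continue when no target title was found."""
    title = title_from_redirect_extract(extract)
    if not title:
        raise ValueError('redirect extract is empty; cannot determine target title')
    return title


# A well-formed redirect extract yields the target title.
print(repr(title_from_redirect_extract('REDIRECT Recommender system')))  # 'Recommender system'
# An empty extract silently yields an empty title, which is the case reported here.
print(repr(title_from_redirect_extract('')))  # ''
```

With the guard in place, the empty extract would surface as a clear error at the parsing step rather than as a TypeError two calls later.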
goldsmith commented 10 years ago

I'll look into it. Do you have any idea why extract would be empty? In my experience, the content of any redirect page should be of the form detailed in the comment.

dmirylenka commented 10 years ago

I am not very familiar with the Wikipedia API – I just started using it. So far I have only seen empty extracts for redirect pages. Another example:

http://en.wikipedia.org/w/api.php?prop=extracts&titles=Recommendation+systems&format=json&explaintext=&action=query

   {"query":{"pages":{"1648434":{"pageid":1648434,"ns":0,"title":"Recommendation systems","extract":""}}}}

Why aren't you using the 'redirects' key? E.g.

http://en.wikipedia.org/w/api.php?prop=info&titles=Recommendation+systems&format=json&action=query&redirects

   {"query":{"redirects":[{"from":"Recommendation systems","to":"Recommender system"}],"pages":{"596646":{"pageid":596646,"ns":0,"title":"Recommender system","contentmodel":"wikitext","pagelanguage":"en","touched":"2014-02-05T06:32:28Z","lastrevid":594009289,"counter":"","length":42649}}}}
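
A rough sketch of how the 'redirects' key could be used instead of parsing the extract. It works on an already decoded JSON response, so no network access is needed; the sample is the response above with most page fields trimmed. `resolve_redirect` is a hypothetical helper name, not something the library provides.

```python
import json

# The response shown above, with most of the page fields trimmed for brevity.
SAMPLE = (
    '{"query":{"redirects":[{"from":"Recommendation systems",'
    '"to":"Recommender system"}],"pages":{"596646":{"pageid":596646,'
    '"ns":0,"title":"Recommender system"}}}}'
)


def resolve_redirect(response, title):
    """Return the redirect target for `title`, or `title` itself if none is listed."""
    query = response.get('query') if isinstance(response, dict) else None
    if not isinstance(query, dict):
        # Unexpected shape: leave the title unchanged rather than guess.
        return title
    for entry in query.get('redirects', []):
        if entry.get('from') == title:
            return entry['to']
    return title


response = json.loads(SAMPLE)
print(resolve_redirect(response, 'Recommendation systems'))  # Recommender system
print(resolve_redirect(response, 'Recommender system'))      # Recommender system
```

This avoids depending on the wording of the redirect page text, which is what made the current approach fragile.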
goldsmith commented 10 years ago

To be honest, I never knew that 'redirects' was a key you could request in the MediaWiki API. Great catch! That's definitely a better solution than the hacky parsing it's doing now. I'll work on a patch this weekend.

SuzanaK commented 10 years ago

I get the same error with this line:

wikipedia.page('King Cobra (malt liquor)')

The error message is:

TypeError: list indices must be integers, not str
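
The message itself only says that a list was indexed with a string key, which suggests that some level of the decoded response was a list where the code expected a dict. Below is a defensive sketch, assuming that reading is right. `get_pages` is a hypothetical helper, and the `{'query': []}` shape is an illustrative guess, not a captured response.

```python
def get_pages(response):
    """Return response['query']['pages'], or raise a descriptive error."""
    query = response.get('query') if isinstance(response, dict) else None
    pages = query.get('pages') if isinstance(query, dict) else None
    if not isinstance(pages, dict):
        raise LookupError('no page data in API response: %r' % (response,))
    return pages


# Expected shape: a dict of page ids to page records.
print(get_pages({'query': {'pages': {'1648434': {'title': 'Recommendation systems'}}}}))

# Assumed failing shape: a list where a dict was expected.
try:
    get_pages({'query': []})
except LookupError as exc:
    print('lookup failed:', exc)
```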