siznax / wptools

Wikipedia tools (for Humans): easily extract data from Wikipedia, Wikidata, and other MediaWikis
MIT License

Unicode error with non-ascii Infobox boxterm #138

Open baerbock opened 5 years ago

baerbock commented 5 years ago

I would like to extract the infobox of the Bulgarian Railway Line No. 1 article.

import wptools
page = wptools.page('Железопътна линия 1 (България)', lang='bg')
page.get_parse()
page.data['ЖП линия']

which fails.

Are infoboxes not detected if they are named in an unusual manner (ЖП линия)?

siznax commented 5 years ago

Thanks for trying wptools @baerbock!

We should have support for this with boxterm=ЖП линия:

>>> help(wptools.page)
class WPToolsPage(wptools.restbase.WPToolsRESTBase, wptools.wikidata.WPToolsWikidata, wptools.core.WPTools)
 |  WPtools Page class, derived from wptools.core
 |
 |  Method resolution order:
 |      WPToolsPage
 |      wptools.restbase.WPToolsRESTBase
 |      wptools.wikidata.WPToolsWikidata
 |      wptools.core.WPTools
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, *args, **kwargs)
 |      Returns a WPToolsPage object
 |
 |      Gets a random title without arguments
 |
 |      Optional positional {params}:
 |      - [title]: <str> Mediawiki page title, file, category, etc.
 |
 |      Optional keyword {params}:
 |      - [boxterm]: <str> Infobox title name or substring
 |      - [endpoint]: <str> alternative API endpoint (default=/w/api.php)
 |      - [lang]: <str> Mediawiki language code (default=en)
 |      - [pageid]: <int> Mediawiki pageid
 |      - [variant]: <str> Mediawiki language variant
 |      - [wiki]: <str> alternative wiki site (default=wikipedia.org)
 |      - [wikibase]: <str> Wikidata database ID (e.g. 'Q1')
 |
 |      Optional keyword {flags}:
 |      - [silent]: <bool> do not echo page data if True
 |      - [skip]: <list> skip actions in this list
 |      - [verbose]: <bool> verbose output to stderr if True
 ...

but that currently raises a UnicodeDecodeError in this case:

>>> page = wptools.page('Железопътна_линия_1_(България)', lang='bg', boxterm='ЖП линия')
>>> page.get()
bg.wikipedia.org (query) Железопътна_линия_1_(Б�...
bg.wikipedia.org (parse) 596059
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wptools/page.py", line 522, in get
    self.get_parse(False, proxy, timeout)
  File "wptools/page.py", line 603, in get_parse
    self._get('parse', show, proxy, timeout)
  File "wptools/core.py", line 183, in _get
    self._set_data(action)
  File "wptools/page.py", line 204, in _set_data
    self._set_parse_data()
  File "wptools/page.py", line 255, in _set_parse_data
    infobox = utils.get_infobox(parsetree, boxterm)
  File "wptools/utils.py", line 37, in get_infobox
    if title and boxterm in title:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

I'll have to take a closer look at what's causing this.
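
A plausible cause, going only by the traceback: the `__builtin__.object` in the help output suggests this session ran on Python 2, where a `'ЖП линия'` typed at the prompt is a byte string. If `title` is a unicode string, `boxterm in title` makes Python 2 decode the bytes as ASCII first, and 0xd0 is the first UTF-8 byte of `Ж`. The sketch below shows one way to make that membership test safe by converting both sides to text before comparing. `to_text` and `boxterm_in_title` are hypothetical helpers written for illustration, not functions in wptools, and UTF-8 input is an assumption.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers, not part of wptools. Assumes byte strings are UTF-8.


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value


def boxterm_in_title(title, boxterm):
    """True if boxterm occurs in title, for any mix of bytes and text."""
    if not title or not boxterm:
        return False
    # Compare text with text so no implicit ASCII decode is ever attempted.
    return to_text(boxterm) in to_text(title)


# A byte-string boxterm checked against a text title, as in the traceback.
print(boxterm_in_title(u"ЖП линия", u"ЖП линия".encode("utf-8")))
```

If that diagnosis holds, passing a unicode literal (`boxterm=u'ЖП линия'`) might work as an interim workaround on Python 2, though that is untested here.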

siznax commented 5 years ago

Sorry for the delay here. Hope to get to this soon... 😃