Open baerbock opened 5 years ago
Thanks for trying wptools @baerbock!
We should have support for this with boxterm='ЖП линия':
>>> help(wptools.page)
class WPToolsPage(wptools.restbase.WPToolsRESTBase, wptools.wikidata.WPToolsWikidata, wptools.core.WPTools)
| WPtools Page class, derived from wptools.core
|
| Method resolution order:
| WPToolsPage
| wptools.restbase.WPToolsRESTBase
| wptools.wikidata.WPToolsWikidata
| wptools.core.WPTools
| __builtin__.object
|
| Methods defined here:
|
| __init__(self, *args, **kwargs)
| Returns a WPToolsPage object
|
| Gets a random title without arguments
|
| Optional positional {params}:
| - [title]: <str> Mediawiki page title, file, category, etc.
|
| Optional keyword {params}:
| - [boxterm]: <str> Infobox title name or substring
| - [endpoint]: <str> alternative API endpoint (default=/w/api.php)
| - [lang]: <str> Mediawiki language code (default=en)
| - [pageid]: <int> Mediawiki pageid
| - [variant]: <str> Mediawiki language variant
| - [wiki]: <str> alternative wiki site (default=wikipedia.org)
| - [wikibase]: <str> Wikidata database ID (e.g. 'Q1')
|
| Optional keyword {flags}:
| - [silent]: <bool> do not echo page data if True
| - [skip]: <list> skip actions in this list
| - [verbose]: <bool> verbose output to stderr if True
...
but that currently raises a UnicodeDecodeError in this case:
>>> page = wptools.page('Железопътна_линия_1_(България)', lang='bg', boxterm='ЖП линия')
>>> page.get()
bg.wikipedia.org (query) Железопътна_линия_1_(Б�...
bg.wikipedia.org (parse) 596059
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "wptools/page.py", line 522, in get
self.get_parse(False, proxy, timeout)
File "wptools/page.py", line 603, in get_parse
self._get('parse', show, proxy, timeout)
File "wptools/core.py", line 183, in _get
self._set_data(action)
File "wptools/page.py", line 204, in _set_data
self._set_parse_data()
File "wptools/page.py", line 255, in _set_parse_data
infobox = utils.get_infobox(parsetree, boxterm)
File "wptools/utils.py", line 37, in get_infobox
if title and boxterm in title:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
I'll have to take a closer look at what's causing this.
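One plausible reading of the traceback, assuming the session above ran under Python 2 (the help output lists __builtin__.object): the boxterm arrives as a UTF-8 byte string, the template title is unicode, and the `in` test implicitly decodes the bytes as ASCII. The sketch below is written for Python 3, so it reproduces that decode step explicitly and then shows one way to normalize both sides before comparing. The helpers to_text and boxterm_matches and the sample title are hypothetical and are not part of wptools.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding bytes explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def boxterm_matches(title, boxterm):
    """Hypothetical helper: substring test after normalizing both sides to text."""
    if not title or not boxterm:
        return False
    return to_text(boxterm) in to_text(title)


# Roughly what a Python 2 byte-string argument would hold for this boxterm.
boxterm_bytes = "ЖП линия".encode("utf-8")

# Hypothetical template title, already decoded to text.
title = "ЖП линия инфо"

# Reproduce the suspected failure mode: treating the UTF-8 bytes as ASCII.
try:
    boxterm_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# With explicit UTF-8 decoding the substring test works as intended.
matched = boxterm_matches(title, boxterm_bytes)
```

If this reading is right, decoding boxterm once where it enters the library would be less invasive than patching every comparison, but that is a guess until the code path is checked.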
Sorry for the delay here. Hope to get to this soon... 😃
I would like to extract the infobox of the Bulgarian Railway Line No. 1 article, but the extraction fails.
Are infoboxes not detected if they are named in an unusual manner (ЖП линия)?