martin-majlis / Wikipedia-API

Python wrapper for Wikipedia
MIT License
579 stars 76 forks

Newline / Space missing from .summary attribute #46

Closed gruffaren closed 1 year ago

gruffaren commented 2 years ago

The .summary attribute of a page omits the newline or space after a sentence that is followed by square-bracketed markers (such as [1]) on the Wikipedia page.

Example:

    import wikipediaapi as wiki_api  # import assumed from the alias used below

    wiki = wiki_api.Wikipedia(language="en")
    query = "planet"
    page = wiki.page(query)
    text = page.summary
    print(text[:400])

which queries the article https://en.wikipedia.org/wiki/Planet and returns:

A planet is an astronomical body orbiting a star or stellar remnant that is massive enough to be rounded by its own gravity, is not massive enough to cause thermonuclear fusion, and – according to the International Astronomical Union but not all planetary scientists – has cleared its neighbouring region of planetesimals.The term planet is ancient, with ties to history, astrology, science, mytholog

Observe the missing space between "planetesimals." and "The" at the end of the first paragraph, which ends with "planetesimals.[b][1][2]" on the web page.

Later in the summary, at print(text[1200:1500]), there is a space between "discovered)." and "Ptolemy", as expected:

the scientific community are no longer viewed as such under the current definition. Some of the excluded objects include Ceres, Pallas, Juno, Vesta (all of which are objects in the solar asteroid belt), and Pluto (the first trans-Neptunian object discovered). Ptolemy thought that the planets orbite

Please let me know if any additional information is needed to fix this, or if there is a workaround.

martin-majlis commented 1 year ago

When I check the current status, I can see that the space is indeed missing in the API response. :/

The space is also missing in plain format - https://en.wikipedia.org/w/api.php?action=query&explaintext=1&exsectionformat=plain&prop=extracts&titles=Planet& - and in raw format - https://en.wikipedia.org/w/api.php?action=query&explaintext=1&exsectionformat=raw&prop=extracts&titles=Planet&
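
For anyone who wants to reproduce the comparison, the two URLs above can be assembled programmatically. This is only a sketch: the helper name build_extract_url is made up for illustration, and format=json is added as an assumption so that the response is easier to inspect. It builds the URL but does not send the request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, section_format="plain"):
    """Build an extracts query URL like the ones quoted above.

    build_extract_url is a hypothetical helper, not part of Wikipedia-API.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "exsectionformat": section_format,  # "plain" or "raw"
        "titles": title,
        "format": "json",  # assumption: added for a machine-readable response
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    for fmt in ("plain", "raw"):
        print(build_extract_url("Planet", fmt))
```

Fetching either URL and searching the extract for "planetesimals.The" should show whether the space is still missing upstream.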

Judging from the API documentation, it does not look like this can be resolved on the API side - https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bextracts

Since it does not look fixable, I am closing this issue. If you figure out how to bypass the problem, please feel free to reopen it.
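
One possible client-side workaround is to post-process the summary text. This is a heuristic sketch, not something the library provides: the function name fix_missing_spaces is hypothetical, and the rule (insert a space when sentence-ending punctuation is directly followed by a capital letter) is an assumption that may misfire on text such as domain names or camel-cased identifiers.

```python
import re

# Zero-width match between sentence-ending punctuation and a capital letter.
# The character before the punctuation must be a lowercase letter, a closing
# bracket, or a quote, so abbreviations such as "U.S.A" are left alone.
_MISSING_SPACE = re.compile(r'(?<=[a-z)\]"][.!?])(?=[A-Z])')


def fix_missing_spaces(text):
    """Insert a space where two sentences were joined without one.

    Hypothetical helper; a heuristic, not a guaranteed fix.
    """
    return _MISSING_SPACE.sub(" ", text)


if __name__ == "__main__":
    sample = "of planetesimals.The term planet is ancient"
    print(fix_missing_spaces(sample))
```

It would be applied as fix_missing_spaces(page.summary). Text that already has the space, such as "discovered). Ptolemy", is left unchanged because the punctuation is not directly followed by a capital letter.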
