organicmaps / organicmaps

🍃 Organic Maps is a free Android & iOS offline maps app for travelers, tourists, hikers, and cyclists. It uses crowd-sourced OpenStreetMap data and is developed with love by MapsWithMe (MapsMe) founders and our community. No ads, no tracking, no data collection, no crapware. Please donate to support the development!
https://organicmaps.app
Apache License 2.0

Wikipedia text discards `&nbsp;` #8651

Open habi opened 1 month ago

habi commented 1 month ago

Wikipedia text for a feature discards non-breaking spaces (`&nbsp;`)

Screenshots: Organic Maps, the Wikipedia page, and the Wikipedia page source in edit mode.


newsch commented 1 month ago

Thanks for making this issue and investigating the cause! Looking at the article from the Wikimedia Enterprise API, it seems that `&nbsp;` is replaced by a span containing the literal non-breaking space character:

<p>... Before 1941, average salinity was approximately 50<span typeof="mw:Entity" id="mw1w"> </span>grams per liter (g/L) (compared to a value of 31.5 g/L for the world's oceans). In January 1982, when the lake reached its lowest level of <span ... id="mw2A">1,942 metres (6,372</span><span typeof="mw:Entity" about="#mwt165"> </span><span about="#mwt165">ft)</span>, the salinity had nearly doubled to 99 g/L...</p>

We remove all elements that are empty or whitespace-only after processing, which inadvertently removes this span and any spans created by the convert template.
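
To illustrate how a whitespace-only check can swallow such a span, here is a minimal Python sketch. It is not the project's actual code: the function names are hypothetical, and it only shows that a generic Unicode whitespace test also matches U+00A0, while a test restricted to ASCII whitespace keeps it.

```python
NBSP = "\u00a0"  # the non-breaking space character

# Hypothetical helper: treats any Unicode whitespace as insignificant.
# str.isspace() returns True for U+00A0, so an nbsp-only span looks "blank".
def is_blank_naive(text: str) -> bool:
    return text == "" or text.isspace()

# Hypothetical helper: only ASCII whitespace counts as insignificant,
# so a span holding just a non-breaking space is kept.
def is_blank_keep_nbsp(text: str) -> bool:
    return all(ch in " \t\n\r\f\v" for ch in text)

span_text = NBSP  # contents of <span typeof="mw:Entity">...</span>
print(is_blank_naive(span_text))      # True: the span would be removed
print(is_blank_keep_nbsp(span_text))  # False: the span would be kept
```

Whether other significant whitespace characters need the same treatment depends on the markup being processed; this sketch only covers the non-breaking space case.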

biodranik commented 1 month ago

@newsch if nbsp is replaced with a normal space in a span, can we detect it and process it properly? Maybe an issue should be filed against the Wikimedia Enterprise API?

newsch commented 1 month ago

There's still a non-breaking space, but instead of the `&nbsp;` entity it's the UTF-8 bytes `c2 a0`. GitHub seems to replace it with a normal space in comments.

In a hexdump of Mono_Lake.html, the literal nbsp is at 00010eba:

00010e80: 7479 2077 6173 2061 7070 726f 7869 6d61  ty was approxima
00010e90: 7465 6c79 2035 303c 7370 616e 2074 7970  tely 50<span typ
00010ea0: 656f 663d 226d 773a 456e 7469 7479 2220  eof="mw:Entity"
00010eb0: 6964 3d22 6d77 3177 223e c2a0 3c2f 7370  id="mw1w">..</sp
00010ec0: 616e 3e67 7261 6d73 2070 6572 206c 6974  an>grams per lit
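
The same check can be done without a hex viewer. The following Python sketch (the helper name is hypothetical, and the sample bytes are a shortened stand-in for the real file) encodes U+00A0 to UTF-8 and reports the byte offsets where it occurs:

```python
# U+00A0 encodes to the two bytes c2 a0 in UTF-8.
NBSP_UTF8 = "\u00a0".encode("utf-8")

# Hypothetical helper: return every byte offset at which a literal
# non-breaking space starts in the given data.
def find_nbsp_offsets(data: bytes) -> list[int]:
    offsets = []
    pos = data.find(NBSP_UTF8)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(NBSP_UTF8, pos + len(NBSP_UTF8))
    return offsets

# Shortened stand-in for the relevant part of the article HTML.
sample = 'id="mw1w">\u00a0</span>grams'.encode("utf-8")
print(NBSP_UTF8.hex())                                      # c2a0
print([f"{off:08x}" for off in find_nbsp_offsets(sample)])  # ['0000000a']
```

To run it against a real file, read it in binary mode, for example `find_nbsp_offsets(open("Mono_Lake.html", "rb").read())`, and compare the offsets with the hexdump.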

I have a fix that handles this and other significant whitespace-only elements; I'll push it shortly.

habi commented 1 month ago

> Thanks for making this issue and investigating the cause!

Thank you for looking into such a minor issue so quickly.