openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Hatnote is erroneously moved to second paragraph along with infobox #182

Open Jaifroid opened 6 years ago

Jaifroid commented 6 years ago

Viewing the article "Peripheral neuropathy" in wikipedia_en_medicine_novid_2018-01.zim, the hatnote "Not to be confused with..." has been shifted to the second paragraph along with the infobox (see first screenshot). On Wikipedia mobile view this line is in the correct place just below the page title. I'm guessing this is a mwoffliner issue, as it's in the HTML that comes out of the ZIM.

It's a general issue, because it also occurs with the "Melatonin" article (see second screenshot).

image

image

kelson42 commented 6 years ago

@subbuss If I look at the Parsoid output, the hatnote is placed in the middle of the lead section (as in the ZIM): https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Melatonin. But this is not how the online version behaves, as @Jaifroid reported. This looks strange to me. Do we have a bug in Parsoid here, or do we need special handling?

subbuss commented 6 years ago

https://en.wikipedia.org/api/rest_v1/page/html/Peripheral_neuropathy shows it at the right place?

kelson42 commented 6 years ago

@subbuss for the desktop version yes, but not for the mobile. See https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Peripheral_neuropathy

mdholloway commented 6 years ago

For the currently deployed Wikimedia REST API mobile-sections endpoint, moving the initial paragraph of an article up top, even above hatnotes, is actually the intended behavior. This was driven by a design decision for the Wikipedia Android app, which was the endpoint's initial consumer.

In the next-generation (not-yet-active) version of mobile-sections, the hatnote is broken out into the response JSON structure for greater client flexibility.

kelson42 commented 6 years ago

@mdholloway thx. So basically, in the future version the hatnote will be removed from the article HTML and will only be available as metadata (as it already is currently)?

mdholloway commented 6 years ago

@kelson42 Yes, that's correct. Hatnotes will be broken out like this: https://gist.github.com/mdholloway/5010f7c4f737cd3262288563d643240a#file-resp-txt-L27-L29
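If hatnotes do arrive as separate metadata rather than inside the article HTML, a client would have to put them back at the top itself. The following is only a minimal sketch of that idea: the dictionary keys `hatnotes` and `html` are hypothetical placeholders, not the confirmed field names of the mobile-sections response, and it assumes each hatnote is delivered as plain text.

```python
from html import escape


def lead_html_with_hatnotes(lead):
    """Prepend hatnotes, delivered as metadata, to the lead section HTML.

    `lead` is a dict with two hypothetical keys (assumptions, not the real API):
      "hatnotes": list of plain-text hatnote strings (may be missing or empty)
      "html":     the lead section HTML with hatnotes already stripped out
    """
    notes = lead.get("hatnotes") or []
    # Escape the text so a stray "<" or "&" in a hatnote cannot break the markup.
    rendered = "".join(f'<div class="hatnote">{escape(note)}</div>' for note in notes)
    return rendered + lead.get("html", "")
```

With this approach the client decides where hatnotes go, so they can always sit directly under the page title regardless of how the lead paragraph and infobox are ordered.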

Jaifroid commented 6 years ago

Just to add that the same issue affects notes (in English Wikivoyage) of the type "For other places with the same name, see Paris (disambiguation)" (this is from the Paris article). However, this note isn't labelled "hatnote", it is inside a <dl> structure and it is identified with a CSS class of "noexcerpt". This is what the HTML looks like:

<dl>
  <dd>
    <span class="noexcerpt"><i>For other places with the same name, see 
      <a href="Paris_(disambiguation).html" title="Paris (disambiguation)">
      Paris (disambiguation)</a>.</i>
    </span>
  </dd>
</dl>

And this is the screenshot:

image

Jaifroid commented 6 years ago

Just a quick note to say this also affects French Wikipedia / WikiMed, which has some "hatnote" equivalents that are rendered as <div class="homonymie" ...>. The screenshot shows one of these. It seems that whatever code moves the infoboxes down below the lead paragraph (in order to produce the mobile style) is accidentally but systematically moving hatnotes along with the infobox. Probably a regex that is not specific enough.

image
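The three markers mentioned so far in this thread ("hatnote", "noexcerpt", "homonymie") could be matched by CSS class rather than by a broad pattern. Below is a rough sketch using only Python's standard `html.parser`. The class list is an assumption taken from the examples above and is certainly incomplete for other wikis, and `find_hatnotes` is a hypothetical helper, not anything that exists in mwoffliner.

```python
from html.parser import HTMLParser

# Assumption: this class list comes only from the examples in this thread.
HATNOTE_CLASSES = {"hatnote", "noexcerpt", "homonymie"}


class HatnoteFinder(HTMLParser):
    """Collect the text of elements whose class marks them as hatnote-like."""

    # Void elements never get an end tag, so they must not affect nesting depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # > 0 while inside a hatnote-like element
        self.parts = []     # text chunks of the hatnote being read
        self.hatnotes = []  # finished hatnote texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth or classes & HATNOTE_CLASSES:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace left over from the source indentation.
            self.hatnotes.append(" ".join("".join(self.parts).split()))
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def find_hatnotes(html):
    """Return the text of every hatnote-like element found in `html`."""
    finder = HatnoteFinder()
    finder.feed(html)
    finder.close()
    return finder.hatnotes
```

Matching on whole class names keeps the selection narrow: an infobox is only picked up if it carries one of the listed classes, which is the kind of specificity a loose regex would lack.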

bradyhunsaker commented 6 years ago

@kelson42 I think this one gets the Parsoid tag, since my reading is that there are no plans to change mwoffliner based on the current Parsoid output. (It looks like that would be an awkward hack to even try.)

Jaifroid commented 6 years ago

Just a quick update to note that I'm still getting this error on recent ZIMs. Example below is from wikipedia_en_maths_novid_2018-06.zim, article "Series (mathematics)". I realize we may well be waiting on a change in Parsoid, but it would be good to prevent this issue from being put on the back burner...

image

kelson42 commented 5 years ago

Looks like hatnotes are completely removed now... Not sure if this is good or bad, but this problem no longer occurs.

Jaifroid commented 5 years ago

It may not be labelled "hatnote", but in wikivoyage_en_all_novid_2019-07.zim (made very recently), we still have a similar problem:

image

Do you want me to make a new issue for it, @kelson42, or re-open this one?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Jaifroid commented 4 years ago

Just to say that this issue persists in wikipedia_en_all_maxi_2020-06.zim:

image

Jaifroid commented 3 years ago

Just to keep this alive, issue still persists. In Kiwix JS Windows / PWA, I currently reposition these misplaced hatnotes, redirects and other "not to be confused with" notes, though it's a little tricky because they're not always easy to identify programmatically, and I only test in English, Spanish, German, occasionally French...

image
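To illustrate the kind of reader-side repositioning described here, the sketch below reorders an already-classified lead section. It is not the Kiwix JS code: the `(kind, html)` tuples and the kind labels `"hatnote"`, `"paragraph"` and `"infobox"` are hypothetical, and the hard part noted above, reliably classifying blocks across languages, is assumed to have been done beforehand.

```python
def mobile_lead_order(blocks):
    """Reorder lead-section blocks: hatnotes, first paragraph, then the rest.

    `blocks` is a list of (kind, html) tuples in their current order, where
    `kind` is a hypothetical label such as "hatnote", "paragraph" or "infobox".
    The relative order within each group is preserved.
    """
    hatnotes = [b for b in blocks if b[0] == "hatnote"]
    rest = [b for b in blocks if b[0] != "hatnote"]

    # Index of the first real paragraph, which the mobile style puts on top.
    first_para = next((i for i, b in enumerate(rest) if b[0] == "paragraph"), None)
    if first_para is None:
        return hatnotes + rest

    lead = rest[first_para]
    remainder = rest[:first_para] + rest[first_para + 1:]
    return hatnotes + [lead] + remainder
```

The same function handles both the desktop order (hatnote, infobox, paragraph) and the misplaced order seen in the ZIMs (paragraph, hatnote, infobox), since hatnotes are pulled out before anything else is moved.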

kelson42 commented 3 years ago

Bug is still there http://library.kiwix.org/wikipedia_en_medicine/A/Peripheral_neuropathy

Jaifroid commented 3 years ago

The issue persists. Also, the infobox is poorly rendered at desktop screen sizes: it should be right-aligned and narrower on large screens. (On narrow screens, Wikipedia's mobile view does not render it.)

I try to fix these display issues in Kiwix JS Windows (see screenshot far bottom). I know it's hacky, and not really the reader's job, but it seems this issue can't be fixed in the ZIM, and won't be fixed by Parsoid either...

image

2021-09-19 (3)
