mozilla / readability

A standalone version of the readability lib

NHK Easy News: Ruby tags sometimes stripped, resulting in invalid text #575

Closed aehlke closed 4 years ago

aehlke commented 4 years ago

Example: https://www3.nhk.or.jp/news/easy/k10011959621000/k10011959621000.html

Screenshot of the first sentence of the original article: image

Screenshot of the resulting Readability output: image

Notice how "虐待", which originally has ruby, becomes "虐待ぎゃくたい". The output should retain the ruby markup. At worst, if the ruby must be stripped for some reason, the ruby text should not be left inline.
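
To make the difference concrete, here is a minimal sketch of the two ways of flattening a ruby annotation to plain text. The sample markup is an assumption modeled on the word above, and `stripAllTags` and `stripRubyAnnotations` are hypothetical helper names, not part of Readability. The regex-based stripping is for illustration only; a real implementation would work on the DOM.

```javascript
// Hypothetical sample markup (an assumption, not copied from the NHK page).
const rubySample = "<ruby>虐待<rt>ぎゃくたい</rt></ruby>";

// Naive flattening: remove every tag but keep all text,
// so the reading inside <rt> ends up inline after the base text.
function stripAllTags(html) {
  return html.replace(/<[^>]+>/g, "");
}

// Ruby-aware flattening: remove <rt> and <rp> together with their
// contents first, then remove the remaining tags.
function stripRubyAnnotations(html) {
  return html
    .replace(/<rt\b[^>]*>[\s\S]*?<\/rt>/gi, "")
    .replace(/<rp\b[^>]*>[\s\S]*?<\/rp>/gi, "")
    .replace(/<[^>]+>/g, "");
}

console.log(stripAllTags(rubySample));         // 虐待ぎゃくたい
console.log(stripRubyAnnotations(rubySample)); // 虐待
```

The first result matches the broken output described above; the second is the "at worst" fallback, where the base text survives without the reading.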

aehlke commented 4 years ago

@gijsk #579 does not resolve this - please reopen. This issue occurs regardless of javascript: links.

gijsk commented 4 years ago

@aehlke I updated the copy of readability in Firefox Nightly over the weekend. Having just checked on the current Nightly (https://nightly.mozilla.org/), the example from the screenshots in your original comment seems fixed to me. This makes sense, because the ruby tag that is disappearing there is inside <a href="javascript:void(0)" class="dicWin" id="id-0002">. If you think this is still broken (i.e. we're stripping ruby tags elsewhere), can you elaborate on where that is and what testcase to use?
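
As a rough illustration of why the handling of javascript: links matters here, the sketch below unwraps such an anchor while keeping its children, so that a nested ruby element survives. It is not Readability's actual code: `unwrapJavascriptLinks` is a hypothetical helper, the content inside the anchor is assumed, and the regex approach stands in for proper DOM manipulation.

```javascript
// The anchor tag is the one quoted above; the ruby content inside it is assumed.
const linkSample =
  '<a href="javascript:void(0)" class="dicWin" id="id-0002">' +
  "<ruby>虐待<rt>ぎゃくたい</rt></ruby></a>";

// Replace each <a href="javascript:..."> with a <span> that keeps the
// original inner markup, instead of collapsing the link to its text.
function unwrapJavascriptLinks(html) {
  return html.replace(
    /<a\b[^>]*\bhref\s*=\s*["']\s*javascript:[^"']*["'][^>]*>([\s\S]*?)<\/a>/gi,
    "<span>$1</span>"
  );
}

console.log(unwrapJavascriptLinks(linkSample));
// <span><ruby>虐待<rt>ぎゃくたい</rt></ruby></span>
```

Collapsing the link to its text content instead would discard the ruby and rt elements and leave the base text and its reading run together.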

aehlke commented 4 years ago

Sorry, you're right. I believe I was seeing this elsewhere without anchor tags. I'll confirm and file with a new example. Thanks for your help.