mistval / unofficial-jisho-api

Encapsulates the official Jisho.org API and also provides kanji, example, and stroke diagram search.
MIT License

Example sentence for word scrape senses occasionally misses pieces #39

Closed h7x4 closed 2 years ago

h7x4 commented 2 years ago

Jisho sometimes does not wrap parts of an example sentence, such as symbols (parentheses, punctuation, etc.), katakana names, and occasionally even ordinary words, in a <li class="clearfix"> tag. As a result, the parser skips over those parts.

An example search is 赤い, which yields the sentence

ヘレンはみんなにほめられて顔を赤くした

Formatted as

<ul class="japanese japanese_gothic clearfix" lang="ja">
    ヘレン
    <li class="clearfix"><span class="unlinked">は</span></li>
    <li class="clearfix"><span class="unlinked">みんな</span></li>
    <li class="clearfix"><span class="unlinked">に</span></li>
    <li class="clearfix"><span class="unlinked">ほめられて</span></li>
    <li class="clearfix"><span class="furigana">かお</span><span class="unlinked">顔</span></li>
    <li class="clearfix"><span class="unlinked">を</span></li>
    <li class="clearfix"><span class="furigana">あか</span><span class="unlinked">赤く</span></li>
    <li class="clearfix"><span class="unlinked">した</span></li>
    。
    <li class="english" lang="en">Helen blushed at their praise.</li>
</ul>

which makes the parser miss ヘレン and 。, since both sit directly inside the <ul> as bare text.

Some other searches that show the problem are かな, たら (Particle 2; this one even has kanji without furigana outside any clearfix element) and .
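
One possible way to handle this is to walk every child node of the <ul>, bare text included, rather than only the <li class="clearfix"> elements. The sketch below illustrates the idea in Python using only the standard library. It is not the library's actual parser (which is JavaScript), and the names `SentenceParser` and `parse_example` are hypothetical. It assumes the markup looks like the snippet above.

```python
from html.parser import HTMLParser


class SentenceParser(HTMLParser):
    """Collects sentence pieces from an example <ul>, including bare text nodes.

    Hypothetical helper for illustration; not part of unofficial-jisho-api.
    """

    def __init__(self):
        super().__init__()
        self.pieces = []       # list of (text, furigana or None), in document order
        self.english = None    # English translation, if present
        self._stack = []       # currently open tags as (tag, class attribute)
        self._furigana = None  # furigana waiting for the text it annotates

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag.
        while self._stack:
            if self._stack.pop()[0] == tag:
                break
        if tag == "li":
            self._furigana = None  # furigana never carries over to the next item

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return  # whitespace between tags, or text outside the <ul>
        tag, class_attr = self._stack[-1]
        classes = class_attr.split()
        if tag == "li" and "english" in classes:
            self.english = text
        elif tag == "span" and "furigana" in classes:
            self._furigana = text
        else:
            # Either wrapped text inside an <li>, or bare text directly under
            # the <ul> (the case the issue describes).
            self.pieces.append((text, self._furigana))
            self._furigana = None


def parse_example(html):
    """Return (pieces, english) for one example-sentence <ul>."""
    parser = SentenceParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.pieces, parser.english


EXAMPLE_HTML = """
<ul class="japanese japanese_gothic clearfix" lang="ja">
    ヘレン
    <li class="clearfix"><span class="unlinked">は</span></li>
    <li class="clearfix"><span class="unlinked">みんな</span></li>
    <li class="clearfix"><span class="unlinked">に</span></li>
    <li class="clearfix"><span class="unlinked">ほめられて</span></li>
    <li class="clearfix"><span class="furigana">かお</span><span class="unlinked">顔</span></li>
    <li class="clearfix"><span class="unlinked">を</span></li>
    <li class="clearfix"><span class="furigana">あか</span><span class="unlinked">赤く</span></li>
    <li class="clearfix"><span class="unlinked">した</span></li>
    。
    <li class="english" lang="en">Helen blushed at their praise.</li>
</ul>
"""

pieces, english = parse_example(EXAMPLE_HTML)
sentence = "".join(text for text, _ in pieces)
```

With this approach, `sentence` contains the full text including ヘレン and 。, while furigana stays attached only to the pieces that carry it. Whether the same node-walking strategy fits the library's existing scraper is an open question.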