scriptin / jmdict-simplified

JMdict, JMnedict, Kanjidic, KRADFILE/RADKFILE in JSON format
Creative Commons Attribution Share Alike 4.0 International

xref element in JMdict sometimes contains a reb with JIS centre-dots #8

Closed luke-c closed 6 years ago

luke-c commented 6 years ago

Whilst parsing the xref field in my own parser, I noticed a problem with the xref field in the original XML file.

The JIS centre-dot '・' is used to separate the components of an xref, but some reb values contain that centre-dot themselves, so you get xrefs like:

<xref>ブロードノーズ・セブンギル・シャーク</xref>
<xref>イエローテール・スターリー・ラビットフィッシュ</xref>

From my short investigations it seems like it is only these two xrefs which have this problem.

Splitting these on the centre-dot gives you a list of three strings, when it should be a list containing a single string.
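To make the failure concrete, here is a minimal Python sketch of the naive approach. The function name `naive_split_xref` is made up for illustration and is not taken from jmdict-simplified or any other parser.

```python
# Centre-dot (nakaguro) used as the separator inside <xref> values.
CENTRE_DOT = "・"


def naive_split_xref(xref: str) -> list[str]:
    """Split an xref on every centre-dot, with no special cases (hypothetical helper)."""
    return xref.split(CENTRE_DOT)


# This xref points at a single reb, but the naive split breaks it into pieces.
parts = naive_split_xref("ブロードノーズ・セブンギル・シャーク")
print(parts)
```

The result has three elements, none of which is a valid target on its own.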

I have contacted Jim Breen, the author of JMdict. In the meantime, the solution is to hard-code a check for these two xrefs and return them as is instead of splitting them on the centre-dot, as each one refers to a single reb.
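The hard-coded check could look roughly like the following Python sketch. The names `UNSPLITTABLE_XREFS` and `split_xref` are assumptions made for this example, not part of any existing parser, and the set contains only the two xrefs reported in this issue.

```python
# Centre-dot (nakaguro) used as the separator inside <xref> values.
CENTRE_DOT = "・"

# The two xref values from this issue whose reb itself contains centre-dots.
# Hypothetical constant: extend it if more such xrefs turn up.
UNSPLITTABLE_XREFS = frozenset({
    "ブロードノーズ・セブンギル・シャーク",
    "イエローテール・スターリー・ラビットフィッシュ",
})


def split_xref(xref: str) -> list[str]:
    """Split an xref into its components, leaving the known exceptions whole."""
    if xref in UNSPLITTABLE_XREFS:
        # The whole string is a single reb, so return it unsplit.
        return [xref]
    return xref.split(CENTRE_DOT)
```

Any other xref is still split as before; for instance, an illustrative value such as `"漢字・かんじ"` would come back as two components.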

luke-c commented 6 years ago

This has been fixed in the source XML files by the author as of today. The xref target of those two is now the non-dotted version of the target entry.

A comment has also been added to the DTD telling contributors not to use a target keb/reb with a nakaguro (centre-dot) in it.

The only change needed on your side is to regenerate the JSON files.