tshatrov / ichiran

Linguistic tools for texts in Japanese language
MIT License

Additional data getting inserted into json results #22

Closed by molesquirrel 2 years ago

molesquirrel commented 2 years ago

Test string: それはすごいね

Command used: ./ichiran-cli -f "それはすごいね" (using the 202201 dictionary dump)

Results:

[[[[["sore wa",{"reading":"\u305D\u308C\u306F","text":"\u305D\u308C\u306F","kana":"\u305D\u308C\u200B\u200C\u306F","score":144,"seq":2134680,"gloss":[{"pos":"[adv]","gloss":"very; extremely"},{"pos":"[exp]","gloss":"that is"}],"conj":[]},[]],["sugoi",{"reading":"\u3059\u3054\u3044","text":"\u3059\u3054\u3044","kana":"\u3059\u3054\u3044","score":144,"seq":1374550,"gloss":[{"pos":"[adj-i]","gloss":"terrible; dreadful"},{"pos":"[adj-i]","gloss":"amazing (e.g. of strength); great (e.g. of skills); wonderful; terrific"},{"pos":"[adj-i]","gloss":"to a great extent; vast (in numbers)"},{"pos":"[adv]","gloss":"awfully; very; immensely"}],"conj":[]},[]],["ne",{"reading":"\u306D","text":"\u306D","kana":"\u306D","score":16,"seq":2029080,"gloss":[{"pos":"[prt]","gloss":"right?; isn't it?; doesn't it?; don't you?; don't you think?","info":"at sentence end; used as a request for confirmation or agreement"},{"pos":"[int]","gloss":"hey; say; listen; look; come on"},{"pos":"[prt]","gloss":"you know; you see; I must say; I should think","info":"at sentence end; used to express one's thoughts or feelings"},{"pos":"[prt]","gloss":"will you?; please","info":"at sentence end; used to make an informal request"},{"pos":"[prt]","gloss":"so, ...; well, ...; you see, ...; you understand?","info":"at the end of a non-final clause; used to draw the listener's attention to something"},{"pos":"[prt]","gloss":"I'm not sure if ...; I have my doubts about whether ...","info":"at sentence end after the question marker \u304B"}],"conj":[]},[]]],304]]]

More specifically, note the values of the first term: "reading":"\u305D\u308C\u306F", "text":"\u305D\u308C\u306F", "kana":"\u305D\u308C\u200B\u200C\u306F"

The 3rd and 4th characters of the kana value are invisible in this text box, but they do show up when the JSON is viewed on a site such as https://jsonformatter.org/json-pretty-print

They appear to be a "zero width space" (U+200B) and a "zero width non-joiner" (U+200C).
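One way to confirm which invisible characters are present is to decode the JSON and print each character's code point and Unicode name. The following is a minimal Python sketch, not part of ichiran; the `fragment` string is just the kana field quoted above, and `describe` is a hypothetical helper name chosen for illustration.

```python
import json
import unicodedata

# The kana field from the output above, with its JSON escapes kept as-is.
fragment = '{"kana":"\\u305D\\u308C\\u200B\\u200C\\u306F"}'
kana = json.loads(fragment)["kana"]


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(kana):
    print(code_point, name)
```

The third and fourth lines printed are `U+200B ZERO WIDTH SPACE` and `U+200C ZERO WIDTH NON-JOINER`, sitting between the hiragana letters RE and HA.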

You noted in another post that you're not working actively on the project at the moment, but I wanted to note this in case you have a chance to look at it!

tshatrov commented 2 years ago

That is by design, to allow correct romanization of "sore wa" instead of "soreha". These are 'hints' embedded in kana, and currently there are only 2 characters used for this: https://github.com/tshatrov/ichiran/blob/master/dict-split.lisp#L785-L786

If you use the :kana romanization method, they are stripped from the 'result' text (where your example says "sore wa", it would say "それは" instead). You can also strip these 2 characters from the kana text manually.
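For anyone consuming the JSON from another program, the manual stripping could look roughly like the Python sketch below. It assumes the hint characters are exactly the two seen in this thread (U+200B and U+200C) and that they only need removing from values stored under a "kana" key; `strip_hints` and `clean_kana` are hypothetical helper names, not part of ichiran.

```python
import json

# Assumption: these are the only two hint characters, as stated in the thread.
HINT_CHARS = "\u200b\u200c"
_STRIP_TABLE = str.maketrans("", "", HINT_CHARS)


def strip_hints(text):
    """Remove the zero-width hint characters from a kana string."""
    return text.translate(_STRIP_TABLE)


def clean_kana(node):
    """Recursively strip hint characters from every "kana" string value."""
    if isinstance(node, dict):
        return {
            key: strip_hints(value)
            if key == "kana" and isinstance(value, str)
            else clean_kana(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [clean_kana(item) for item in node]
    return node


# A shortened stand-in for one entry of the ichiran-cli output.
raw = '{"reading":"\\u305D\\u308C\\u306F","kana":"\\u305D\\u308C\\u200B\\u200C\\u306F"}'
cleaned = clean_kana(json.loads(raw))
print(cleaned["kana"] == cleaned["reading"])
```

Because `clean_kana` walks nested lists and dictionaries, the whole parsed result of `ichiran-cli -f` can be passed to it in one call. Keep the original kana around if you still need the romanization hints.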