nicolas-raoul / jakaroma

Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
Apache License 2.0
63 stars 9 forks source link

Small tsu fixes #9

Closed malkazoid closed 4 years ago

malkazoid commented 4 years ago

Small tsu now becomes an exclamation mark in the romaji output, if the token was the last in the sequence.

Otherwise, small tsu merges the token it is found in with the following token, and causes the first letter of the following token to double in the romaji output.

malkazoid commented 4 years ago

Hi @nicolas-raoul - Please bear with me as this is my first time using github. You'll notice the comparison between the the master version of Jakaroma and the one proposed in this request shows no overlap, as if the one I wrote is completely new and different: it isn't. This occurred, I think, because I started out using Jakaroma as a download, rather than on pipe with github. So I already had a modified version of the class by the time I properly set up github locally in eclipse and started working with a local version of the Jakaroma repository. I cut and paste all of my modified class, over all of the existing Jakaroma class, not realising that this would cause the comparison to become useless. I guess I imagined it would still go line by line to determine which lines had actually changed, rather than deciding identical lines have changed simply because they have been pasted in! I hope this does not cause too much inconvenience, and I will know about this in the future.

I included a little test class so you can quickly run it and see the code in action.

nicolas-raoul commented 4 years ago

Looks good! Adding tests is a great idea, do not hesitate to add many test strings to test many patterns :-)

I noticed this problem:

$ ./jakaroma.sh ホッ
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: index -1,length 0
    at java.base/java.lang.String.checkIndex(String.java:3278)
    at java.base/java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:916)
    at java.base/java.lang.StringBuffer.deleteCharAt(StringBuffer.java:485)
    at fr.free.nrw.jakaroma.Jakaroma.main(Jakaroma.java:86)
malkazoid commented 4 years ago

Thanks Nicolas, I'll fix it, and move on to some of the other cases as well.

On Tue, Apr 14, 2020 at 4:41 AM Nicolas Raoul notifications@github.com wrote:

Looks good! Adding tests is a great idea, do not hesitate to add many test strings to test many patterns :-)

I noticed this problem:

$ ./jakaroma.sh ホッ

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: index -1,length 0

at java.base/java.lang.String.checkIndex(String.java:3278)

at java.base/java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:916)

at java.base/java.lang.StringBuffer.deleteCharAt(StringBuffer.java:485)

at fr.free.nrw.jakaroma.Jakaroma.main(Jakaroma.java:86)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nicolas-raoul/jakaroma/pull/9#issuecomment-613209376, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAI6YECK5VGPU4V3OH45P3RMPLPDANCNFSM4MHJYDWQ .

--

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

malkazoid commented 4 years ago

Hello @nicolas-raoul - I think I've correctly pushed the new commit to github, and it looks like it has inscribed the new commit under this same open pull request. Let me know if I did something wrong during that process. All small tsu substitution should be working now, although I'm sure there some unforeseen cases out there that we'll eventually run into... The ピッザ problem was a blatant and strange refusal by kuromoji to process that token - so it just passed on the kana without conversion. I've implemented a manual conversion character by character for those cases.

nicolas-raoul commented 4 years ago

Merged, thanks :-)

nicolas-raoul commented 4 years ago

By the way, for the ピッザ problem, if you think Kuromoji has a bug, then the best thing to do is to create an issue at https://github.com/atilika/kuromoji (they have disabled the "issues" feature of GitHub so the only way seems to be sending them an email at the address seen at the bottom of that page).