nicolas-raoul / jakaroma

Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
Apache License 2.0

Final small tsu ッ not transliterated #1

Open nicolas-raoul opened 8 years ago

eadmaster commented 8 years ago

I've come up with a workaround for this that consists of merging 2 consecutive tokens into 1. I'm going to send you a pull request for this too!

Btw, a small tsu at the end of the word may also indicate an exclamation mark.

nicolas-raoul commented 8 years ago

Thanks :-)

malkazoid commented 4 years ago

Hello!

Thank you for building this tool!

I am running into instances of tokens that either are:

The source for the jakaroma class has a variable that does not seem to get used in the end: lastTokenToMerge?

Do you have any suggestions or thoughts? I'm a beginner with Java so I'm not sure how much I can contribute, but if you point me in the right direction, I'm happy to try and push things ahead a bit. For now I've taken the stopgap approach of creating an ArrayList of exceptions to the "サ変接続" classification. It has to be extended manually as these occurrences arise; the listed tokens then get correctly converted and inserted into the romaji string buffer. Probably not the best way forward, but it makes me feel like I'm making some sort of progress each time there is a problem with it :)

For the small tsu issue, it looks like someone had started to implement a fix, but the code doesn't actually merge the token ending with small tsu with the next one (if I'm understanding the intent correctly). Was this 'lastTokenToMerge' variable supposed to be evaluated by another if clause, that tells the next token to prepend it to itself (and I imagine, double the first consonant)? I'm going to implement that here for myself but wanted to make sure I had understood your intent?
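To make the merge idea above concrete, here is a minimal sketch, not the actual jakaroma code. It assumes the tokenizer hands back one katakana reading per token (for example モラッ followed by タ, which is an assumed split), and it glues any reading that ends in ッ onto the reading that follows it. The class name `SokuonMerge` and the method `mergeTrailingSokuon` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jakaroma.
class SokuonMerge {
    static final char SMALL_TSU = 'ッ';

    // Joins every reading that ends in a small tsu with the reading after it,
    // so that the doubling can later be resolved inside a single string.
    static List<String> mergeTrailingSokuon(List<String> readings) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String reading : readings) {
            pending.append(reading);
            boolean endsWithSmallTsu = !reading.isEmpty()
                    && reading.charAt(reading.length() - 1) == SMALL_TSU;
            if (!endsWithSmallTsu) {
                // Nothing left to wait for: emit what has been accumulated.
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        // A small tsu on the very last token has no successor; keep it as-is.
        if (pending.length() > 0) {
            merged.add(pending.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed token split for もらった; prints [モラッタ]
        System.out.println(mergeTrailingSokuon(Arrays.asList("モラッ", "タ")));
    }
}
```

Merging before conversion keeps the consonant doubling local to one string, so the conversion step does not need to know about token boundaries.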

Thanks again for making this tool!

nicolas-raoul commented 4 years ago

@malkazoid Thanks for the feedback! Unfortunately I don't remember much of the code and have other very busy projects, but I am looking forward to your pull requests :-)

nicolas-raoul commented 4 years ago

I just downloaded the tool and tested it a bit; indeed, the behavior is very broken:
もらった returns Ta whereas it should return Moratta, which by the way means that the っ needs to look at the next letter and double it.
誕生 returns 誕生 whereas it should return Tanjo- or similar.
誕生日 returns 誕生Bi whereas it should return Tanjo-bi or similar.
すごっ returns Sugo, which is not bad; Sugo! would be good too, I guess.
ピッザ returns ピッザ whereas it should return Pizza.
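As a rough illustration of the "look at the next letter and double it" rule for the small tsu cases, here is a sketch that works on katakana readings. It is not jakaroma's implementation: the class `SokuonRomaji`, the method `toRomaji`, and the tiny kana table (only the kana needed for these examples) are assumptions made for the example, and the output is left in lowercase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of jakaroma.
class SokuonRomaji {
    // Deliberately tiny table: just enough kana for the examples in this thread.
    static final Map<Character, String> KANA = new HashMap<>();
    static {
        KANA.put('モ', "mo");
        KANA.put('ラ', "ra");
        KANA.put('タ', "ta");
        KANA.put('ス', "su");
        KANA.put('ゴ', "go");
        KANA.put('ピ', "pi");
        KANA.put('ザ', "za");
    }

    static String toRomaji(String katakana) {
        StringBuilder out = new StringBuilder();
        boolean doubleNext = false;
        for (int i = 0; i < katakana.length(); i++) {
            char c = katakana.charAt(i);
            if (c == 'ッ') {
                // Remember the small tsu; it is resolved by the next kana.
                doubleNext = true;
                continue;
            }
            String romaji = KANA.get(c);
            if (romaji == null) {
                // Unknown character: pass it through unchanged.
                out.append(c);
            } else {
                if (doubleNext) {
                    // Double the first consonant of the following syllable.
                    // (Hepburn writes ッチ as "tchi"; that case is not handled here.)
                    out.append(romaji.charAt(0));
                }
                out.append(romaji);
            }
            doubleNext = false;
        }
        // A small tsu at the very end is dropped; appending "!" would be an alternative.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRomaji("モラッタ")); // moratta
        System.out.println(toRomaji("ピッザ"));   // pizza
        System.out.println(toRomaji("スゴッ"));   // sugo
    }
}
```

The 誕生 cases are a separate problem (the kanji are returned unconverted), so this sketch does not address them.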

malkazoid commented 4 years ago

Great, we're on the same page. I'll fix as much of this as I can and put in a pull request.

Thx!

