twitter / twitter-cldr-rb

Ruby implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more.
Apache License 2.0

Breaking by word a string containing Japanese and Latin characters #260

Closed edouard closed 2 years ago

edouard commented 2 years ago

Describe the bug

We’re using the each_word method of TwitterCldr::Segmentation::BreakIterator to count words in multiple languages. We just got an exception for a Japanese string that contains both Japanese and Latin characters. Such strings are common, for instance when Western brand names appear in Japanese text.

To Reproduce

Steps to reproduce the behavior:

string = 'TWITTERド'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)

Also, this string works:

string = 'WINDYのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> WINDY
の
アカウントを作成する
 => #<Enumerator: ...> 

Interestingly enough, taking that string above and replacing WINDY with TWITTER doesn’t work 🤔:

string = 'TWITTERのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)

Expected behavior

The BreakIterator shouldn't raise an exception for strings that mix Japanese and Latin characters.

Environment

ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-darwin21]

edouard commented 2 years ago

Answering my own questions here...

Interestingly enough, taking that string above and replacing WINDY with TWITTER doesn’t work 🤔:

string = 'TWITTERのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)

It seems to be due to the length of the Latin word. Shortening TWITTER to TWITT makes the same string work:

string = 'TWITTのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> TWITT
の
アカウントを作成する
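
One way to narrow down the length at which the failure starts is to try prefixes of the Latin word of increasing length. The sketch below is only an illustration: `first_failing_length` is a hypothetical helper, not part of TwitterCldr, and it takes the segmentation call as a block so that it can be run without the gem installed.

```ruby
# Hypothetical helper (not part of TwitterCldr): returns the shortest prefix
# length of `latin` for which the given block raises, or nil if no prefix fails.
def first_failing_length(latin, suffix)
  (1..latin.length).find do |len|
    candidate = latin[0, len] + suffix
    begin
      yield candidate
      false # the block completed, so this length is fine
    rescue StandardError
      true  # the block raised (ArgumentError is a StandardError)
    end
  end
end

# Usage with the real iterator (requires the twitter_cldr gem):
#
#   iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
#   first_failing_length('TWITTER', 'のアカウントを作成する') do |s|
#     iterator.each_word(s) { |word| word }
#   end
```

With the examples in this issue, TWITT (5 characters) works and TWITTER (7 characters) fails, so the helper would be expected to report 6 or 7 on an affected version.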

The error looks related to how this code handles the length of the Latin word:

https://github.com/twitter/twitter-cldr-rb/blob/09a1db07bf68b397e482d90290b0bb886ee076e1/lib/twitter_cldr/segmentation/cj_break_engine.rb#L149-L155

camertron commented 2 years ago

Hey @edouard, thanks for reporting this. Please see #261 for fix details. The fix has been published in v6.11.4.
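
For anyone else hitting this, a version constraint along these lines should pick up the fix. This is a minimal sketch that assumes the project manages its gems with Bundler; the gem name `twitter_cldr` is taken from the path in the stack trace above.

```ruby
# Gemfile (assumes Bundler; adjust the constraint to your project's policy)
gem 'twitter_cldr', '>= 6.11.4'
```

Running `bundle update twitter_cldr` afterwards would then resolve to a release that includes the fix.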

edouard commented 2 years ago

Cool! Thanks for fixing it so quickly! 👍🏽