Numbers are not segmented the same way depending on the Script/Language

ManyTheFish commented 9 months ago

Description

The string hello 95532387 is segmented as "hello", " ", "95532387" where the string สปริงเกียร์ร่วม 95532387 will be segmented as "สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7".

Expected behavior

The number 95532387 should always be segmented as "95532387".

How to solve the issue

Before calling a specialized segmenter during the segmentation process, we should check if the interleaved string only contains numbers, if not we call the specialized segmenter else we return directly Some(s).

---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed: 
Segmenter JapaneseSegmenter didn't segment the text as expected.

help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.

  left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
 right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

The number "1234" is getting split into individual digit strings. Can you help me with what extra changes could be done for this? I tried a few things but failed in that.

ManyTheFish commented 7 months ago

Hello @239yash, Why don't you go for something simpler, like:

if s.chars().all(|c| c.is_numeric() || c.is_ponctuation()) {

You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code. Let's say the . is part of the separators, then the text 123.456 will be preprocessed as ["123", ".", "456"].

For the test case, it's expected that the digit characters will now be joined together.

42plamusse commented 6 months ago

Hello @ManyTheFish, I trying my luck on this first issue and it led me to 2 questions:

Are the floating point numbers currently supported ? In the latin segmenter tests 32.3 is expected to become ["32", ".", "3"].
The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue ?

It looks like I got to the same point than @239yash, I will be happy to collaborate if they are still on this issue.

ManyTheFish commented 6 months ago

Hello @42plamusse,

Are the floating point numbers currently supported ? In the latin segmenter tests 32.3 is expected to become ["32", ".", "3"].

Yes, you're right, but the test uses the default separator set, including . which separates the lemmas linked by it. However, this set is customizable, and . could be removed from the list, meaning that it shouldn't be separated anymore, so it's possible to receive ["32.3"] but not by default.

The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?

Yes you're right, the test macro should be updated to separate segmenter_segment_str expected output from segment. 🤔

In an another hand, we could remove numbers from the tests in the specialized tokenizer.

dqkqd commented 1 month ago

@ManyTheFish Hi, I have created a PR for this issue.

Instead of checking numbers in SegmentedStrIter, I check them in AhoSegmentedStrIter and return a Match. Doing this way could reduce the amount of code changes because not every tests access SegmentedStrIter. I added 2 more tests for numbers for segmenter and tokenizer as well.

For checking number, I used the method proposed above.

s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation())

However, I don't think this is a very elegant way to do, because it can fail or give incorrect result in some cases (e.g 1e5, 1.2.3) At first, I intended to use parse::64 but this can fail if the string is too long.

zaira-bibi commented 1 month ago

Hi! If @dqkqd's PR gets merged, I'll be happy to take it up further to handle the cases that they mentioned give incorrect results.

meilisearch / charabia