meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License
261 stars 89 forks source link

Numbers are not segmented the same way depending on the Script/Language #271

Closed ManyTheFish closed 1 month ago

ManyTheFish commented 9 months ago

Description

The string hello 95532387 is segmented as "hello", " ", "95532387" where the string สปริงเกียร์ร่วม 95532387 will be segmented as "สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7".

Expected behavior

The number 95532387 should always be segmented as "95532387".

How to solve the issue

Before calling a specialized segmenter during the segmentation process, we should check if the interleaved string only contains numbers, if not we call the specialized segmenter else we return directly Some(s).

Related


Hey! 👋 Before starting any implementation, make sure that you read the CONTRIBUTING.md file. In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer. Thanks a lot for your Contribution! 🤝

Abastien1734 commented 9 months ago

If i want to work on this, does it need to be assigned to me or just fork and start working?

irevoire commented 9 months ago

Hey @Abastien1734, We do not assign people to issues; you can start working straight away! Happy to know you’re interested in the product. Thanks! 😁

muffpy commented 8 months ago

@Abastien1734 are you still working on this or can I take over? 🙂

Abastien1734 commented 8 months ago

@muffpy you can take over

239yash commented 8 months ago

@ManyTheFish Hi, I was checking for the implementation part in the code, which you pointed out in the description of this issue. I was thinking of changes, something like this in the mod.rs file-

                if s.chars().all(char::is_numeric) {
                    Some(s)
                } else {
                    self.current = self.segmenter.segment_str(s);
                    self.next() 
                }

Please help me, if I am going in the right direction.

ManyTheFish commented 8 months ago

Hello @239yash, What you suggested could work. However, it's not sufficient for floating points; you could have dots in the provided string, but it wouldn't match your case. Could you make sure to add some tests in the codebase?

239yash commented 8 months ago

@ManyTheFish I will add the case for handling floating point numbers too, If the current approach doesn't work out. Let me add that, Will add the test cases also after implementation. Thanks!

239yash commented 7 months ago

@ManyTheFish

 if s.parse::<f64>().is_ok() {
                        Some(s)
                    } else {
                        self.current = self.segmenter.segment_str(s);
                        self.next()
                    }

I have tested for floating point numbers, but it's not working out as the string "1234.5" is already broken in two pieces when passed to the above code block i.e. - "1234" and "5". This logic is working fine for pure integer strings. Can you help me with this?

Should I raise a PR for this, so that you can have a look into this?

One more thing, I noticed while running test cases for the given below languages, the test segmenter_segment_str is also failing.

segmenter::japanese::test::segmenter_segment_str
segmenter::khmer::test::segmenter_segment_str
segmenter::thai::test::segmenter_segment_str

The error printed is -

---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed: 
Segmenter JapaneseSegmenter didn't segment the text as expected.

help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.

  left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
 right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

The number "1234" is getting split into individual digit strings. Can you help me with what extra changes could be done for this? I tried a few things but failed in that.

ManyTheFish commented 7 months ago

Hello @239yash, Why don't you go for something simpler, like:

if s.chars().all(|c| c.is_numeric() || c.is_ponctuation()) {

You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code. Let's say the . is part of the separators, then the text 123.456 will be preprocessed as ["123", ".", "456"].

For the test case, it's expected that the digit characters will now be joined together.

42plamusse commented 6 months ago

Hello @ManyTheFish, I trying my luck on this first issue and it led me to 2 questions:

It looks like I got to the same point than @239yash, I will be happy to collaborate if they are still on this issue.

ManyTheFish commented 6 months ago

Hello @42plamusse,

Are the floating point numbers currently supported ? In the latin segmenter tests 32.3 is expected to become ["32", ".", "3"].

Yes, you're right, but the test uses the default separator set, including . which separates the lemmas linked by it. However, this set is customizable, and . could be removed from the list, meaning that it shouldn't be separated anymore, so it's possible to receive ["32.3"] but not by default.

The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?

Yes you're right, the test macro should be updated to separate segmenter_segment_str expected output from segment. 🤔

In an another hand, we could remove numbers from the tests in the specialized tokenizer.

dqkqd commented 1 month ago

@ManyTheFish Hi, I have created a PR for this issue.

Instead of checking numbers in SegmentedStrIter, I check them in AhoSegmentedStrIter and return a Match. Doing this way could reduce the amount of code changes because not every tests access SegmentedStrIter. I added 2 more tests for numbers for segmenter and tokenizer as well.

For checking number, I used the method proposed above.

s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation())

However, I don't think this is a very elegant way to do, because it can fail or give incorrect result in some cases (e.g 1e5, 1.2.3) At first, I intended to use parse::64 but this can fail if the string is too long.

zaira-bibi commented 1 month ago

Hi! If @dqkqd's PR gets merged, I'll be happy to take it up further to handle the cases that they mentioned give incorrect results.