Closed ManyTheFish closed 1 month ago
If i want to work on this, does it need to be assigned to me or just fork and start working?
Hey @Abastien1734, We do not assign people to issues; you can start working straight away! Happy to know you’re interested in the product. Thanks! 😁
@Abastien1734 are you still working on this or can I take over? 🙂
@muffpy you can take over
@ManyTheFish Hi, I was checking for the implementation part in the code, which you pointed out in the description of this issue. I was thinking of changes, something like this in the mod.rs file-
if s.chars().all(char::is_numeric) {
Some(s)
} else {
self.current = self.segmenter.segment_str(s);
self.next()
}
Please help me, if I am going in the right direction.
Hello @239yash, What you suggested could work. However, it's not sufficient for floating points; you could have dots in the provided string, but it wouldn't match your case. Could you make sure to add some tests in the codebase?
@ManyTheFish I will add the case for handling floating point numbers too, If the current approach doesn't work out. Let me add that, Will add the test cases also after implementation. Thanks!
@ManyTheFish
if s.parse::<f64>().is_ok() {
Some(s)
} else {
self.current = self.segmenter.segment_str(s);
self.next()
}
I have tested for floating point numbers, but it's not working out as the string "1234.5" is already broken in two pieces when passed to the above code block i.e. - "1234" and "5". This logic is working fine for pure integer strings. Can you help me with this?
Should I raise a PR for this, so that you can have a look into this?
One more thing, I noticed while running test cases for the given below languages, the test segmenter_segment_str is also failing.
segmenter::japanese::test::segmenter_segment_str
segmenter::khmer::test::segmenter_segment_str
segmenter::thai::test::segmenter_segment_str
The error printed is -
---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed:
Segmenter JapaneseSegmenter didn't segment the text as expected.
help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.
left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
The number "1234" is getting split into individual digit strings. Can you help me with what extra changes could be done for this? I tried a few things but failed in that.
Hello @239yash, Why don't you go for something simpler, like:
if s.chars().all(|c| c.is_numeric() || c.is_ponctuation()) {
You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code.
Let's say the .
is part of the separators, then the text 123.456
will be preprocessed as ["123", ".", "456"]
.
For the test case, it's expected that the digit characters will now be joined together.
Hello @ManyTheFish, I trying my luck on this first issue and it led me to 2 questions:
32.3
is expected to become ["32", ".", "3"]
.segmenter_segment_str
test is using AhoSegmentedStrIter
directly and not SegmentedStrIter
, so languages like Thai using the FST_SEGMENTER
won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue ?It looks like I got to the same point than @239yash, I will be happy to collaborate if they are still on this issue.
Hello @42plamusse,
Are the floating point numbers currently supported ? In the latin segmenter tests 32.3 is expected to become ["32", ".", "3"].
Yes, you're right, but the test uses the default separator set, including .
which separates the lemmas linked by it. However, this set is customizable, and .
could be removed from the list, meaning that it shouldn't be separated anymore, so it's possible to receive ["32.3"] but not by default.
The
segmenter_segment_str
test is usingAhoSegmentedStrIter
directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?
Yes you're right, the test macro should be updated to separate segmenter_segment_str
expected output from segment
. 🤔
In an another hand, we could remove numbers from the tests in the specialized tokenizer.
@ManyTheFish Hi, I have created a PR for this issue.
Instead of checking numbers in SegmentedStrIter
, I check them in AhoSegmentedStrIter
and return a Match
. Doing this way could reduce the amount of code changes because not every tests access SegmentedStrIter
.
I added 2 more tests for numbers for segmenter and tokenizer as well.
For checking number, I used the method proposed above.
s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation())
However, I don't think this is a very elegant way to do, because it can fail or give incorrect result in some cases (e.g 1e5, 1.2.3)
At first, I intended to use parse::64
but this can fail if the string is too long.
Hi! If @dqkqd's PR gets merged, I'll be happy to take it up further to handle the cases that they mentioned give incorrect results.
Description
The string
hello 95532387
is segmented as"hello", " ", "95532387"
where the stringสปริงเกียร์ร่วม 95532387
will be segmented as"สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7"
.Expected behavior
The number
95532387
should always be segmented as"95532387"
.How to solve the issue
Before calling a specialized segmenter during the segmentation process, we should check if the interleaved string only contains numbers, if not we call the specialized segmenter else we return directly
Some(s)
.Related