meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License
261 stars 89 forks source link

latin-camelcase feature make wrong segmentation #289

Closed hamano closed 3 weeks ago

hamano commented 7 months ago

The default featre has issue with proper noun segmentation like OpenSSL.

main.rs:

use std::env;
use charabia::Segment;

fn main() {
    let arg = env::args().nth(1).unwrap();
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments)
}

default feature:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

disable default feature

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl
ManyTheFish commented 5 months ago

Hello @hamano, What do you expect in terms of segmentation?

Thank you!

hamano commented 5 months ago

It is ideal that proper nouns are not split, but if it is not in the dictionary, it can't avoided that OpenSSL would be split as Open|SSL. Here, I recognize the issue as splitting it as Open|S|S|L. The word SSL would disappear.

hamano commented 5 months ago

The inability to segment words not found in the dictionary was a characteristic in Japanese. Please ignore that. In any case, I believe it is desirable that 'OpenSSL' not be segment, even in Latin languages

hamano commented 5 months ago

https://github.com/CloudCannon/pagefind/issues/591

hamano commented 5 months ago

The word open_ssl is not common, but for example, in the case of a module like apache mod_ssl, it is expected not to be segmented.

ManyTheFish commented 5 months ago

Hello @hamano, I think the better way to solve this issue is to add the word SSL to the tokenizer dictionary. It seems to be a specific case more than a generality.

hamano commented 5 months ago

I'm concerned not just about the term "OpenSSL" but about countless similar terms.

OpenVPN
OpenBSD
OpenJDK
OpenCV
FreeRADIUS
FreeBSD
PostgreSQL
MySQL
MongoDB

I don't think all these terms should be added to the dictionary. New terms emerge one after another.

hamano commented 5 months ago

If you say this is the expected behavior of the latin-camelcase feature, then that's fine. I'll disable it. However, it's too inconvenient as a default feature, so I reported it, especially for use in technical documentation.

ManyTheFish commented 5 months ago

Understood, we may disable it from the default features, thinking of it, this segmentation is mainly used for the cases you are describing, so if you want to change the behavior of the segmenter, I would gladly accept a new PR. Moreover, the change seems easy, maybe replacing the highlighted code block by:

        let should_group = if last_char_was_lowercase && char.is_letter_uppercase() {
            false
        } else {
            true
        };

        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

or even

        let should_group = !(last_char_was_lowercase && char.is_letter_uppercase());
        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

Should solve your issue ☺️

However, the word OpenSSLError would be segmented as ["Open", "SSLError"] with this change, it would need a bit more efforts to make it segment like ["Open", "SSL", "Error"]. 🤔

hamano commented 5 months ago

I am unsure how "OpenSSLError" should be segment. Perhaps in the context of a program constant name appearing in documentation, it is expected not to be segment. Anyway, I don't understand the use case for the latin-camelcase feature. So, I will simply disable it. What I wanted to report here is that the excessive segmentation of "OpenSSL" into "Open|S|S|L" is likely not expected by anyone.

ManyTheFish commented 5 months ago

Hey @hamano, no problem. I just wanted to say that if you want to use the feature, you can create a PR enhancing the behavior and I will accept it. See you!