Closed hamano closed 3 weeks ago
Hello @hamano, What do you expect in terms of segmentation?
Thank you!
It is ideal that proper nouns are not split, but if it is not in the dictionary, it can't avoided that
Here, I recognize the issue as splitting it as OpenSSL
would be split as Open|SSL
.Open|S|S|L
. The word SSL
would disappear.
The inability to segment words not found in the dictionary was a characteristic in Japanese. Please ignore that. In any case, I believe it is desirable that 'OpenSSL' not be segment, even in Latin languages
The word open_ssl
is not common, but for example, in the case of a module like apache mod_ssl
, it is expected not to be segmented.
Hello @hamano, I think the better way to solve this issue is to add the word SSL to the tokenizer dictionary. It seems to be a specific case more than a generality.
I'm concerned not just about the term "OpenSSL" but about countless similar terms.
OpenVPN
OpenBSD
OpenJDK
OpenCV
FreeRADIUS
FreeBSD
PostgreSQL
MySQL
MongoDB
I don't think all these terms should be added to the dictionary. New terms emerge one after another.
If you say this is the expected behavior of the latin-camelcase feature, then that's fine. I'll disable it. However, it's too inconvenient as a default feature, so I reported it, especially for use in technical documentation.
Understood, we may disable it from the default features, thinking of it, this segmentation is mainly used for the cases you are describing, so if you want to change the behavior of the segmenter, I would gladly accept a new PR. Moreover, the change seems easy, maybe replacing the highlighted code block by:
let should_group = if last_char_was_lowercase && char.is_letter_uppercase() {
false
} else {
true
};
last_char_was_lowercase = char.is_letter_lowercase();
should_group
or even
let should_group = !(last_char_was_lowercase && char.is_letter_uppercase());
last_char_was_lowercase = char.is_letter_lowercase();
should_group
Should solve your issue ☺️
However, the word OpenSSLError
would be segmented as ["Open", "SSLError"]
with this change, it would need a bit more efforts to make it segment like ["Open", "SSL", "Error"]
. 🤔
I am unsure how "OpenSSLError" should be segment. Perhaps in the context of a program constant name appearing in documentation, it is expected not to be segment. Anyway, I don't understand the use case for the latin-camelcase feature. So, I will simply disable it. What I wanted to report here is that the excessive segmentation of "OpenSSL" into "Open|S|S|L" is likely not expected by anyone.
Hey @hamano, no problem. I just wanted to say that if you want to use the feature, you can create a PR enhancing the behavior and I will accept it. See you!
The default featre has issue with proper noun segmentation like
OpenSSL
.main.rs:
default feature:
disable default feature