w3c / string-search

Parking lot for advice on internationalization-related string searching in general content
https://w3c.github.io/string-search/

Web page in multiple languages #13

Open xfq opened 2 years ago

xfq commented 2 years ago

What's the expected behaviour if a web page contains multiple languages? For example, if a page contains both Chinese and Japanese, the segmentation process and the full-text indexes could differ between the two. Even the same code point sequence may be segmented differently depending on whether the language is ja or zh.

r12a commented 2 years ago

@xfq I'm not sure what the problem is here.

aphillips commented 2 years ago

@xfq If one does true full-text search on a page in multiple languages (as opposed to substring matching, which is the primary topic of our document), then the segmentation, stemming, and other processing (such as named entity recognition) of the corpus should be matched to the language of each block of text. For example, word segmentation for ja is different from that for zh.
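
As a rough illustration of matching the processing to the language of each block, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `ANALYZERS`, `unigrams`, `bigrams`, and `build_index` are hypothetical, and the character n-gram functions are only stand-ins for real Japanese and Chinese word segmenters (unigrams versus bigrams is an arbitrary choice, made so that the two outputs visibly differ).

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in analyzers. A real system would call a proper
# per-language word segmenter here; these only make the dispatch visible.
def whitespace_tokens(text: str) -> List[str]:
    return text.lower().split()

def unigrams(text: str) -> List[str]:
    return [c for c in text if not c.isspace()]

def bigrams(text: str) -> List[str]:
    chars = unigrams(text)
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

# Analyzer chosen by primary language subtag (assumed mapping).
ANALYZERS: Dict[str, Callable[[str], List[str]]] = {
    "en": whitespace_tokens,
    "ja": bigrams,    # placeholder for a Japanese segmenter
    "zh": unigrams,   # placeholder for a Chinese segmenter
}

def build_index(blocks: List[Tuple[str, str]]) -> Dict[Tuple[str, str], Set[int]]:
    """Index each (language tag, text) block with its own analyzer.

    Keys are (language, token) pairs, so identical code point sequences
    in different languages are kept apart in the index.
    """
    index: Dict[Tuple[str, str], Set[int]] = {}
    for block_id, (lang_tag, text) in enumerate(blocks):
        lang = lang_tag.split("-")[0].lower()
        analyze = ANALYZERS.get(lang, whitespace_tokens)
        for token in analyze(text):
            index.setdefault((lang, token), set()).add(block_id)
    return index

blocks = [("ja", "東京都"), ("zh", "東京都"), ("en", "Hello world")]
index = build_index(blocks)
```

The same three code points end up under different keys for the ja and zh blocks, which is the situation described in the original question.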

When search terms are entered against a multilingual index, it may be necessary to do "explosive stemming" (multiple stemming processes using the rules for the various languages in the corpus) or other types of processing to try to match the search terms against the indices.
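
A minimal Python sketch of the "explosive stemming" idea, under loud assumptions: the suffix-stripping rules below are toy stand-ins rather than real English or German stemmers, and `STEMMERS`, `build_indexes`, and `explosive_search` are hypothetical names. The only point is that the query term is stemmed once per language present in the corpus and each stem is looked up in that language's index.

```python
from typing import Callable, Dict, List, Set, Tuple

def strip_suffixes(word: str, suffixes: Tuple[str, ...]) -> str:
    # Toy rule: remove the first matching suffix if enough of the word remains.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical per-language stemmers (not real stemming algorithms).
STEMMERS: Dict[str, Callable[[str], str]] = {
    "en": lambda w: strip_suffixes(w.lower(), ("ing", "es", "s")),
    "de": lambda w: strip_suffixes(w.lower(), ("en", "er", "e")),
}

def build_indexes(docs: List[Tuple[int, str, str]]) -> Dict[str, Dict[str, Set[int]]]:
    """Build one stem index per language from (doc_id, lang, text) triples."""
    indexes: Dict[str, Dict[str, Set[int]]] = {lang: {} for lang in STEMMERS}
    for doc_id, lang, text in docs:
        stem = STEMMERS[lang]
        for word in text.split():
            indexes[lang].setdefault(stem(word), set()).add(doc_id)
    return indexes

def explosive_search(term: str, indexes: Dict[str, Dict[str, Set[int]]]) -> Set[int]:
    """Stem the term with every language's rules and union the matches."""
    hits: Set[int] = set()
    for lang, stem in STEMMERS.items():
        hits |= indexes[lang].get(stem(term), set())
    return hits

docs = [(1, "en", "walking dogs"), (2, "de", "Hunde laufen")]
indexes = build_indexes(docs)
english_hits = explosive_search("walks", indexes)
german_hits = explosive_search("Hunden", indexes)
```

Because the language of a search term is usually unknown, trying every language's rules trades some precision for recall; a real system would likely also rank or filter the combined results.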

FTS is complicated.

As @r12a asks, what is the problem here (with respect to our text)? 😉 Happy to accept suggestions.

xfq commented 2 years ago

I think it should be pointed out that if a piece of text contains multiple languages, then the search for the text needs to be adapted to support multiple languages, not just the primary language.