Closed Rom888 closed 1 year ago
Detection of multiple languages sometimes returns indices in bytes, but sometimes in runes (code points).

To reproduce:

    package main

    import (
    	"fmt"

    	"github.com/pemistahl/lingua-go"
    )

    func main() {
    	sentence := ""

    	fmt.Printf("--- this will return indices in bytes:")
    	sentence = "Parlez çççç? I would like"
    	split(sentence)

    	fmt.Printf("\n\n")

    	fmt.Printf("--- this will return indices in code points (runes):")
    	sentence = "ççççfran"
    	split(sentence)
    }

    func split(sentence string) {
    	languages := []lingua.Language{
    		lingua.English,
    		lingua.French,
    	}

    	detector := lingua.NewLanguageDetectorBuilder().
    		FromLanguages(languages...).
    		// WithLowAccuracyMode().
    		Build()

    	detectionResults := detector.DetectMultipleLanguagesOf(sentence)

    	// Dump the input as hex bytes so that byte offsets can be checked by hand.
    	fmt.Printf("\ninput str:\n%s\n", sentence)
    	for i := 0; i < len(sentence); i++ {
    		fmt.Printf("% x", sentence[i])
    		// fmt.Printf("%q", sentence[i])
    	}
    	fmt.Printf("\n")

    	for _, result := range detectionResults {
    		fmt.Printf("\n%s %d %d :\n", result.Language(), result.StartIndex(), result.EndIndex())
    		// First line: indices interpreted as byte offsets.
    		fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])
    		// Second line: the same indices interpreted as rune offsets.
    		fmt.Printf("%s: '%s'\n", result.Language(), string([]rune(sentence)[result.StartIndex():result.EndIndex()]))
    	}
    }

Output:

    --- this will return indices in bytes:
    input str:
    Parlez çççç? I would like
    50 61 72 6c 65 7a 20 c3 a7 c3 a7 c3 a7 c3 a7 3f 20 49 20 77 6f 75 6c 64 20 6c 69 6b 65

    French 0 17 :
    French: 'Parlez çççç? '
    French: 'Parlez çççç? I wo'

    English 17 29 :
    English: 'I would like'
    English: 'uld like'

    --- this will return indices in code points (runes):
    input str:
    ççççfran
    c3 a7 c3 a7 c3 a7 c3 a7 66 72 61 6e

    French 0 8 :
    French: 'çççç'
    French: 'ççççfran'

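For anyone reading the output above: the two interpretations diverge because each `ç` occupies two bytes in UTF-8, so a byte offset and a rune offset only agree for pure-ASCII prefixes. The sketch below reproduces the second case using only the standard library. `runeToByteOffset` is a hypothetical helper written for this illustration, not part of lingua-go, and it only shows one possible way a caller could normalize a rune index to a byte offset.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToByteOffset is a hypothetical helper (not part of lingua-go).
// It converts a rune (code point) index into the matching byte offset in s.
// Indices at or past the end of s are clamped to len(s).
func runeToByteOffset(s string, runeIndex int) int {
	if runeIndex <= 0 {
		return 0
	}
	count := 0
	// Ranging over a string yields the byte offset at which each rune starts.
	for byteOffset := range s {
		if count == runeIndex {
			return byteOffset
		}
		count++
	}
	return len(s)
}

func main() {
	s := "ççççfran"

	// The byte length and the rune count differ because 'ç' is 2 bytes in UTF-8.
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 12 8

	// Treating the reported end index 8 as a byte offset truncates the string.
	fmt.Printf("%q\n", s[0:8]) // "çççç"

	// Treating it as a rune index and converting first covers the whole string.
	end := runeToByteOffset(s, 8)
	fmt.Printf("%q\n", s[0:end]) // "ççççfran"
}
```

Slicing a Go string always uses byte offsets, so indices reported in runes have to be converted (or the string converted to `[]rune`) before slicing.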
Thank you @Rom888 for your bug report. I've fixed it now (better late than never).