pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text
Apache License 2.0

Detection of multiple languages: bytes, runes #43

Closed Rom888 closed 1 year ago

Rom888 commented 1 year ago

Detection of multiple languages sometimes returns indices in bytes and sometimes in runes (code points). To reproduce:

package main

import (
  "fmt"
  "github.com/pemistahl/lingua-go"
)

func main() {
  sentence := ""
  fmt.Printf("--- this will return indices in bytes:")
  sentence = "Parlez çççç? I would like"
  split(sentence)

  fmt.Printf("\n\n")
  fmt.Printf("--- this will return indices in code points (runes):")
  sentence = "ççççfran"
  split(sentence)
}

func split(sentence string) {
  languages := []lingua.Language{
    lingua.English,
    lingua.French,
  }

  detector := lingua.NewLanguageDetectorBuilder().
    FromLanguages(languages...).
    // WithLowAccuracyMode().
    Build()
  detectionResults := detector.DetectMultipleLanguagesOf(sentence)

  fmt.Printf("\ninput str:\n%s\n", sentence)

  for i := 0; i < len(sentence); i++ {
    fmt.Printf("% x", sentence[i])
    // fmt.Printf("%q", sentence[i])
  }
  fmt.Printf("\n")

  for _, result := range detectionResults {
    fmt.Printf("\n%s %d %d :\n", result.Language(), result.StartIndex(), result.EndIndex())

    fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])

    fmt.Printf("%s: '%s'\n", result.Language(), string([]rune(sentence)[result.StartIndex():result.EndIndex()]))
  }
}

output:

--- this will return indices in bytes:
input str:
Parlez çççç? I would like
 50 61 72 6c 65 7a 20 c3 a7 c3 a7 c3 a7 c3 a7 3f 20 49 20 77 6f 75 6c 64 20 6c 69 6b 65

French 0 17 :
French: 'Parlez çççç? '
French: 'Parlez çççç? I wo'

English 17 29 :
English: 'I would like'
English: 'uld like'

--- this will return indices in code points (runes):
input str:
ççççfran
 c3 a7 c3 a7 c3 a7 c3 a7 66 72 61 6e

French 0 8 :
French: 'çççç'
French: 'ççççfran'

pemistahl commented 1 year ago

Thank you @Rom888 for your bug report. I've fixed it now (better late than never).