nickdavidhaynes / spacy-cld

Language detection extension for spaCy 2.0+
MIT License

Interpretation of score #11

Open Natalie-Caruana opened 2 years ago

Natalie-Caruana commented 2 years ago

Hi, pycld2's detect function with returnVectors set to False returns three values (a fourth, vectors, is returned only when returnVectors=True). As I understand it, assuming one language is detected, spacy-cld calculates its confidence score by taking the third element of the first tuple in details (the third value returned by pycld2) and dividing it by 100, i.e.

import pycld2 as cld2

reliable, textBytesFound, details = cld2.detect(text)

spacy_score = details[0][2] / 100

However, in the documentation for pycld2's detect function, the third return value, details, is explained as follows:

details: tuple
Tuple of up to three detected languages, where each tuple is (languageName, languageCode, percent, score). percent is what percentage of the original text was detected as this language and score is the confidence score for that language.

So if percent is the percentage of the original text detected as a given language, it measures coverage and is not related to how good the prediction was. Shouldn't some form of normalization be applied to the fourth element, score, instead?
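To make the distinction concrete, here is a minimal sketch in plain Python. It does not call pycld2: the `details` tuple below is hypothetical, with made-up numbers in the documented (languageName, languageCode, percent, score) shape. The helper names `percent_scores` and `normalized_scores` are also hypothetical, and dividing each score by the sum of scores is only one possible normalization, not something spacy-cld or pycld2 is known to do.

```python
# Hypothetical `details` tuple in the shape pycld2 documents:
# (languageName, languageCode, percent, score).
# The numbers are invented for illustration, not real detector output.
details = (
    ("ENGLISH", "en", 60, 1100.0),
    ("FRENCH", "fr", 39, 500.0),
    ("Unknown", "un", 0, 0.0),  # placeholder entry for an unused slot
)


def percent_scores(details):
    """Share of the text attributed to each language (percent / 100)."""
    return {
        code: percent / 100
        for _, code, percent, _ in details
        if code != "un"
    }


def normalized_scores(details):
    """One possible normalization: each score divided by the sum of scores."""
    total = sum(score for _, code, _, score in details if code != "un")
    if total == 0:
        return {}
    return {
        code: score / total
        for _, code, _, score in details
        if code != "un"
    }


print(percent_scores(details))
print(normalized_scores(details))
```

The two dictionaries answer different questions: the first says how much of the text each language covers, while the second ranks the languages by the detector's own score relative to each other.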

Thanks