pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text

Use softmax function instead of min-max normalization #99

Closed by Alex-Kopylov 1 year ago

Alex-Kopylov commented 1 year ago

What do you think about passing the results to a softmax function instead of min-max normalization? I think it's a clearer approach because the outputs sum to 1, so you can, for example, apply a threshold to filter out unidentified languages.
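To make the comparison concrete, here is a minimal sketch of both normalizations. The scores are made-up summed log-probabilities, and the helper names `min_max_normalize` and `softmax` and the `THRESHOLD` value are hypothetical, not lingua-py's actual API or defaults.

```python
import math

def min_max_normalize(scores):
    """Rescale scores so the best language gets 1.0 and the worst gets 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: avoid division by zero.
        return {lang: 1.0 for lang in scores}
    return {lang: (s - lo) / (hi - lo) for lang, s in scores.items()}

def softmax(scores):
    """Turn scores into values that are positive and sum to 1."""
    peak = max(scores.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {lang: math.exp(s - peak) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}

# Hypothetical scores, for illustration only.
scores = {"ENGLISH": -10.2, "GERMAN": -12.9, "FRENCH": -15.4}

min_max = min_max_normalize(scores)
soft = softmax(scores)

# With softmax output, a fixed cutoff can drop weak candidates.
THRESHOLD = 0.5  # assumed value, would need tuning
confident = {lang: round(p, 4) for lang, p in soft.items() if p >= THRESHOLD}
print(min_max)
print(confident)
```

With min-max normalization the top language always gets 1.0 and the bottom one 0.0, however close the raw scores are, so a fixed threshold says little. Softmax values depend on the gaps between scores, which is what makes a cutoff meaningful.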

Are there any pitfalls that I'm missing? I've implemented this by slightly changing your code. I've also rounded the results.

It passed black and mypy, but not the tests. It throws an error like this: INTERNALERROR> UnicodeEncodeError: 'charmap' codec can't encode characters in position 712-720: character maps to <undefined>
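For context, a 'charmap' error like this typically appears when non-Latin text is encoded with a legacy single-byte code page such as cp1252, which is often the default console encoding on Windows. Whether that is the cause here is an assumption; the thread does not confirm it. A minimal sketch with a made-up sample string:

```python
# Sample non-Latin text (hypothetical, not taken from the test suite).
text = "Привет, мир"

try:
    # cp1252 has no Cyrillic characters, so this raises UnicodeEncodeError.
    text.encode("cp1252")
    failed = False
    message = ""
except UnicodeEncodeError as err:
    failed = True
    message = str(err)

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode("utf-8")
print(failed, message)
```

If this is the cause, running the tests with the environment variable `PYTHONUTF8=1` (Python 3.7+) or `PYTHONIOENCODING=utf-8` might avoid the error.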

pemistahl commented 1 year ago

Hi @Alex-Kopylov, thank you for your pull request. I thought that min-max normalization would be a reasonable choice, but it is certainly possible that there is a better normalization method that I have not tried yet.

Why have you closed your PR already? The failing unit tests should be easy to fix, as far as I can see in the CI pipeline. I'm going to reopen the PR now and check whether the softmax normalization is a better fit for the confidence values.

Thanks again for your contribution. I appreciate this a lot. :)

Alex-Kopylov commented 1 year ago

I closed it accidentally. Glad to hear that you're taking these changes into account. I'm going to experiment with different approaches some more and will let you know if I find anything interesting.