psolin / cleanco

Company Name Processor written in Python
MIT License
324 stars 95 forks

improved term normalization #56

Closed: twalen closed this 4 years ago

twalen commented 4 years ago

petri commented 4 years ago

Thanks for a nice contribution. If I got it right, this PR (among other things) overcomes deficiencies in out-of-the-box Unicode normalization by handling troublesome characters manually?

Regarding removing dots, commas and dashes from terms, why is that necessary? Are we removing those from names but not from terms? Or...?

twalen commented 4 years ago

Yes, this handles some characters manually (according to the map in NON_NFKD_MAP).
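For anyone reading without the diff open, here is a rough sketch of the idea, not the code from this PR. NFKD splits a letter like "é" into a base letter plus a combining mark, but it leaves letters such as "ł", "ø" or "ß" untouched, so those need an explicit map. The map contents and the function name strip_accents below are assumptions for illustration; the real NON_NFKD_MAP may hold different entries.

```python
import unicodedata

# Hypothetical stand-in for NON_NFKD_MAP; the entries in the PR may differ.
# These letters have no NFKD decomposition, so normalize() leaves them as-is.
EXAMPLE_NON_NFKD_MAP = {
    "\u0141": "L",   # Ł
    "\u0142": "l",   # ł
    "\u00d8": "O",   # Ø
    "\u00f8": "o",   # ø
    "\u00df": "ss",  # ß
    "\u00e6": "ae",  # æ
}


def strip_accents(text):
    """Return text with accents removed (illustrative helper, not cleanco API)."""
    # Step 1: replace the characters that NFKD cannot decompose.
    mapped = "".join(EXAMPLE_NON_NFKD_MAP.get(ch, ch) for ch in text)
    # Step 2: decompose the rest into base letters plus combining marks.
    decomposed = unicodedata.normalize("NFKD", mapped)
    # Step 3: drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this sketch, strip_accents("Spółka") gives "Spolka". Without the manual map the "ł" would survive normalization, and a term containing it would not match its ASCII spelling.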

Dots, commas, etc:

petri commented 4 years ago

Seems good to me.