psolin / cleanco

Company Name Processor written in Python
MIT License
322 stars, 95 forks

Handle prefixed (and in-middle), possibly multiple terms #8

Closed by petri 9 years ago

petri commented 9 years ago

In Finland, you sometimes see the format "Oy Corporation Ab" where "Oy" refers to limited liability (in Finnish) and "Ab" the same (in Swedish, the other official language of Finland).

In other words, the abbreviations can also appear in front of the company name - or both before and after.

psolin commented 9 years ago

I'll probably have to write a separate little bit of code to tackle this.

The code is very much ethnocentric as it stands.

petri commented 9 years ago

Ok. I just found out that in Finland the terms ("Oy" for example) can also appear in the middle of the name. It's not very common, but not exactly rare either.

I would like to suggest that terms be removed wherever they are found, as long as each one is delimited on both sides by whitespace or by the start or end of the name.
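To illustrate the suggestion, here is a minimal sketch of position-independent removal. It is not cleanco's actual code: `TERMS` and `strip_terms` are hypothetical names, and the term list is a made-up sample. It assumes a term counts as a match only when whitespace or the start or end of the name sits on both sides of it.

```python
import re

# Hypothetical sample of terms for illustration only; not cleanco's dataset.
TERMS = ["Oy", "Ab", "Ltd", "Inc"]


def strip_terms(name, terms=TERMS):
    """Remove every listed term, wherever it appears in the name.

    A term matches only if it is delimited on both sides by whitespace
    or by the start/end of the name, so "Oy" inside "Oyster" is kept.
    """
    # Try longer terms first so a short term never shadows a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # (?<!\S) and (?!\S) mean "no non-whitespace character adjacent",
    # which is true at whitespace and at the string boundaries.
    pattern = re.compile(r"(?<!\S)(?:" + alternatives + r")(?!\S)", re.IGNORECASE)
    stripped = pattern.sub(" ", name)
    # Collapse the gaps left behind by the removed terms.
    return " ".join(stripped.split())


print(strip_terms("Oy Corporation Ab"))   # prefix and suffix
print(strip_terms("Nordic Oy Holdings"))  # term in the middle
print(strip_terms("Oyster Bar Ltd"))      # "Oy" inside a word is untouched
```

Because matching is case-insensitive and covers every position, this sketch removes more than a suffix-only approach would, which is exactly the over-removal risk discussed in this thread.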

But given that the library removes all terms it finds in its database (i.e. terms for all countries), I wonder if some legitimate company name parts will then accidentally be removed? On the other hand, that is already a possibility now, albeit a smaller one.

Perhaps an option could be added to the library to remedy that, so that only a user-selectable portion of the dataset is used for the terms to be removed?
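One way such an option might look is sketched below. Everything here is an assumption made for illustration: the `TERMS_BY_COUNTRY` table, the `countries` parameter, and both function names are hypothetical and do not describe cleanco's API or data.

```python
# Hypothetical per-country term table; not cleanco's actual dataset.
TERMS_BY_COUNTRY = {
    "Finland": ["Oy", "Ab", "Oyj"],
    "Germany": ["GmbH", "AG"],
    "United States": ["Inc", "LLC", "Corp"],
}


def select_terms(countries=None, table=TERMS_BY_COUNTRY):
    """Return the set of terms for the chosen countries (all if None)."""
    if countries is None:
        countries = list(table)
    selected = set()
    for country in countries:
        selected.update(table[country])
    return selected


def strip_selected_terms(name, countries=None):
    """Drop whitespace-delimited tokens that are terms of the chosen countries."""
    terms = select_terms(countries)
    kept = [token for token in name.split() if token not in terms]
    return " ".join(kept)


# With only Finnish terms selected, the token "AG" survives as part of the name.
print(strip_selected_terms("Oy AG Trading Ab", countries=["Finland"]))
# With the whole table in use, "AG" is removed as well.
print(strip_selected_terms("Oy AG Trading Ab"))
```

Restricting the term set trades coverage for precision: the caller has to know which countries are relevant, but tokens that are only terms elsewhere are left alone.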

If this is ok, I can provide an implementation.