Problems parsing company names with punctuations

psolin / cleanco

Company Name Processor written in Python

MIT License


Problems parsing company names with punctuations #29

Open rychoo2 opened 7 years ago

rychoo2 commented 7 years ago

Hello,

Very nice module, but it doesn't always handle the real, human-entered company names we deal with a lot. Below are some obvious examples where the name is not parsed:

LIBGAS,LTD -> LIBGAS,LTD
AIRDAS USA,LLC -> AIRDAS USA,LLC
GF LOGISTICS.INC -> GF LOGISTICS.INC
HAKUTATZ.TECH.CO.,LTD. -> HAKUTATZ.TECH.CO.,LTD

Thanks

petri commented 7 years ago

Perhaps the issue here is that there is no space between the name and the suffix? What countries are these companies based in?

rychoo2 commented 7 years ago

Correct, as long as there is whitespace it is parsed OK. These companies are based in the USA and China, but I believe the key is that the data was probably entered in China, where people are not used to putting whitespace between the name and the suffix. I believe the library could be made immune to that.

psolin commented 5 years ago

I see how this could be an issue, but only because you didn't clean up your data first. What is typical is that there is whitespace and then the entity abbreviation. That is how everyone writes these business name strings. I don't think the script should look for whitespace and/or any non-alphanumeric symbol and then run a lookup; I don't think it is responsible for adding spaces after symbols either.

Edit: Yes, spaces and a trailing comma are removed, only because (again) this is a standard way to write a business name.
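
For reference, the kind of lookup discussed above (accepting whitespace or punctuation before the suffix) could be sketched roughly as follows. This is only an illustration of the idea, not how cleanco works: `SUFFIXES` is a hypothetical, heavily abbreviated list and `strip_suffix` is a made-up helper name.

```python
import re

# Hypothetical, abbreviated suffix list for illustration only;
# this is not cleanco's own data.
SUFFIXES = ("LTD", "LLC", "INC")

# A suffix counts when it is preceded by whitespace, a comma, or a period,
# and it may be followed by one trailing period before the end of the string.
_SUFFIX_RE = re.compile(
    r"[\s,.]+(?:" + "|".join(SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_suffix(name: str) -> str:
    """Remove one trailing legal suffix, tolerating punctuation separators."""
    return _SUFFIX_RE.sub("", name).strip()


for raw in ("LIBGAS,LTD", "AIRDAS USA,LLC", "GF LOGISTICS.INC"):
    print(raw, "->", strip_suffix(raw))
```

A sketch like this only strips the last suffix, so a name such as HAKUTATZ.TECH.CO.,LTD. would still keep its "CO" part.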

petri commented 5 years ago

I have seen the entity abbreviation being separated by a comma (more often comma + whitespace, actually). Although I'd agree that whitespace (no comma) is a more common separator.

I guess we could replace commas with whitespace as a preprocessing step? I am a little surprised we did not already have this :) In any case, I don't have time to work on that.

As @psolin pointed out, replacing commas with whitespace would probably be an easy data cleanup workaround.
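
A minimal sketch of that cleanup step, assuming it runs on the caller's side before the name is handed to cleanco. The helper name `commas_to_spaces` is hypothetical and not part of the library.

```python
import re


def commas_to_spaces(name: str) -> str:
    """Replace each comma with a space, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name.replace(",", " ")).strip()


for raw in ("LIBGAS,LTD", "AIRDAS USA,LLC", "Acme, Inc."):
    print(raw, "->", commas_to_spaces(raw))
```

This only addresses comma-separated suffixes. A period-separated name such as GF LOGISTICS.INC passes through unchanged and would need a separate rule.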