psolin / cleanco

Company Name Processor written in Python
MIT License
Getting rid of abbreviations #14

Closed psolin closed 9 years ago

psolin commented 9 years ago

Just wanted to get some thoughts on this. Abbreviation handling seems like it could go beyond the scope of the project.

petri commented 9 years ago

I guess it's then worth clarifying what exactly the scope of the project is. But I agree: expanding the abbreviations does not necessarily serve the purpose of stripping organization names down to their bare essence.

As a first step, I would make abbreviation expansion optional, if it isn't already.
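A minimal sketch of what "optional" could look like, assuming a keyword flag that defaults to off. The function name `normalize`, the `expand_abbreviations` parameter, and the abbreviation table are hypothetical and are not taken from cleanco's actual code:

```python
# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"intl": "international", "mfg": "manufacturing"}


def normalize(name, expand_abbreviations=False):
    """Collapse whitespace; expand abbreviations only when explicitly asked to."""
    words = name.split()
    if expand_abbreviations:
        # Look up each word case-insensitively, ignoring a trailing period.
        words = [ABBREVIATIONS.get(w.lower().rstrip("."), w) for w in words]
    return " ".join(words)
```

With the flag off, the name passes through with only whitespace normalized, so existing callers would see no change in behavior.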

petri commented 9 years ago

Elaborating on the scope, it occurred to me that it would probably be a good idea to first release the business type name data as a separate library. That would make it easy for others to use and build on it. Would you agree? The library could be called "legalentities" or "organizationtypes", for example. Or "business-entities".

Cleanco could then be a separate library that uses the data to clean up the names.

psolin commented 9 years ago

Abbreviations are another area that should be left up to the end user. I think we both get the feeling that this code doesn't belong here. It served its purpose for me at some point, but with a wider audience now, it just doesn't make sense.

The scope of the project should just be to clean up a business name based on the many business entity suffixes out there. Everything else is secondary in my mind. Since we are dealing with data, we can easily return the possible countries / entity types as well. I wouldn't mind adding more parameters, too, like checking prefixes. I've been working on this.
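To make that scope concrete, here is a rough sketch, not the project's actual implementation. The `SUFFIXES` table is a tiny, simplified sample (real suffixes are often shared by several countries), and `clean_name` is a hypothetical name:

```python
# Hypothetical sample data: suffix -> (entity type, countries). Simplified for illustration.
SUFFIXES = {
    "ltd": ("Limited", ["United Kingdom"]),
    "gmbh": ("Limited Liability Company", ["Germany"]),
    "inc": ("Corporation", ["United States of America"]),
    "llc": ("Limited Liability Company", ["United States of America"]),
}


def clean_name(name):
    """Strip one trailing entity suffix; return (base name, entity type, countries)."""
    words = name.replace(",", " ").split()
    if len(words) > 1:
        # Compare the last word case-insensitively, ignoring a trailing period.
        key = words[-1].lower().rstrip(".")
        if key in SUFFIXES:
            entity_type, countries = SUFFIXES[key]
            return " ".join(words[:-1]), entity_type, countries
    # No known suffix: return the name unchanged, with no type or country guess.
    return " ".join(words), None, []
```

For example, `clean_name("Acme Widgets, Inc.")` would return the base name `"Acme Widgets"` along with the entity type and country list looked up from the table.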

The data should at least be split up into different files. I think we agree on this: it shouldn't be mixed in with the code that processes it. I honestly don't know what the best way to do this would be, but you mentioned a few ideas.
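One possible arrangement, sketched under assumptions: the data lives in a plain JSON file (the name `entity_types.json` and its country-to-suffixes layout are hypothetical), and the code only loads and indexes it. A string stands in for the file contents here so the example is self-contained:

```python
import json

# Stand-in for the contents of a separate data file (hypothetical: entity_types.json).
ENTITY_DATA_JSON = """
{
  "Germany": ["GmbH", "AG"],
  "Finland": ["Oy", "Oyj"]
}
"""


def load_suffix_index(raw_json):
    """Invert country -> suffixes into lowercase suffix -> sorted list of countries."""
    by_country = json.loads(raw_json)
    index = {}
    for country, suffixes in by_country.items():
        for suffix in suffixes:
            index.setdefault(suffix.lower(), []).append(country)
    return {suffix: sorted(countries) for suffix, countries in index.items()}
```

Keeping the file in a neutral format like this would also let other projects consume the data without importing any Python code.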

There are other projects on GitHub that are straight-up data projects, that's true. Some cover US zip codes, and others keep track of countries. I agree that having one to track business entities by country would be useful. Countries come into existence and merge every year, so keeping track of this separately would make the script more accurate. As for a country changing its laws to add or get rid of entity types, I don't think that happens too often (though I could be wrong).

I believe that a user should still be able to download the script and get some functionality from it, at least for now.

petri commented 9 years ago

Very well. I have removed the abbreviation code, so this can perhaps be closed. Also note that the data has now been split into a separate module.