psolin / cleanco

Company Name Processor written in Python
MIT License
324 stars 95 forks source link

Iso 20275 api #52

Closed psolin closed 4 years ago

psolin commented 4 years ago

There might be a few files which are not up to date based on what you did today. This is a working version with no term data - please test it out if you could.

petri commented 4 years ago

To make sure the PR is ready for merging, please install pytest, cd to the package root directory and run pytest tests/test_cleanname.py . Once all the review questions are discussed & the tests pass, I'm fine with merging.

psolin commented 4 years ago

Good news: I optimized the code as you wanted. The issues above were resolved in 8264aaf - please take a look.

Bad news: Finland is not in the Python library for iso-20275, so all of the tests are failing. It is in the GLEI's code list 1.1, just not in the cleaned CSV file that the author used to build their library. The author referenced 2017-11-30, while the latest copy is 2019-07-19. So, the library is incomplete, which means that we should have even more matches.

By manually forcing the last CSV from the GLEIF website into their library, I can get "oy" to work:

from cleanco import * test = matches("MyCompany oy", "local_name") print(test) ['Osakeyhtiö']

I forked their repository here: https://github.com/psolin/iso-20275-python

Also, the multi/suffix/prefix is failing, and I'll need to look at it more.

petri commented 4 years ago

I added https://github.com/Gawaboumga/iso-20275-python/issues/1 to get the upstream package updated.

petri commented 4 years ago

The last part of basename check raises IndexError in test because there are empty term parts in source data. I guess this is what you refer to with "multi/suffix/prefix is failing"?

petri commented 4 years ago

There's several places in PR where returned (single) values are enclosed in what seems unnecessary brackets; ie. `return (somevalue)'. Is there a reason for this?

psolin commented 4 years ago

There's several places in PR where returned (single) values are enclosed in what seems unnecessary brackets; ie. `return (somevalue)'. Is there a reason for this?

I removed these in 5a3c3dd. It is a bad habit I think.

The last part of basename check raises IndexError in test because there are empty term parts in source data. I guess this is what you refer to with "multi/suffix/prefix is failing"?

I was able to remove the empty term from popping up in get_terms(), but after trying to force install the package on my own system, the tests still don't seem to work. I'm looking into it still.

psolin commented 4 years ago

iso20275 was updated to 0.0.4, so I updated requirements.txt. All tests pass except for OAO which is no longer a term for a corporation.

petri commented 4 years ago

Interestingly, it seems the ISO20275 data provided by GLEIF has lots of holes. It's missing a whole bunch of common Russian abbreviations for example (such as OAO but not just that). To provide a good library, we should have a mechanism to easily incorporate additions that ISO data is missing. The easiest would probably be some sort of CSV merge mechanism that allows extension and override of the CSV downloaded from GLEIF by another package- or even user-contributed CSV.

This is probably something that should be the responsibility of the iso-20275 package.

psolin commented 4 years ago

I think we're going beyond maintaining this data anymore, so the iso-20275 package could potentially be extended if we wanted to do something for the short term. However, I'm also considering the long term implications of what this means. Since you're more aware of what these abbreviations are, I think you should write the GLEIF and see if they will make an update. I just assume that there may be other systems dependent on this data beyond Python-based packages, so others could be missing out. I mentioned that I was also wanting some descriptors for the entities, such as tax protections, owner protections, etc. so that comparisons could me made - not necessarily a priority, but could be useful to bring up as well.

psolin commented 4 years ago

From a technical standpoint, the code is probably where it should be. However, if it is not truly backwards compatible by pulling all previous terms, then we could still hold off on merging this until we figure out a solution.

petri commented 4 years ago

We can at least check what we are missing by comparing old termdata and the terms in the ISO std. I also realized that the code in this PR seems to use only the localized abbreviations. We should probably include both the localized and non-localized ones- or at least make it an option.

petri commented 4 years ago

Here's result of checking how many terms are in ISO but not in cleanco & vice versa:

psolin commented 4 years ago

This is where I went with the code:

get_unique_terms() pulls in both the local and transliterated abbreviations to assist in cleaning the names. cleanco old api will contain an optional argument for transliterated name vs local in the type(). Default is local. I don't want to keep building on this since it is outdated, but it will be at least an option for those who need it.

business_name = "hello World КДА" x = cleanco(business_name) x.type() ['Командитно дружество с акции'] x.type(localized=False) ['Komanditno druzhestvo s aktsii']

classification() prepares both local and transliterated abbreviations to be searched against. Since we're going beyond just "types" and "countries" here, you can put in anything from the classifiers() list as the source which includes both "local_name" and "transliterated_name".

petri commented 4 years ago

I created some code to support user-specific overrides that can be applied on top of the iso20275 data. It depends on https://github.com/Gawaboumga/iso-20275-python/pull/2 and https://github.com/Gawaboumga/iso-20275-python/pull/5 . I'll submit a PR once all the dependencies are merged. So users will be able to augment the iso20275 data. However I am not sure how to best incorporate the existing cleanco term data. I am guessing that either:

  1. it has to be transformed into a CSV so it can be used as an override to iso20275, or
  2. this PR has to merge with termdata and related code rather than replace it
petri commented 4 years ago

The iso20275 pkg has been updated; there's now also https://github.com/Gawaboumga/iso-20275-python/pull/9 waiting that will let users load an arbitrary CSV.

That could then be used by cleanco, if the termdata were converted into the gleif CSV format: the cleanco termdata could be added in addition to what's in gleif CSV. There'd be duplicate terms, but since we check for those, that would not be a worry.

Or, we could just have two different sources by keeping the current termdata.py support also.

After we fix the issue I raised two days ago, I guess we could merge this PR. If you'd want, I can fix that issue too & proceed with the merge?

petri commented 4 years ago

The iso20275 pkg now supports loading ELF entries from arbitrary reader, which can in practice be a simple iterable data structure. So it could use cleanco terms too, or we could merge data from iso20275 and cleanco and use that as a term source and for classification.

psolin commented 4 years ago

First, I am re-writing this to iterate over all ELF codes in order to capture all abbreviations. I see that it was skipping over entries. I've been hung up on a few things, but I'll have something this week for that.

termdata is not in any format that is compatible to the iso20275 CSV data at the moment, and has to be re-written with proper country names to match ISO 3166, for example. It's not going to have ELF codes either, so I'll have to think of another way to iterate through it. If we're going to reach the widest audience, the data should live with the iso20275 package and we should just not be maintaining it anymore, at least in my opinion. We're trying to follow a standard now as closely as possible, and hopefully that organization will take the extended CSV that will be created soon (I'll take a look) and make it as part of the standard. We probably have a better chance of that happening if we follow their format.

petri commented 4 years ago

Ok. I just pushed a iso3166map branch with countries.csv semicolon-separated file that maps all terms_by_country country names to corresponding 2-letter iso3166 codes. For starters, I guess we could switch to a CSV-based terms definition format for cleanco, using the same column names that iso20275 pkg / gleif uses.

petri commented 4 years ago

Regarding terms by legal form, there are 16 legal forms in cleanco. Here's the ones that have a literal name match in iso20275:

Limited Partnership Corporation Mutual Fund Professional Corporation Limited Liability Limited Partnership Private Company General Partnership Limited Liability Partnership Limited Liability Company Professional Limited Liability Company Sole Proprietorship

psolin commented 4 years ago

I think I need to take a fresh look at this given other changes that have been happening. I'll revisit soon.