psolin / cleanco

Company Name Processor written in Python
MIT License

use ISO 20275 data from GLEIF #32

Open petri opened 7 years ago

petri commented 7 years ago

See https://www.gleif.org/en. There's a lot of data that would help improve the legal affix database of cleanco.

psolin commented 5 years ago

The ELF Code List definitely has more abbreviations: https://www.gleif.org/en/about-lei/code-lists

I am just not sure what the US/UK equivalents are for some of the languages. However, there may be some more obvious ones that have been missed. I will keep a note of this.

petri commented 4 years ago

I suspected someone might have done this by now, and sure enough: https://pypi.org/project/iso-20275 .

Since 2017, there has been an ISO standard for this: ISO 20275, 'Financial services – Entity legal forms (ELF)'.

psolin commented 4 years ago

Cleanco was still built to ID entity types in strings, so I think it’s fine to move towards incorporating this package. It was only a matter of time before the data was standardized and put into a python package. Moving away from solely being US/UK based and towards an international standard is for the best for this package.

If incorporated, it would fix most of our open issues as well. I’ll look into doing this.

petri commented 4 years ago

For getting the base name without legal term affixes, the unique terms list from the ISO standard should probably be patched in here: https://github.com/psolin/cleanco/blob/master/cleanco/clean.py#L25-L29
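As a rough illustration of what "getting the base name without legal term affixes" means, here is a minimal sketch. It is not cleanco's actual implementation; the function name `strip_legal_suffix` and the term list are hypothetical, and it only handles suffixes.

```python
import re

def strip_legal_suffix(name, terms):
    """Return name without a trailing legal-form term, if one matches.

    Hypothetical helper for illustration only; not part of cleanco.
    """
    cleaned = name.strip()
    # Try longer terms first so a longer affix wins over a shorter one.
    for term in sorted(terms, key=len, reverse=True):
        # The term must be preceded by whitespace or a comma and end the string,
        # optionally followed by a trailing period.
        pattern = r"[\s,]+" + re.escape(term) + r"\.?$"
        match = re.search(pattern, cleaned, flags=re.IGNORECASE)
        if match:
            return cleaned[:match.start()].rstrip(" ,")
    return cleaned

# Illustrative term list; a real one would come from the ISO data and/or termdata.
example_terms = ["ltd", "b.v.", "stichting"]
base = strip_legal_suffix("Acme Widgets Ltd.", example_terms)
```

With a much larger term list the same idea applies; only the source of `terms` changes.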

petri commented 4 years ago

This could be broken into two or three different tickets;

psolin commented 4 years ago

Just to give you an idea of where this is going - I am counting 1,180 unique business entity affixes in this package to our 202. These are the classifiers (properties) that they use as well:

['alpha2', 'alpha2_2', 'country', 'creation_date', 'elf', 'jurisdiction', 'local_abbreviations', 'local_name', 'modification', 'modification_date', 'reason', 'status', 'transliterated_abbreviations', 'transliterated_name']

petri commented 4 years ago

Given that we now better understand the differences between iso20275 data and cleanco termdata, it seems to me we need a decision on data strategy. The current PR gets rid of cleanco termdata in favour of iso20275. But in hindsight it seems to me that iso20275 should instead be used as the primary, but not exclusive, source.

On the other hand, both iso20275 and cleanco also need a mechanism by which users can use their own legal form data if needed. It would make sense if both packages used the same mechanisms and formats.
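To make the "primary, but not exclusive" idea concrete, here is one possible sketch of layering several term sources in priority order. The function name, the data layout (country code mapped to terms) and the sample values are all assumptions for illustration; neither package is known to work this way.

```python
def merge_term_sources(sources):
    """Merge legal-form terms from several sources in priority order.

    sources: list of (source_name, {country_code: iterable_of_terms}).
    Returns {country_code: {term: source_name}}; the first source that
    lists a term is recorded as its origin.
    Hypothetical helper, not an API of cleanco or iso20275.
    """
    merged = {}
    for source_name, data in sources:
        for country, terms in data.items():
            bucket = merged.setdefault(country, {})
            for term in terms:
                # setdefault keeps the earlier (higher-priority) source.
                bucket.setdefault(term.lower(), source_name)
    return merged

# Illustrative values only, not copied from either dataset.
iso_terms = {"NL": ["stichting"], "JP": ["有限会社"]}
termdata_terms = {"NL": ["stichting"], "JP": ["Y.K."]}
user_terms = {"NL": ["st."]}

merged = merge_term_sources([
    ("iso20275", iso_terms),      # primary source
    ("termdata", termdata_terms), # fills gaps in the primary source
    ("user", user_terms),         # user-supplied additions
])
```

Keeping the source name next to each term would make it easy to see later which entries the ISO list is missing.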

Thoughts?

FBnil commented 2 years ago

Replying to your "Thoughts?": at first I was happy; for example, the Netherlands has all the forms that are included in cleanco. But then Japanese does not have the romaji versions (Y.K., which termdata will have if a pull request is accepted), only the kanji versions (有, which is only the first character of 有限会社; I don't know if it's really written like that, but in the Chinese data it's written out).

https://en.wikipedia.org/wiki/Y%C5%ABgen_gaisha

And even Dutch is incomplete; for example, "Foundation": "V44D","Netherlands","NL","","","stichting","Dutch","nl","stichting","","","2017-11-30","ACTV","","",""

Looking it up, it seems that "st." is the official abbreviation, with "fdn" (and, less commonly, "fndn." or "fou.") also in use. In practice, though, the word is written out in full, because hey, you want to state clearly that you are a foundation.

So my conclusion is that there is still no good list, and I agree with @petri that maybe both lists need to be eligible. Or at least we could merge the differences into a new version of iso20275, including the many missing entries that termdata does have, and then use that as a master list.

In practice it means we need to fix the bug where custom_basename() is unusable in its current state and let users add their settings in an easy way, without jumping through hoops.
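The gap in the Dutch row quoted above can be checked with a few lines of Python. This is only a sketch: the column order is my assumption based on the published ELF code list layout, and the names `language` and `language_code` are made up here (the others follow the property names listed earlier in this thread).

```python
import csv
import io

# The row exactly as quoted above.
ROW = ('"V44D","Netherlands","NL","","","stichting","Dutch","nl",'
       '"stichting","","","2017-11-30","ACTV","","",""')

# Assumed column order; "language" and "language_code" are hypothetical names.
COLUMNS = [
    "elf", "country", "alpha2", "jurisdiction", "alpha2_2",
    "local_name", "language", "language_code", "transliterated_name",
    "local_abbreviations", "transliterated_abbreviations",
    "creation_date", "status", "modification", "modification_date", "reason",
]

fields = next(csv.reader(io.StringIO(ROW)))
record = dict(zip(COLUMNS, fields))

# Both abbreviation columns are empty for this entry.
has_abbreviations = bool(
    record["local_abbreviations"] or record["transliterated_abbreviations"]
)
```

Under that column assumption, `has_abbreviations` comes out false for this row, so an abbreviation such as "st." would have to come from another source.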