opensanctions / fingerprints

Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
MIT License

Implement company type simplification #20

Open pudo opened 1 day ago

pudo commented 1 day ago

Right now, fingerprints can only do one of two things with company type information: remove it from a company name, or emit a shortened form of it as part of a very simplified string (Siemens Aktiengesellschaft -> ag siemens). I'd like to expand that functionality to:

a. Enable the simplification/rewrite of long company types such that they can still be shown to the user afterwards (Siemens Aktiengesellschaft -> Siemens AG)

b. Use that same mapping database to do the strong normalization, perhaps including the ability to choose how "generic" to make the re-write. For example, the Russian company type OOO is sometimes normalised to LLC, which is sort of a radical simplification we could keep as "Level 2" and make optional.

c. Have an option to generate simplified company names with stopwords normalised ("Company", "International", etc.)
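To make the three points above a bit more concrete, here is a rough Python sketch of how they might fit together. None of this is the existing fingerprints API: `CompanyType`, `rewrite_company_type`, `strip_stopwords` and the `level` parameter are hypothetical names, the sample type entries are made up for illustration, and "normalised" stopwords are simply dropped here, which is only one possible reading of (c).

```python
import re
from typing import List, NamedTuple


class CompanyType(NamedTuple):
    """Hypothetical record for one company type mapping."""
    simple: str         # display form, e.g. "AG" (point a)
    broader: str        # more generic form, the optional "Level 2" (point b)
    aliases: List[str]  # long forms that should be rewritten


# Illustrative sample data only, not the real mapping database.
TYPES = [
    CompanyType("GmbH", "Ltd", ["Gesellschaft mit beschränkter Haftung"]),
    CompanyType("AG", "AG", ["Aktiengesellschaft"]),
    CompanyType("OOO", "LLC", ["Общество с ограниченной ответственностью"]),
]

# Illustrative stopwords for point c.
STOPWORDS = {"company", "international", "holding", "holdings", "group"}


def rewrite_company_type(name: str, level: int = 1) -> str:
    """Rewrite the first known company type found in `name`.

    level=1 keeps the display-friendly form, level=2 uses the broader one.
    """
    for ctype in TYPES:
        target = ctype.simple if level < 2 else ctype.broader
        # Try the longest forms first so full phrases win over abbreviations.
        forms = sorted(ctype.aliases + [ctype.simple], key=len, reverse=True)
        for form in forms:
            pattern = re.compile(
                r"(?<!\w)" + re.escape(form) + r"(?!\w)", re.IGNORECASE
            )
            if pattern.search(name):
                # A function replacement avoids escape handling in `target`.
                return pattern.sub(lambda m: target, name, count=1)
    return name


def strip_stopwords(name: str) -> str:
    """Drop generic company stopwords, keeping the remaining tokens in order."""
    kept = [t for t in name.split() if t.strip(".,").casefold() not in STOPWORDS]
    return " ".join(kept)


print(rewrite_company_type("Siemens Aktiengesellschaft"))
print(rewrite_company_type("OOO Romashka", level=2))
print(strip_stopwords("Acme International Holdings Ltd"))
```

The sketch matches types in list order, so compound types such as GmbH & Co. KG would need to be checked before their shorter components; that ordering question is left open here.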

pudo commented 1 day ago

Here's a bit of a brainstorm on what a metadata file could look like that enables some of this. It would be minimally normalised for display re-writes, and then we could also generate the current contents of types.yml from it:

person_name_prefixes:
  - "Mr"
  - "Ms"
  - "Mrs"
  - "Mister"
  - "Miss"
  - "Madam"
  - "Madame"
  - "Monsieur"
  - "Honorable"
  - "Honourable"
  - "Mme"
  - "Mmme"
  - "Herr"
  - "Hr"
  - "Frau"
  - "Fr"
  - "The"
  - "Fräulein"
  - "Senor"
  - "Senorita"
  - "Sheik"
  - "Sheikh"
  - "Shaikh"
  - "Sr"
  - "Sir"
  - "Lady"
basic_stopwords:
  - "de"
  - "of"
  - "and"
  - "&"
company_stopwords:
  - Company
  - Business
  - Management
  - International
  - Intl
  - Corporation
  - Corp
  - Fund
  - Holding
  - Holdings
  - Trading
  - Import
  - Export
  - Trust
  - Services
  - Industries
  - Consulting
  - Partner
  - Partners
  - Solutions
  - Group
  - Foundation
  # - Fdn
  - Commercial
company_stopwords_broad:
  - Development
  - Financial
  - Investment
  - Investments
company_types:
  - simple: GmbH
    broader: Ltd
    alias:
      - Gesellschaft mit beschränkter Haftung
  - simple: GmbH & Co. KG
    broader: GmbH