Open pudo opened 1 day ago
Here's a bit of a brainstorm on what a metadata file could look like that enables some of this. It would be minimally normalised for display re-writes, and then we could also generate the current contents of types.yml from it:
person_name_prefixes:
- "Mr"
- "Ms"
- "Mrs"
- "Mister"
- "Miss"
- "Madam"
- "Madame"
- "Monsieur"
- "Honorable"
- "Honourable"
- "Mme"
- "Mmme"
- "Herr"
- "Hr"
- "Frau"
- "Fr"
- "The"
- "Fräulein"
- "Senor"
- "Senorita"
- "Sheik"
- "Sheikh"
- "Shaikh"
- "Sr"
- "Sir"
- "Lady"
basic_stopwords:
- "de"
- "of"
- "and"
- "&"
company_stopwords:
- Company
- Business
- Management
- International
- Intl
- Corporation
- Corp
- Fund
- Holding
- Holdings
- Trading
- Import
- Export
- Trust
- Services
- Industries
- Consulting
- Partner
- Partners
- Solutions
- Group
- Foundation
# - Fdn
- Commercial
company_stopwords_broad:
- Development
- Financial
- Investment
- Investments
company_types:
- simple: GmbH
  broader: Ltd
  alias:
  - Gesellschaft mit beschränkter Haftung
- simple: GmbH & Co. KG
  broader: GmbH
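
To make the shape of company_types a bit more concrete, here is a minimal Python sketch of how those entries might be turned into lookups. The data is a hand-copied, in-memory version of the two entries above, and the names `COMPANY_TYPES`, `build_alias_index` and `broader_chain` are hypothetical; none of this is existing fingerprints code.

```python
# Hypothetical in-memory copy of the company_types entries drafted above.
COMPANY_TYPES = [
    {
        "simple": "GmbH",
        "broader": "Ltd",
        "alias": ["Gesellschaft mit beschränkter Haftung"],
    },
    {"simple": "GmbH & Co. KG", "broader": "GmbH"},
]


def build_alias_index(types):
    """Map each casefolded alias (and the simple form itself) to the simple form."""
    index = {}
    for entry in types:
        simple = entry["simple"]
        index[simple.casefold()] = simple
        for alias in entry.get("alias", []):
            index[alias.casefold()] = simple
    return index


def broader_chain(types, simple):
    """Follow 'broader' links from a simple form, most specific first."""
    broader = {e["simple"]: e["broader"] for e in types if "broader" in e}
    chain = [simple]
    # Stop when there is no broader form, or when a link would loop back.
    while chain[-1] in broader and broader[chain[-1]] not in chain:
        chain.append(broader[chain[-1]])
    return chain
```

With this data, `broader_chain(COMPANY_TYPES, "GmbH & Co. KG")` walks GmbH & Co. KG, then GmbH, then Ltd, which is one way the "how generic" choice could be expressed as a position in the chain.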
Right now, fingerprints can only remove company type information from a company name, or generate a shortened form on a very simplified string: Siemens Aktiengesellschaft -> ag siemens. I'd like to expand that functionality to:

a. Enable the simplification/rewrite of long company types such that they can still be shown to the user afterwards (Siemens Aktiengesellschaft -> Siemens AG).
b. Use that same mapping database to do the strong normalization, perhaps including the ability to choose how "generic" to make the re-write. For example, the Russian company type OOO is sometimes normalised to LLC, which is a sort of radical simplification we could keep as "Level 2" and make optional.
c. Have an option to generate simplified company names with stopwords normalised ("Company", "International", etc.).
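
A rough Python sketch of how a., b. and c. could fit together, under a few assumptions: the tables are hypothetical and only cover the examples mentioned in this issue, matching is done per whitespace token (so multi-word aliases are not handled), and "stopwords normalised" is interpreted here as dropping them. `rewrite_types` and `drop_stopwords` are illustrative names, not existing fingerprints functions, and "Romashka" below is a placeholder company name.

```python
# Hypothetical table: casefolded long form -> (display form, broader "Level 2" form).
TYPE_FORMS = {
    "aktiengesellschaft": ("AG", None),
    "ooo": ("OOO", "LLC"),
}

# Hypothetical subset of the company_stopwords list drafted above.
STOPWORDS = {"company", "international", "intl", "holding", "holdings", "group"}


def rewrite_types(name, level=1):
    """Rewrite known company types; level 2 prefers the broader form if one exists."""
    out = []
    for token in name.split():
        forms = TYPE_FORMS.get(token.casefold())
        if forms is None:
            out.append(token)  # not a known type, keep the token for display
            continue
        display, broader = forms
        out.append(broader if level >= 2 and broader else display)
    return " ".join(out)


def drop_stopwords(name):
    """Remove generic company words, ignoring case and trailing punctuation."""
    kept = [t for t in name.split() if t.casefold().strip(".,") not in STOPWORDS]
    return " ".join(kept)


print(rewrite_types("Siemens Aktiengesellschaft"))  # display-safe rewrite (a.)
print(rewrite_types("Romashka OOO", level=2))       # optional "Level 2" rewrite (b.)
print(drop_stopwords("Acme International Holdings"))  # stopword handling (c.)
```

Keeping the level as a parameter means the default output stays presentable to users, while the more radical rewrite is something a caller has to opt into.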