cleaning company names by
1) removing special chars, lower-casing, stripping whitespace and so on
2) removing frequent, low-information terms (identified by their idf scores) with a deterministic method, e.g. 'ltd', 'group', 'international'
this reduces the number of unique companies from 18k to ~12k
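
a minimal sketch of the two cleaning steps, not the actual implementation. the function name, the regex and the fallback for names made up only of removed terms are assumptions for illustration; only 'ltd', 'group' and 'international' come from the notes above, and the real term list is longer.

```python
import re

# Hypothetical stop list: only these three terms are named in the notes;
# the real list of terms to remove is not given here.
STOP_TERMS = {"ltd", "group", "international"}


def clean_company_name(name: str) -> str:
    """Normalise a raw company name (illustrative sketch only)."""
    # Step 1: lower-case, turn runs of special characters/whitespace
    # into a single space, then split into tokens.
    tokens = re.sub(r"[\W_]+", " ", name.lower()).split()
    # Step 2: drop the fixed list of low-information terms.
    kept = [t for t in tokens if t not in STOP_TERMS]
    # Assumption: if every token was a stop term, keep the original tokens
    # so the name does not collapse to an empty string.
    return " ".join(kept or tokens)


names = ["Acme Ltd.", "ACME Group", "24/7 International"]
unique = sorted({clean_company_name(n) for n in names})
print(unique)  # ['24 7', 'acme']
```

counting the distinct cleaned names, as in the last lines, is how a before/after figure for unique companies would be measured.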
an open issue here is that some company names are variants of the same company and should be merged
e.g.
24 7
24 seven
247
but doing that automatically, while keeping apart companies that should NOT be merged, is hard and does more harm than good
e.g.