ENH: for training set preparation add option to drop same names witho…

ing-bank / EntityMatchingModel

Entity Matching Model solves the problem of matching company names between two possibly very large datasets.

https://entitymatchingmodel.readthedocs.io/en/latest/

MIT License


ENH: for training set preparation add option to drop same names witho… #7

Closed mbaak closed 7 months ago

mbaak commented 8 months ago

For training set creation, prepare_name_pairs_pd() now has an option to remove all pairs of equal names that are not considered a match. Such pairs occur often in real data, e.g. with franchises that are independent but share the same name. It is a true effect in the data, but it conflicts with the intuitive notion that identical names should be related. For example, you may want to set this option to true for a model without rank features, which evaluates string similarity.
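
To illustrate the filter described above, here is a minimal sketch in plain Python. It is not the actual code of prepare_name_pairs_pd(): the `NamePair` record, its field names, and the helper `drop_same_name_non_matches` are hypothetical and exist only for this example.

```python
from typing import List, NamedTuple


class NamePair(NamedTuple):
    """One candidate pair in a training set (hypothetical layout)."""
    name: str        # name from the data to be matched
    gt_name: str     # candidate name from the ground truth
    is_match: bool   # True if both names refer to the same entity


def drop_same_name_non_matches(pairs: List[NamePair]) -> List[NamePair]:
    """Drop pairs whose names are identical but that are labelled as non-matches."""
    return [p for p in pairs if not (p.name == p.gt_name and not p.is_match)]


pairs = [
    NamePair("Pizza Roma", "Pizza Roma", True),   # same name, same entity: kept
    NamePair("Pizza Roma", "Pizza Roma", False),  # independent franchise: dropped
    NamePair("Pizza Roma", "Pizza Rome", False),  # different names: kept
]
filtered = drop_same_name_non_matches(pairs)
```

Only the identical-name negative is removed. Negatives with different names stay in the training set, so a string-similarity model still sees non-matching examples.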