Open vnagarajY opened 8 years ago
Is this a 2.1 candidate?
The key might not be which metric to use, but rather which representation to use. I think cosine similarity on the reshape layer of the CNN might be a really good approach. I think the raw representation might be too affected by minor character changes.
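A minimal sketch of the cosine comparison, assuming each transaction string has already been mapped to a fixed-length vector (for example the flattened output of the CNN's reshape layer). The function name and the sample vectors are hypothetical, and extracting the vectors from the model is not shown:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical learned representations of two transaction strings.
rep_a = [0.12, 0.80, 0.05, 0.33]
rep_b = [0.10, 0.78, 0.07, 0.35]
score = cosine_similarity(rep_a, rep_b)
```

Because the score depends only on the direction of the vectors, two strings whose learned representations point the same way score close to 1 even if their raw characters differ slightly.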
Also check Jaccard distance: https://en.wikipedia.org/wiki/Jaccard_index http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html
Using this, you can perhaps avoid the one-hot encoding used for cosine similarity.
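A rough sketch of what that could look like: the Jaccard index computed directly on sets of character n-grams, so no one-hot encoding is needed. The trigram size, the lowercasing, and the function names are assumptions for illustration, not something decided in this issue:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n contribute themselves as a single gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a, b, n=3):
    """Jaccard index |A & B| / |A | B| over the character n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Assumption: two empty strings count as identical.
        return 1.0
    return len(grams_a & grams_b) / len(union)
```

The Jaccard distance is then `1 - jaccard_similarity(a, b)`. Working on sets means repeated n-grams count only once, which may or may not be what we want for transaction strings.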
Research whether we can use scientific grouping methods to eliminate near-duplicate transactions from the tagging data, so that tagging can cover a larger sample set. This may even help us get to a longer merchant list faster.
Example: https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
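One possible way to do the near-duplicate elimination, sketched as a greedy pass that keeps a transaction only if it is not too similar to one already kept. The trigram Jaccard measure, the 0.7 threshold, and the sample strings are all assumptions for illustration; the actual metric and cutoff are what this issue should research:

```python
def trigrams(text):
    """Set of character trigrams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < 3:
        return {normalized} if normalized else set()
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(set_a, set_b):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def drop_near_duplicates(descriptions, threshold=0.7):
    """Keep the first transaction of each group of near duplicates."""
    kept = []  # (original text, trigram set) for every representative so far
    for text in descriptions:
        grams = trigrams(text)
        if not any(jaccard(grams, seen) >= threshold for _, seen in kept):
            kept.append((text, grams))
    return [text for text, _ in kept]


# Hypothetical transaction strings.
sample = [
    "STARBUCKS STORE 1234 SEATTLE",
    "STARBUCKS STORE 1235 SEATTLE",
    "SHELL OIL 5550 AUSTIN",
]
unique = drop_near_duplicates(sample)
```

This compares every new string against every kept representative, so it is quadratic in the worst case. That is probably fine for a tagging sample but would need something like blocking or hashing for the full data set.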
Definition of done: