Research: Similarity Metric for Sampling

redpanda-ai / Meerkat

Used for the Meerkat project

Research: Similarity Metric for Sampling #659

Open vnagarajY opened 8 years ago

vnagarajY commented 8 years ago

Research whether we can use scientific grouping methods to eliminate near-duplicate transactions from the tagging data, so that tagging can cover a larger sample set. This may also help us get to a longer merchant list faster.

Example: https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

Definition of done:

Try different metrics for calculating similarity on merchant data
Create a wiki page about the research results
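
As a starting point for the research, here is a minimal sketch of what near-duplicate elimination could look like: a greedy filter that keeps a transaction only if it is not too similar to anything already kept. The function names, the sample strings, and the 0.85 threshold are all hypothetical, and `difflib.SequenceMatcher` is only a stand-in metric from the standard library, not something Meerkat is known to use.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def ratio_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_near_duplicates(
    transactions: List[str],
    similarity: Callable[[str, str], float] = ratio_similarity,
    threshold: float = 0.85,  # hypothetical cutoff; would need tuning on real data
) -> List[str]:
    """Keep a transaction only if it is below the threshold against every kept one."""
    kept: List[str] = []
    for text in transactions:
        if all(similarity(text, seen) < threshold for seen in kept):
            kept.append(text)
    return kept


# Made-up transaction descriptions for illustration only
sample = [
    "STARBUCKS STORE 1234 SEATTLE WA",
    "STARBUCKS STORE 1235 SEATTLE WA",
    "SHELL OIL 5678 HOUSTON TX",
]
deduplicated = drop_near_duplicates(sample)
```

The `similarity` argument is pluggable so that different metrics can be compared on the same data. Note that this greedy pass is quadratic in the number of kept transactions, so it would likely need blocking or indexing on a large sample.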

OscarDPan commented 8 years ago

Is this a 2.1 candidate?

speakerjohnash commented 8 years ago

The key might not be which metric to use, but rather which representation to use. I think cosine similarity on the reshape layer of the CNN might be a really good approach. I think the raw representation might be too affected by minor character changes.
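
To make the idea concrete, here is a rough sketch of cosine similarity applied to dense representation vectors. The three vectors below are invented placeholders standing in for whatever the CNN's reshape layer would output; extracting real activations from the model is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings (placeholders for reshape-layer activations)
emb_a = [0.9, 0.1, 0.4]
emb_b = [0.8, 0.2, 0.5]  # assumed to come from a near-duplicate of emb_a's input
emb_c = [0.0, 1.0, 0.0]  # assumed to come from an unrelated transaction

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
```

If the learned representation behaves as hoped, near-duplicate transactions should land close together (high `sim_ab`) even when their raw strings differ by a few characters.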

OscarDPan commented 8 years ago

Also check Jaccard distance: https://en.wikipedia.org/wiki/Jaccard_index http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html

Using this, you can perhaps avoid the one-hot encoding used in cosine similarity.
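
A minimal sketch of that idea, assuming transactions are compared as sets of character trigrams: the Jaccard index is computed directly on the sets, so no vector encoding is needed. The helper names and the choice of n = 3 are assumptions for illustration, and this uses plain Python sets rather than the scikit-learn function linked above.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Made-up transaction descriptions for illustration only
near = jaccard_similarity("STARBUCKS STORE 1234", "STARBUCKS STORE 1235")
far = jaccard_similarity("STARBUCKS STORE 1234", "SHELL OIL 5678")
```

The Jaccard distance is simply `1 - jaccard_similarity(a, b)`. Because n-grams overlap, a single changed character only alters a few of them, which should make this somewhat more tolerant of minor edits than exact string matching.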