sjyk / sampleclean-async

http://sampleclean.org
Apache License 2.0

Using Multiple Similarity Metrics and Features for SVM and RandomForest Model #64

Open wu-s-john opened 8 years ago

wu-s-john commented 8 years ago

Hi,

I was reading this documentation (http://sampleclean.org/guide/) and I see that you can use any similarity metric to find the similarity between two strings on one column attribute. Can you use multiple similarity metrics to find the similarity between two strings rather than one? If so, how can you include multiple similarity metrics?

Also, what is the matrix that is fed into SVM and RandomForest? What are the columns of this matrix? Are the values the outputs of different string metrics?

sjyk commented 8 years ago

Hi, the API exposed in the guide is a subset of what is possible. See the Scala docs, especially http://sampleclean.org/api/#sampleclean.clean.featurize.AnnotatedSimilarityFeaturizer and http://sampleclean.org/api/#sampleclean.clean.featurize.Featurizer.

You can define metrics between a set of strings and use the included libraries for similarity. However, there is no guarantee that our internal optimizations, such as prefix filtering, will hold for a custom metric.
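To illustrate why such an optimization is tied to the metric, here is a minimal sketch of textbook prefix filtering for Jaccard similarity over token sets. It is not SampleClean's implementation; the function names (`prefix`, `similar_pairs`, `brute_force_pairs`) are hypothetical. The pruning step relies on a bound derived from set overlap, and an arbitrary custom metric (Jaro, for example) gives no such bound, so the same filter could silently drop true matches.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens: Set[str], rank: Dict[str, int], threshold: float) -> Set[str]:
    """First n - ceil(t * n) + 1 tokens under a fixed global ordering.

    If Jaccard(x, y) >= t, then x and y share at least ceil(t * |x|) tokens,
    so their prefixes must have a token in common.
    """
    ordered = sorted(tokens, key=lambda tok: rank[tok])
    n = len(ordered)
    # Small epsilon guards against floating-point error pushing ceil() up by one.
    keep = n - math.ceil(threshold * n - 1e-9) + 1
    return set(ordered[:keep])


def similar_pairs(records: Sequence[Set[str]], threshold: float) -> List[Tuple[int, int]]:
    """Index pairs with Jaccard >= threshold, pruned by prefix filtering.

    Assumes every record is a non-empty token set.
    """
    freq = Counter(tok for rec in records for tok in rec)
    # Global ordering: rarest tokens first, ties broken alphabetically.
    ordering = sorted(freq, key=lambda tok: (freq[tok], tok))
    rank = {tok: i for i, tok in enumerate(ordering)}
    prefixes = [prefix(rec, rank, threshold) for rec in records]
    result = []
    for i, j in combinations(range(len(records)), 2):
        # Cheap filter: skip pairs whose prefixes do not overlap.
        if prefixes[i] & prefixes[j]:
            # Exact verification on the surviving candidates.
            if jaccard(records[i], records[j]) >= threshold:
                result.append((i, j))
    return result


def brute_force_pairs(records: Sequence[Set[str]], threshold: float) -> List[Tuple[int, int]]:
    """Reference answer without any filtering."""
    return [(i, j) for i, j in combinations(range(len(records)), 2)
            if jaccard(records[i], records[j]) >= threshold]
```

For clarity this sketch still loops over all pairs and only skips the exact comparison; a real system would typically use an inverted index over prefix tokens so that non-candidate pairs are never enumerated.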

Deduplication learns a discriminative model over feature vectors, where each vector represents the similarities between a pair of strings. For N records there are on the order of N^2 pairs, so only a subset L of those pairs is labeled. The features are an ensemble of similarity metrics comparing the two strings. This is flexible as well, and you are free to write your own featurizer. The Active Learning should be agnostic to the choice of featurization.
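As a rough illustration of that idea, the sketch below builds such a feature matrix in plain Python: each row is one candidate pair of strings and each column is one similarity metric. It does not use the SampleClean API; `featurize`, `featurize_pairs` and the three metric functions are hypothetical names, and the normalizations (dividing by the longer string's length) are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[str, str], float]


def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def edit_similarity(a: str, b: str) -> float:
    """1 - Levenshtein distance / length of the longer string."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def lcs_similarity(a: str, b: str) -> float:
    """Longest common subsequence length / length of the longer string."""
    if not a and not b:
        return 1.0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


DEFAULT_METRICS: Tuple[Metric, ...] = (jaro_similarity, edit_similarity, lcs_similarity)


def featurize(a: str, b: str, metrics: Sequence[Metric] = DEFAULT_METRICS) -> List[float]:
    """Map one pair of strings to a vector in R^d, one entry per metric."""
    return [metric(a, b) for metric in metrics]


def featurize_pairs(pairs: Sequence[Tuple[str, str]],
                    metrics: Sequence[Metric] = DEFAULT_METRICS) -> List[List[float]]:
    """Feature matrix: rows are candidate pairs, columns are metrics."""
    return [featurize(a, b, metrics) for a, b in pairs]
```

The resulting matrix, together with match / non-match labels for the labeled subset of pairs, is the kind of input a classifier such as an SVM or a random forest would train on. Adding a metric only appends a column, which is why the learning step does not need to know which metrics were used.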

wu-s-john commented 8 years ago

Thank you for the response. I see in the API that it returns a list of R^d elements. By default, if I feed the system a list of strings coming from one column or attribute, would it use an ensemble of string metrics? If so, what are the metrics? Also, can you show a brief example of using AnnotatedSimilarityFeaturizer and Featurizer, and of how I can plug my own similarity metrics into them? Specifically, if I have the metrics Jaro distance, edit distance, and LCS, how would I use these abstract classes to make my own class?