skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

Link to kernel methods #100

Closed amueller closed 2 years ago

amueller commented 5 years ago

I feel like there is a very strong link to kernel methods here. You're basically using a string kernel on the dirty categories. If you did PCA of the similarity matrix, this would be nearly the same as a Nystroem embedding on this kernel. Have you looked at the related literature from that angle?

GaelVaroquaux commented 5 years ago

Ha! That's a good remark.

There is a link to string kernels, but it's not as trivial as doing an expansion of a string kernel, because similarity encoding takes into account the "shape" of the categories. The corresponding kernel does not boil down to a string kernel. If you look at the paper (https://hal.inria.fr/hal-01806175), equation 12 gives the expression of the kernel:

⟨d_i, d_j⟩_sim = Σ_l sim(d_i, d_l) · sim(d_j, d_l)

Where "sim" is the similarity used by the similarity encoding, d_i and d_j are two instances, and the sum goes over the reference categories.

Rather than a Nystrom expansion of the kernel, this is an expansion over prototypes, without whitening. Basically, it captures the "covariance" of the strings in the training set.
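As a quick illustration, here is a small numpy sketch of that expansion over prototypes. The `ngram_sim` similarity and the example categories are hypothetical stand-ins (not the paper's exact measure); the point is only that the dot product of two similarity-encoded vectors reproduces equation 12:

```python
import numpy as np

def ngram_sim(a, b, n=3):
    # Hypothetical n-gram Jaccard similarity between two strings,
    # standing in for the string similarity used by similarity encoding.
    ga = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    gb = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return len(ga & gb) / len(ga | gb)

# Reference categories (prototypes) and two instances (illustrative)
prototypes = ["police officer", "police sergeant", "fire fighter"]
d_i, d_j = "police officer ii", "fire fighter/emt"

# Similarity encoding: expand each instance over the prototypes
enc_i = np.array([ngram_sim(d_i, d_l) for d_l in prototypes])
enc_j = np.array([ngram_sim(d_j, d_l) for d_l in prototypes])

# Eq. 12: <d_i, d_j>_sim = sum_l sim(d_i, d_l) * sim(d_j, d_l)
kernel = sum(ngram_sim(d_i, d_l) * ngram_sim(d_j, d_l)
             for d_l in prototypes)
assert np.isclose(enc_i @ enc_j, kernel)
```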

However, you have a good intuition that following this lead can help understanding what similarity encoding is doing. We have been doing this lately, and it has enabled us to come up with a very different formulation that is faster, in the sense that it does not have to compute costly string similarities or sum over a large number of prototypes, and that can be implemented online. We're submitting this to KDD in a week, and the code will follow online soon.

To be fair, validating these things is a lot of work (and the new submission will have twice as many datasets!).

Happy to keep you posted, and maybe to discuss this IRL.

amueller commented 5 years ago

I only thought about this on the subway but the difference to Nystroem is a square root or something like that, right? Looking forward to your new paper!

GaelVaroquaux commented 5 years ago

> I only thought about this on the subway but the difference to Nystroem is a square root or something like that, right?

Yes indeed.
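A minimal numpy sketch of that distinction, with randomly generated matrices standing in for real string similarities (all names here are illustrative): Nystroem whitens the sample-to-prototype similarities by the inverse square root of the prototype-to-prototype similarity matrix, whereas similarity encoding uses the raw similarities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical similarity matrices: S is (n samples x m prototypes),
# K_mm is the prototype-to-prototype similarity (made PSD here).
m = 4
A = rng.standard_normal((m, m))
K_mm = A @ A.T + m * np.eye(m)   # well-conditioned PSD matrix
S = rng.random((6, m))           # sample-to-prototype similarities

# Similarity encoding: features are the raw similarities, Gram = S S^T
gram_sim_encoding = S @ S.T

# Nystroem: features whitened by K_mm^{-1/2}, so Gram = S K_mm^{-1} S^T
w, V = np.linalg.eigh(K_mm)
K_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
phi = S @ K_inv_sqrt
gram_nystroem = phi @ phi.T

# The two differ exactly by that inverse-square-root whitening
assert np.allclose(gram_nystroem, S @ np.linalg.inv(K_mm) @ S.T)
```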

> Looking forward to your new paper!

It's actually starting from very different hypotheses. So I am not sure that we will frame this as an expansion of a string similarity. But we do establish the link to string similarities.

OK, we need to finish writing it :D

GaelVaroquaux commented 2 years ago

I'm closing this issue, as I feel that there is no action needed on our side.

For the record, the paper was never published: we submitted something about a related Fisher kernel to a few conferences, and the reviewers never liked it. We eventually decided that it was not very interesting :).