I feel like there is a very strong link to kernel methods here. You're basically using a string kernel on the dirty categories. If you did a PCA of the similarity matrix, this would be nearly the same as a Nystroem embedding of this kernel. Have you looked at the related literature from that angle?
Ha! That's a good remark.
There is a link to string kernels, but it's not as trivial as an expansion of a string kernel, because similarity encoding takes into account the "shape" of the categories. The corresponding kernel does not boil down to a string kernel. If you look at the paper (https://hal.inria.fr/hal-01806175), equation 12 gives the expression of the kernel:
⟨d_i, d_j⟩_sim = Σ_l sim(d_i, d_l) · sim(d_j, d_l)

where "sim" is the similarity used by the similarity encoding, d_i and d_j are two instances, and the sum runs over the reference categories d_l.
Rather than a Nystroem expansion of the kernel, this is an expansion over prototypes, without whitening. Basically, it captures the "covariance" of the strings in the training set.
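For concreteness, here is a minimal sketch of that expansion. Character 3-gram Jaccard similarity stands in for the paper's string similarity, and the prototype/instance strings are made up for illustration:

```python
# Similarity encoding as an explicit feature expansion over prototypes:
# each instance is mapped to its vector of similarities to the reference
# categories, so <d_i, d_j>_sim = sum_l sim(d_i, d_l) * sim(d_j, d_l)
# is just a dot product of two encoded rows.
import numpy as np

def ngrams(s, n=3):
    """Set of character n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sim(a, b):
    """Jaccard similarity between character 3-gram sets."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

prototypes = ["accountant", "account manager", "police officer"]  # reference categories
instances = ["senior accountant", "police oficer"]                # dirty inputs

# Feature map phi: one column of similarities per prototype.
phi = np.array([[sim(d, p) for p in prototypes] for d in instances])

# The implied kernel matrix: plain dot products, no whitening.
K = phi @ phi.T
print(phi)
print(K)
```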
However, you have a good intuition that following this lead can help understand what similarity encoding is doing. We have been doing this lately, and it has enabled us to come up with a very different formulation that is faster, in the sense that it does not have to compute costly string similarities or sum over a large number of prototypes, and that can be implemented online. We're submitting this to KDD in a week, and the code will follow online soon.
To be fair, validating these things is a lot of work (and the new submission will have twice as many datasets!).
Happy to keep you posted, and maybe to discuss this IRL.
I only thought about this on the subway, but the difference to Nystroem is a square root or something like that, right? Looking forward to your new paper!
> I only thought about this on the subway, but the difference to Nystroem is a square root or something like that, right?
Yes indeed.
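Concretely, the difference is the K_mm^(-1/2) whitening step. A sketch, reusing `sim`, `prototypes`, and `instances` from the snippet above:

```python
# Nystroem whitens the similarities to the prototypes by K_mm^{-1/2};
# similarity encoding uses the raw similarities K_nm directly.
import numpy as np
from scipy.linalg import pinv, sqrtm

K_nm = np.array([[sim(d, p) for p in prototypes] for d in instances])
K_mm = np.array([[sim(p, q) for q in prototypes] for p in prototypes])

phi_simenc = K_nm                            # similarity encoding
phi_nys = np.real(K_nm @ pinv(sqrtm(K_mm)))  # Nystroem embedding

# phi_nys @ phi_nys.T approximates the kernel matrix itself, whereas
# phi_simenc @ phi_simenc.T is K_nm @ K_nm.T -- the "covariance"
# form of equation 12.
```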
> Looking forward to your new paper!
It actually starts from very different hypotheses, so I am not sure that we will frame it as an expansion of a string similarity. But we do establish the link to string similarities.
OK, we need to finish writing it :D
I'm closing this issue, as I feel that there is no action needed on our side.
For the record, the paper was never published: we submitted something about a related Fisher kernel to a few conferences, and the reviewers never liked it. We eventually decided that it was not very interesting :).