maxentile / msm-learn

Learning how to learn Markov State Models of conformational dynamics
MIT License

Metric learning #12

Open maxentile opened 9 years ago

maxentile commented 9 years ago

Kinetic discriminatory metric learning: http://pubs.acs.org/doi/abs/10.1021/ct400132h

Questions:

maxentile commented 9 years ago

"Large-margin component analysis" -- combining dimensionality reduction and metric learning-- http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_791.pdf

maxentile commented 9 years ago

Distance metric learning: a comprehensive survey: http://vipl.ict.ac.cn/homepage/ACCV14Metric/ref/1_Distance%20metric%20learning%20a%20comprehensive%20survey_TR2006.pdf

maxentile commented 9 years ago

Metric learning: A survey: http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf

maxentile commented 9 years ago

Nonlinear metric learning: http://www1.cse.wustl.edu/~kilian/papers/chilmnn.pdf

maxentile commented 9 years ago

Other ideas for follow-on work:

maxentile commented 9 years ago

Bayesian active distance metric learning: http://www.cs.cmu.edu/~liuy/uai2007_bayesian.pdf

Given two arbitrary points in configuration space, the "kinetic distance" between them can be approximately determined, but at some cost. We effectively want to learn a good kinetic distance metric over all of configuration space. Seems like it could be a good fit for this approach...

maxentile commented 9 years ago

Metric learning for kernel regression: http://jmlr.org/proceedings/papers/v2/weinberger07a/weinberger07a.pdf (e.g. regressing on time lag)

maxentile commented 9 years ago

Kernel regression with sparse metric learning: http://www.cst.ecnu.edu.cn/~slsun/pubs/KRSML.pdf

maxentile commented 9 years ago

Large-margin multitask metric learning: http://papers.nips.cc/paper/3935-large-margin-multi-task-metric-learning.pdf

maxentile commented 9 years ago

Idea: "Tree-preserving metric learning"

In LMNN, we consider triplets (i,j,k) of points where (i,j) belong to the same class and k belongs to a different class, and we want to learn a metric that places pairs (i,j) closer together than pairs (i,k), plus a constant margin. We could modify this approach to accept hierarchical cluster assignments instead of class labelings.
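
For concreteness, here's a minimal numpy sketch of the standard LMNN-style triplet hinge loss this modification would start from (illustrative code, not from this repo; the names mahalanobis_sq, triplet_hinge_loss, and the margin parameter are mine):

import numpy as np

def mahalanobis_sq(x, y, M):
    # squared Mahalanobis distance (x - y)^T M (x - y)
    d = x - y
    return d.dot(M).dot(d)

def triplet_hinge_loss(X, triplets, M, margin=1.0):
    # sum of hinge violations over triplets (i, j, k):
    # i and j (same class) should be closer than i and k (different class)
    # by at least `margin` under the metric defined by the psd matrix M
    loss = 0.0
    for i, j, k in triplets:
        d_ij = mahalanobis_sq(X[i], X[j], M)
        d_ik = mahalanobis_sq(X[i], X[k], M)
        loss += max(0.0, margin + d_ij - d_ik)
    return loss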

maxentile commented 9 years ago

For LMNN, is there a way to supply an arbitrary loss function, rather than 0/1 loss? E.g. if we have 3 classes A,B,C, can I say that mistakenly putting A close to B is worse than putting A close to C? If so, that would be nice: "tree-preserving metric learning" would just be a special case where the loss function is "cophenetic distance" or something.

Tactic: one approach might be simply to replace the "1" in the LMNN SDP constraint -- (x_i - x_k)^T M (x_i - x_k) - (x_i - x_j)^T M (x_i - x_j) >= 1 - xi_ijk -- with the desired margin for each triple (ref: http://www.cse.wustl.edu/~kilian/papers/spm2010.pdf p. 4).
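
Reusing mahalanobis_sq from the sketch above, the per-triple-margin version would only change that constant (again just an illustrative sketch; feeding in cophenetic distances as margins is a hypothetical choice):

def triplet_hinge_loss_variable_margin(X, triplets, margins, M):
    # margins[t] is the desired separation for triplet t, e.g. a cophenetic
    # distance between the clusters containing j and k
    loss = 0.0
    for (i, j, k), m in zip(triplets, margins):
        d_ij = mahalanobis_sq(X[i], X[j], M)
        d_ik = mahalanobis_sq(X[i], X[k], M)
        loss += max(0.0, m + d_ij - d_ik)
    return loss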

maxentile commented 9 years ago

There would be several cool applications for solving this general problem:

1. Kinetic discriminatory metric learning (http://pubs.acs.org/doi/abs/10.1021/ct400132h): given a time-ordered sequence of feature vectors, find a distance metric on the feature vectors that preserves "kinetic distance," i.e. observations that are close in time should be close according to the learned metric. Previous approaches used a 0/1 loss at fixed lag times, but we could presumably do better if errors were weighted by the magnitude of the time lag.
2. "Tree-preserving metric learning": given a hierarchical cluster tree, find a distance metric such that points far apart in the tree are far apart in the metric.

maxentile commented 9 years ago

Could apply this to learning the weight matrix M of what I'll call the "weighted Binet-Cauchy kernel"

The Binet-Cauchy kernel is:

from numpy import sqrt,eye
from numpy.linalg import det

def BC(X, Y):
    # Binet-Cauchy kernel between two matrices with the same number of rows
    return det(X.T.dot(Y)) / sqrt(det(X.T.dot(X)) * det(Y.T.dot(Y)))

where X and Y are matrices.

When X and Y are _n_ x 3 matrices containing residue coordinates, this computes something similar to RMSD, but it has a few advantages: it's faster, simpler, and definitely a kernel. (RMSD is probably also a kernel, but I'm not sure.)

We can weight pairs of residues using a matrix M:

def weighted_BC(X, Y, M=None):
    # M must be n x n, where n = len(X) = len(Y); the default identity
    # recovers the unweighted Binet-Cauchy kernel above
    if M is None:
        M = eye(len(X))

    return det(X.T.dot(M).dot(Y)) / sqrt(det(X.T.dot(M).dot(X)) * det(Y.T.dot(M).dot(Y)))

Question: what constraints must M satisfy for weighted_BC to be a valid kernel? Guess: as long as M is psd, it should be okay?
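
Not a proof, but the psd guess is easy to poke at numerically: build a random psd M, form the Gram matrix of weighted_BC over random n x 3 "conformations" (using the definition and imports above), and check that its smallest eigenvalue isn't meaningfully negative. The setup below is just a throwaway sanity check:

import numpy as np
from numpy.linalg import eigvalsh

n, n_samples = 10, 20
A = np.random.randn(n, n)
M = A.dot(A.T)                                       # random psd weight matrix
confs = [np.random.randn(n, 3) for _ in range(n_samples)]

G = np.array([[weighted_BC(X, Y, M) for Y in confs] for X in confs])
print("smallest Gram eigenvalue:", eigvalsh(G).min())  # should be >= ~ -1e-10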

maxentile commented 9 years ago

Neighborhood component analysis: http://papers.nips.cc/paper/2566-neighbourhood-components-analysis.pdf -- see section 5 for extensions to continuous labels (e.g. time-lags)

maxentile commented 9 years ago

For fast tinkering I'm just writing down objective functions and then using autograd to fit them by gradient descent, rather than committing to a fancier learning technique too early.
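
Roughly what that workflow looks like, with a made-up objective (a triplet hinge loss with M parametrized as L^T L so it stays psd) -- a sketch of the approach, not the actual code in this repo:

import autograd.numpy as np
from autograd import grad

def objective(L, X, triplets, margin=1.0):
    # parametrize M = L^T L so M is always psd, with no explicit constraint
    M = np.dot(L.T, L)
    loss = 0.0
    for i, j, k in triplets:
        d_ij = np.dot(X[i] - X[j], np.dot(M, X[i] - X[j]))
        d_ik = np.dot(X[i] - X[k], np.dot(M, X[i] - X[k]))
        loss = loss + np.maximum(0.0, margin + d_ij - d_ik)
    return loss

grad_objective = grad(objective)       # gradient w.r.t. L via autograd

# plain gradient descent on a toy problem
X = np.random.randn(100, 5)
triplets = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
L = np.eye(5)
for _ in range(200):
    L = L - 1e-3 * grad_objective(L, X, triplets)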