psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

machine learning for automatic heuristics #193

Open matsen opened 8 years ago

matsen commented 8 years ago

Could we use machine learning to automatically generate heuristics for clustering? We have a "gold standard" now with full partis, and the goal of more approximate clustering should be to replicate that. Thus, how about having the computers do the work of figuring out how to do that best?

I'll bet that we could just throw in

into a random forest classifier and have it pop out a nice predictor of whether two sequences belong in the same clonal family, one that doesn't require any expensive operations. Then we could use those classifiers for more clustering.
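To make the setup concrete, here is a minimal sketch of how labelled training pairs could be built from a gold-standard partition. It is not partis code: the two features (`hamming_fraction` and a same-length flag), the function names, and the toy sequences are all assumptions for illustration, and the actual feature list is still to be decided. The output is a feature matrix `X` and labels `y` that any off-the-shelf classifier could consume.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions over the shorter length (no alignment needed)."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n


def pair_features(a, b):
    # Hypothetical cheap features, chosen only for illustration.
    return [hamming_fraction(a, b), float(len(a) == len(b))]


def training_pairs(seqs, partition):
    """Build (X, y) from a gold-standard partition.

    seqs: dict mapping sequence id -> nucleotide string
    partition: list of clusters, each a list of sequence ids
    y is 1 if the pair shares a cluster in the partition, else 0.
    """
    cluster_of = {uid: i for i, cluster in enumerate(partition) for uid in cluster}
    X, y = [], []
    for u, v in combinations(sorted(seqs), 2):
        X.append(pair_features(seqs[u], seqs[v]))
        y.append(int(cluster_of[u] == cluster_of[v]))
    return X, y


# Toy data standing in for a full-partis partition.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG", "d": "TTTTGGGC"}
partition = [["a", "b"], ["c", "d"]]
X, y = training_pairs(seqs, partition)
```

The number of pairs grows quadratically with the number of sequences, so on real data the pairs would presumably have to be subsampled.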

I specifically don't want this to be something that we run once and then get parameter values which are frozen for the rest of time. Rather, I hope it could be a partis command to generate these heuristics.

psathyrella commented 8 years ago

Yeah, sounds good.

I suppose I won't use the ROOT implementation that I know and love/hate...

related: #171 and #176

matsen commented 8 years ago

Thinking about it more, logistic regression would probably be a better fit than RF. But perhaps we should just try both.
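For reference, a logistic regression on pairwise features is small enough to sketch with the standard library alone. The code below is a plain batch gradient descent fit, not a tuned implementation, and the single feature (a fractional Hamming distance) and the toy labels are assumptions. In practice a library implementation would be the natural choice, which would also make it easy to try a random forest on the same data.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Estimated probability that the pair described by x is in the same family."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: one hypothetical feature per pair (fractional Hamming distance);
# label 1 means the gold standard puts the pair in the same clonal family.
X = [[0.05], [0.10], [0.15], [0.60], [0.70], [0.80]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
```

One reason to prefer this over RF: the fitted weights are directly readable, so the resulting heuristic amounts to a threshold on a weighted sum of cheap features.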

matsen commented 8 years ago

Vladimir points out that there are methods for doing feature selection and clustering at the same time.