Closed tiavlovskiegor24 closed 7 years ago
@tiavlovskiegor24 I realized that we might be able to use the feature selection methods in scikit-learn to score the features. The function that does this for regression-like data is mutual_info_regression. I've been trying to run it over the complete data set, but my 8 GB of RAM is not enough :_( and running it on a subset seems pointless, since the scores need all the examples to be computed correctly.
What do you think?
P.S. Some documentation: "Mutual information" (Wikipedia) and "Mutual Information between Discrete and Continuous Data Sets".
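For reference, a minimal sketch of what scoring features with mutual_info_regression could look like. The data here is synthetic and purely illustrative: the binary matrix `X` stands in for mutation indicators, and the "brightness" `y` is made up so that only features 0 and 3 matter. None of these names or numbers come from our actual data set.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in data (assumption): 2000 mutants, 20 binary mutation flags.
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 20
X = rng.integers(0, 2, size=(n_samples, n_features))

# Hypothetical brightness: only features 0 and 3 have an effect, plus small noise.
y = 1.0 - 0.6 * X[:, 0] - 0.3 * X[:, 3] + rng.normal(0.0, 0.05, size=n_samples)

# discrete_features=True tells the estimator that every column is categorical.
scores = mutual_info_regression(X, y, discrete_features=True, random_state=0)

# Feature indices sorted from most to least informative.
ranking = np.argsort(scores)[::-1]
print(ranking[:5], scores[ranking[:5]])
```

The scores are non-negative estimates of mutual information; higher means the feature tells us more about brightness on its own. Note that this scores each feature individually, so it says nothing about interactions between mutations.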
I would also suggest simplifying our problem from regression to classification. We simply take the brightness of the original protein as a benchmark and give label 1 to the mutants whose brightness is no less than 95% (or any other selected percentage) of the benchmark brightness, and 0 otherwise.
So in this way we will learn to identify individual mutations which are not expected to decrease the brightness too much relative to the original. It seems much easier and more robust than predicting the actual brightness of an individual mutation. And with classification it is much easier to measure performance (accuracy, F1, recall, etc.), and we can choose the operating point of the classifier or make it give us its probability bets on each mutation.
With regression there is only the least-squares measure, which can be very sensitive to noise.
What do you think?
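A rough sketch of the labelling and evaluation described above, under loud assumptions: the data is synthetic, `benchmark` is a made-up brightness for the original protein, and logistic regression is just a placeholder classifier, not a choice anyone in this thread has settled on.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic mutants (assumption): 30 positions, each mutated with probability 0.1.
X = (rng.random((3000, 30)) < 0.1).astype(int)

# Hypothetical additive effects on brightness, plus measurement noise.
benchmark = 1.0
effects = rng.normal(-0.02, 0.03, size=30)
brightness = benchmark + X @ effects + rng.normal(0.0, 0.02, size=3000)

# Label 1 if brightness is at least 95% of the benchmark, 0 otherwise.
y = (brightness >= 0.95 * benchmark).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)
rec = recall_score(y_test, pred)

# Probability that each test mutant stays above the threshold.
proba = clf.predict_proba(X_test)[:, 1]
print(f"accuracy={acc:.3f} f1={f1:.3f} recall={rec:.3f}")
```

The `proba` values are what would let us move the operating point: instead of the default 0.5 cut-off we could, for example, only accept mutants the classifier is very sure about.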
sounds good to me!
I like the idea. But, unfortunately, 'mutations which are not expected to decrease the brightness too much' wouldn't work. Introducing these mutations simultaneously (which is the aim) will most probably have a cumulative large negative effect that will decrease the brightness substantially :(
Well, we can identify beneficial single mutations, pairs, triplets, or anything really using the approach above. I just used single mutations as an example because this is the least we have to do with our model.
Essentially my suggestion boils down to predicting whether a supplied mutant (which can have single or multiple mutations) is expected to have brightness above a certain threshold (which we set as we wish and we can also run different thresholds simultaneously). Trying to predict exact brightness instead would be a hell of a mess from what I reckon at the moment.
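To illustrate running several thresholds at once, here is a small sketch. The helper name `threshold_labels`, the benchmark value, and the example brightness values are all hypothetical.

```python
import numpy as np

def threshold_labels(brightness, benchmark, fractions):
    """Return an (n_mutants, n_thresholds) 0/1 matrix.

    Column j is 1 where brightness >= fractions[j] * benchmark.
    """
    brightness = np.asarray(brightness, dtype=float)
    cutoffs = benchmark * np.asarray(fractions, dtype=float)
    # Broadcasting compares every mutant against every cutoff in one step.
    return (brightness[:, None] >= cutoffs[None, :]).astype(int)

# Four made-up mutants, three thresholds (95%, 90%, 50% of benchmark).
labels = threshold_labels(
    [1.02, 0.96, 0.91, 0.40], benchmark=1.0, fractions=[0.95, 0.90, 0.50]
)
print(labels)
```

Each column can then be used as the target of its own classifier, or all columns can be handed to a model that supports multi-label targets.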
@tiavlovskiegor24 Ok, sounds good then! I am currently running a neural net on AWS and extracting the weights. The configuration I am using previously served well in predicting the folding energy of the protein, so I believe it should do a good job predicting the weights. But if the approach you suggest gives similar results, then we should most definitely stick to it. I suspect deep learning may be overkill for this task.
So far, running decision trees or SVMs hasn't yielded any significant results, and I don't quite know how an SVM behaves when it has to fit a regression to a set of binary features... I tried reformulating our problem as classification (1 if the mutant's brightness is within 3 std of the benchmark brightness of the original protein, 0 otherwise) and got about 80% prediction accuracy. But I still need to check whether it produces a sensible decision boundary.
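One cheap sanity check for an accuracy figure like this is to compare it with a classifier that always predicts the majority class. The sketch below uses synthetic labels with roughly 80% positives, an assumed class balance chosen only for illustration; the real balance in our data may differ.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): labels are ~80% ones and unrelated to X.
X = (rng.random((2000, 30)) < 0.1).astype(int)
y = (rng.random(2000) < 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)

baseline_acc = accuracy_score(y_test, pred)
baseline_bal = balanced_accuracy_score(y_test, pred)
print(f"baseline accuracy={baseline_acc:.3f} balanced accuracy={baseline_bal:.3f}")
```

If the real classifier's accuracy is close to the baseline's, it has probably not learned a useful boundary; balanced accuracy (0.5 for the baseline here) is less easily fooled by class imbalance.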