katyaputintseva / ubird

Awesome team's project for Agile DS

Run SVM (with different kernels) on the sparse dataset. Report the results. #18

Closed tiavlovskiegor24 closed 7 years ago

tiavlovskiegor24 commented 7 years ago

So far, running decision trees or SVMs didn't yield any significant results, and I don't quite know how SVM regression behaves when it has to fit a set of binary features... I tried reformulating our problem as classification (1 if the mutant's brightness is within 3 std of our benchmark brightness value for the original protein, 0 otherwise) and got around 80% prediction accuracy. But I still need to check whether it produces a sensible decision boundary.
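For concreteness, the 3-std relabelling could be sketched like this (the benchmark and std values below are made up; the real ones would come from the original protein's measurements):

```python
import numpy as np

# Hypothetical numbers, just for illustration.
benchmark = 3.72   # brightness of the original (wild-type) protein
std = 0.15         # spread of the benchmark brightness measurement

brightness = np.array([3.80, 3.10, 3.65, 2.90, 4.05])  # toy mutant values

# Label 1 if the mutant's brightness is within 3 std of the benchmark.
labels = (np.abs(brightness - benchmark) <= 3 * std).astype(int)
```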

BielStela commented 7 years ago

@tiavlovskiegor24 I realized that maybe we can use the feature-selection methods in scikit-learn to give scores to the features. The function that does that for regression-like data is mutual_info_regression. I've been trying to run it over the complete data set, but my 8 GB of RAM are not enough :_( and running it on a row subset is nonsense, since the scores need all the examples to be computed correctly.

What do you think?

P.S.: Some documentation: Mutual information (Wikipedia); Mutual Information between Discrete and Continuous Data Sets
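One way around the RAM limit might be chunking over columns rather than rows: mutual_info_regression scores each feature independently using all examples, so splitting the feature matrix column-wise should keep the scores exact while capping memory. A rough sketch with made-up toy data (the real matrix would be our sparse binary features):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def chunked_mi_scores(X, y, chunk_size=200):
    """MI is estimated per feature over all rows, so chunking over
    columns (never rows) gives the same scores as one big call."""
    scores = []
    for start in range(0, X.shape[1], chunk_size):
        chunk = X[:, start:start + chunk_size]
        scores.append(mutual_info_regression(
            chunk, y, discrete_features=True, random_state=0))
    return np.concatenate(scores)

# Toy check: binary features, target driven by column 0 only.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, (500, 8))
y = X[:, 0] * 2.0 + rng.normal(0, 0.1, size=500)
scores = chunked_mi_scores(X, y, chunk_size=3)
```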

tiavlovskiegor24 commented 7 years ago

I would also suggest simplifying our problem from regression to classification. We simply take the brightness of the original protein as a benchmark and give label 1 to the mutants whose brightness is no less than 95% (or any other selected percentage) of the benchmark brightness, and 0 otherwise.

So in this way we will learn to identify individual mutations which are not expected to decrease the brightness too much relative to the original. It seems much easier and more robust than predicting the actual brightness of an individual mutation. And with classification it is much easier to measure performance (accuracy, F1, recall, etc.), and we can extract an operating point for the classifier or make it give us its probability bets on each mutation.

With regression we are essentially stuck with a least-squares measure, which can be very sensitive to noise.
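As a rough sketch of this proposal (the benchmark and brightness values are invented, and y_pred stands in for the output of some hypothetical classifier, just to show the metrics):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

benchmark = 3.72  # assumed wild-type brightness
brightness = np.array([3.80, 3.40, 3.65, 2.90, 4.05, 3.55])

# Label 1 if brightness is at least 95% of the benchmark, 0 otherwise.
y_true = (brightness >= 0.95 * benchmark).astype(int)

# Placeholder predictions from an imaginary classifier.
y_pred = np.array([1, 1, 1, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
```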

What do you think?

BielStela commented 7 years ago

sounds good to me!

katyaputintseva commented 7 years ago

I like the idea. But, unfortunately, 'mutations which are not expected to decrease the brightness too much' wouldn't work. Introducing these mutations simultaneously (which is the aim) will most probably have a large cumulative negative effect that decreases the brightness substantially :(

tiavlovskiegor24 commented 7 years ago

Well, we can identify beneficial single mutations, pairs, triplets, or anything really using the approach above. I just used single mutations as an example because that is the least we have to do with our model.

Essentially my suggestion boils down to predicting whether a supplied mutant (which can have single or multiple mutations) is expected to have brightness above a certain threshold (which we set as we wish; we can also run different thresholds simultaneously). Trying to predict the exact brightness instead would be a hell of a mess, from what I can tell at the moment.
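The multi-threshold labelling could look something like this (all values made up for illustration; each label set could then train its own classifier):

```python
import numpy as np

benchmark = 3.72  # assumed wild-type brightness
brightness = np.array([3.80, 3.40, 3.65, 2.90, 4.05])  # toy mutant values

# One binary label vector per threshold fraction of the benchmark.
thresholds = [0.90, 0.95, 1.00]
label_sets = {t: (brightness >= t * benchmark).astype(int)
              for t in thresholds}
```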

katyaputintseva commented 7 years ago

@tiavlovskiegor24 Ok, sounds good then! I am actually currently running a neural net on AWS and extracting the weights. The configuration I am using served well before in predicting the folding energy of the protein, so I believe it should do a good job predicting the weights here too. But if the approach you suggest gives similar results, then we should most definitely stick with it. I anticipate that using deep learning is overkill for this task.