sgibb / topdownr

R-package for the analysis of Thermo Orbitrap Fusion Top-Down Proteomic Data.
https://sgibb.github.io/topdownr/
GNU General Public License v3.0

Random Forest including new parameter chargeType #31

Closed by malenasj 7 years ago

malenasj commented 7 years ago

Hi all,

According to Pavel, the selMZ parameter was unfair because its value was based on a combination of charge and protein sequence. To address this issue, Pavel added a new parameter called 'chargeType'. Based on this updated dataset I have trained a new random forest regression; the output can be found in the attached figure.

Any comments?

RandomForestRegression_5prot_chargestate_allfeatures.pdf

pavel-shliaha commented 7 years ago

This looks reasonable, but there is still a problem (just thought of it): the criticism of protein size and protein length is the same as for selMZ. Given that we only have 5 observations, protein size and protein length reflect not only protein size but also protein sequence (because each size corresponds to a particular sequence). Could you please prepare 5 separate models, one for each protein, and plot them next to each other using only:

1) charge type
2) CID_activation
3) HCD_activation
4) ETD_activation
5) AGCtarget
6) ETDReagent target

malenasj commented 7 years ago

Yes. I did that and will post the plots later today. You don't want the parameter 'charge' to be included?

pavel-shliaha commented 7 years ago

No, charge type only, which takes low, middle and high values.

malenasj commented 7 years ago

Here is the plot for the different proteins. Any comments?

RandomForestRegression_5prot_chargestate_ProteinType.pdf

pavel-shliaha commented 7 years ago

Very interesting. It seems that for all proteins but H4, charge matters most. I have to look through the data a bit to confirm this. But otherwise it is consistent.

@malenasj @veitveit can we please implement this functionality and LSDA in R now?

malenasj commented 7 years ago

I think the RandomForest gives a great deal of information and it is easy to implement. I can have a look at the R code.

pavel-shliaha commented 7 years ago

Just had a look at the violin plots for charge states. I am not sure why charge state had such a small effect for H4 compared to the other proteins.

chargeType.pdf

@malenasj Could you please rerun the algorithm?

malenasj commented 7 years ago

I did, and I get the same result.

RandomForestRegression_5prot_chargestate_ProteinType.pdf

malenasj commented 7 years ago

Is H4 a histone?

pavel-shliaha commented 7 years ago

Yep. But if you look at the violin plots for charge states, it seems that H4 should be affected by charge as much as the other proteins.

sgibb commented 7 years ago

@malenasj if you are going to implement random forests in R you could try the ranger package. It seems to be much faster than the classic randomForest package: https://www.jstatsoft.org/index.php/jss/article/view/v077i01/v77i01.pdf
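For illustration only, the per-protein models requested earlier in the thread could be sketched with ranger roughly as follows. This is not the code that was later exchanged by e-mail. The data are synthetic, and the response column `score`, the exact column spellings, the protein labels other than H4, and the choice of permutation importance are all assumptions made for the example.

```r
library(ranger)

set.seed(1)
n <- 200  # synthetic conditions per protein (assumption)

# Only H4 is named in the thread; the other labels are placeholders.
proteins <- c("H4", paste0("protein", 2:5))

# Synthetic stand-in for the real condition table.
conditions <- do.call(rbind, lapply(proteins, function(p) {
  data.frame(
    protein = p,
    chargeType = factor(sample(c("low", "middle", "high"), n, replace = TRUE),
                        levels = c("low", "middle", "high")),
    CID_activation = runif(n),
    HCD_activation = runif(n),
    ETD_activation = runif(n),
    AGCtarget = sample(c(1e5, 5e5, 1e6), n, replace = TRUE),
    ETDReagentTarget = sample(c(1e6, 5e6, 1e7), n, replace = TRUE)
  )
}))

# Hypothetical response: driven by chargeType and ETD_activation plus noise.
conditions$score <- as.integer(conditions$chargeType) +
  2 * conditions$ETD_activation +
  rnorm(nrow(conditions), sd = 0.5)

predictors <- c("chargeType", "CID_activation", "HCD_activation",
                "ETD_activation", "AGCtarget", "ETDReagentTarget")

# Fit one regression forest per protein and return its variable importance.
fitOne <- function(d) {
  fit <- ranger(reformulate(predictors, response = "score"),
                data = d, num.trees = 500,
                importance = "permutation", seed = 1)
  importance(fit)
}

# Rows are predictors, columns are proteins.
imp <- sapply(split(conditions, conditions$protein), fitOne)
print(round(imp, 3))

# Side-by-side comparison across proteins:
# barplot(imp, beside = TRUE, legend.text = rownames(imp))
```

Restricting the formula to the six condition parameters keeps protein-specific quantities (size, length, selMZ) out of the model, which is the point of fitting each protein separately.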

malenasj commented 7 years ago

Thanks. I will have a look at this package. I don't have much experience in R, but will try to implement it next week.

sgibb commented 7 years ago

@malenasj that's fine. I would appreciate it if you could create a pull request as soon as possible. I am very happy to assist and review your code. Please don't hesitate to ask if you need help with the pull request, coding in R or whatever.

malenasj commented 7 years ago

Hi @sgibb. I have now written the R code for the random forest implementation. I don't know where in your code to make the pull request.

sgibb commented 7 years ago

@malenasj great to hear! If you like, you could fork this repository, add your file to the R directory and create a pull request. In the pull request I can review the code, add comments and so on (and also show you how to integrate it into the current state of the topdown package).

If this sounds too complicated or you won't have time for such a thing, you could send me the code by e-mail and I will integrate it.

malenasj commented 7 years ago

Thanks. I just sent you the R code by mail as I have no experience using GitHub.

pavel-shliaha commented 7 years ago

@malenasj please send me the code as well.

sgibb commented 7 years ago

@malenasj thank you for the code. I am wondering whether we really want to integrate this into our topdown package. It is a really small code snippet. If we integrate it, we add a dependency to the package (ranger) and force (or lead) the user to random forest. But in general any learning algorithm could be applied to these data.

What do you think about this?

pavel-shliaha commented 7 years ago

@sgibb I agree that we might not want to integrate it as part of the package if we do not make any modifications to the function. Perhaps we can just add this code to the vignette to illustrate how the analysis works? Then the user can easily use the package to assemble a topDown object and just take the code from an example analysis in the vignette. The important contribution by @malenasj is that she has demonstrated that the approach works.

sgibb commented 7 years ago

A vignette would be a good idea!

pavel-shliaha commented 7 years ago

This analysis is now complete.