paulgeeleher / pRRophetic

R package

numSens and numRes #1

Closed XLuHouston closed 6 years ago

XLuHouston commented 6 years ago

Hi,

How did you determine the number of sensitive and resistant samples? In your PLOS ONE paper, you mentioned that a cutpoint would be used to classify samples into dichotomous predictions.

I think this dichotomous prediction is done by the pRRopheticLogisticPredict() function, which is based on classifysample(). I read the R code and didn't notice a cutpoint-finding procedure.

Did I miss something? I hope you can give me some hints.

Thanks a lot, Lucas

paulgeeleher commented 6 years ago

Hi Lucas,

I'm not sure exactly which example you're referring to. However, the models for most of the data discussed in the Genome Biology paper, and for all examples in the PLOS ONE paper, were trained using linear ridge regression, which does not require a cutpoint to be defined.

The logistic regression procedure was used for the Erlotinib example in the original Genome Biology paper. The reason is that for highly targeted agents (like Erlotinib) the drug response data follow a skewed distribution compared with some other drugs (e.g. cytotoxic chemo), so a linear model isn't really suitable for these data. What we suggested at the time was dichotomizing these data and applying logistic regression. The number of samples to call "sensitive" was the approximate number that were responding (i.e. achieving a measurable (not extrapolated) IC50) within the drug screening window defined by GDSC (which I think is mentioned in the Genome Biology paper text), because GDSC adjusted these screening windows for each drug based on biological relevance.
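A minimal sketch of the dichotomization described above, written in Python for illustration only (it is not the package's R code). It assumes a sample counts as "responding" when its IC50 is below the maximum screened concentration; the function name, the `max_conc` argument, and the toy values are all hypothetical.

```python
def dichotomize_by_screening_window(ic50, max_conc):
    """Label samples as sensitive/resistant from IC50 values.

    Assumption: a sample "responds" when its IC50 is below the maximum
    screened concentration (i.e. it was measured, not extrapolated).
    """
    # Number of samples to call sensitive (numSens in the thread's terms).
    num_sens = sum(1 for v in ic50 if v < max_conc)
    # Rank samples from lowest to highest IC50; ties keep input order.
    order = sorted(range(len(ic50)), key=lambda i: ic50[i])
    labels = ["resistant"] * len(ic50)
    for i in order[:num_sens]:
        labels[i] = "sensitive"
    return num_sens, labels


# Toy IC50 values in micromolar; not real GDSC data.
toy_ic50 = [0.05, 12.0, 0.8, 35.0, 9.5, 2.0]
num_sens, labels = dichotomize_by_screening_window(toy_ic50, max_conc=8.0)
print(num_sens, labels)
```

The binary labels produced this way could then serve as the response variable for a logistic model. Whether the threshold should be exactly the maximum concentration is a judgment call that the thread leaves open.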

Generally speaking, I don't think the logistic model is a great solution to this problem. That said, in the Genome Biology paper, I think it was worth proposing a solution to the problem that arises from the very different distributions of the different drugs. I'm just not sure that the solution we proposed was optimal, and I would certainly say there is room for improvement. It's also something I would like to work on, but so far I haven't been able to make the time.

Best wishes,

Paul

-- Paul Geeleher KCBD, Room 3220 Department of Medicine The University of Chicago 900 E. 57th St., Chicago, IL 60637 USA

XLuHouston commented 6 years ago

Thanks for that.

What I am referring to is your paper published in PLOS ONE in 2014, titled "pRRophetic: An R Package for Prediction of Clinical Chemotherapeutic Response from Tumor Gene Expression Levels".

In the Results and Discussion section, under "Predicting clinical chemotherapeutic response", you mentioned: "Clinically, it is common practice to report dichotomous predictions, for example classifying patients as either 'sensitive' or 'resistant' to a drug. Hence, we have included functions that estimate a cutpoint using the mean IC50 value in the training data, thus segmenting patients into two groups based on their predicted drug sensitivity". I am confused about this cutpoint because I haven't seen the procedure in your R code.

By the way, you said GDSC determined the cutpoint-like value based on biological relevance. Is that the "max concentration", which separates the points into red (Res) and green (Sen) groups?

Best, Lucas

paulgeeleher commented 6 years ago

Hi Lucas,

Oh, okay, I think I know what you're referring to now (you can basically disregard my first answer). The bold text is referring to the "getPPV()" function. Essentially, one of the peer reviewers asked us to calculate PPV and NPV values; in order to do this we needed to convert the continuous values output by the linear ridge regression to dichotomous values. We did this by separating the clinical predictions at the mean value in the training set, as this was the simplest approach we could think of (implemented in getPPV(); there's a variable in there called "cutpoint"). Generally though, I don't really think there's much reason to dichotomize these values in this way (as I said, it was included at the request of a reviewer). I've also just realized that this getPPV() function isn't included in this repository (I guess because it was added during the review). However, you can find it with the updated pRRophetic2 code: https://github.com/paulgeeleher/pRRophetic2/blob/master/pRRophetic/R/predict_from_cgp.R
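A minimal sketch of the mean-cutpoint idea described above, in Python for illustration only. It is not a port of getPPV(): the function name, argument names, and toy numbers are hypothetical, and it assumes that lower predicted values mean greater sensitivity.

```python
from statistics import mean


def ppv_npv_at_mean_cutpoint(train_ic50, predicted, truly_sensitive):
    """Dichotomize predictions at the training-set mean, then compute PPV/NPV.

    Assumption: lower predicted values mean greater sensitivity, so a
    prediction below the cutpoint is called "sensitive".
    """
    cutpoint = mean(train_ic50)
    called = [p < cutpoint for p in predicted]
    pairs = list(zip(called, truly_sensitive))

    tp = sum(1 for c, t in pairs if c and t)          # called sensitive, is sensitive
    fp = sum(1 for c, t in pairs if c and not t)      # called sensitive, is resistant
    tn = sum(1 for c, t in pairs if not c and not t)  # called resistant, is resistant
    fn = sum(1 for c, t in pairs if not c and t)      # called resistant, is sensitive

    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return cutpoint, ppv, npv


# Toy values only; not taken from any dataset.
train_ic50 = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5, 2.2]
truly_sensitive = [True, False, False, True, True]
cutpoint, ppv, npv = ppv_npv_at_mean_cutpoint(train_ic50, predicted, truly_sensitive)
print(cutpoint, ppv, npv)
```

The mean is used here only because the reply names it as the simplest available choice. Any other threshold would slot into the same place.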

Sorry for the confusion!

Also, yes, "max concentration" defines their upper bound for screening, which in GDSC varies drug by drug (note that in CCLE this value was fixed at 8 µM).

XLuHouston commented 6 years ago

Ok, I got it.

Another tiny question :)

Your code removes batch effects, but no normalization procedure seems to be specified. Should the training expression data be normalized first, for example by log transformation or z-scoring?

By the way, I use TCGA count data or FPKM. In homogenizeData() you can choose ComBat or z-score (standardizing the mean and variance of each gene), but only one of them is applied. So what about the input data: should I log-transform the normalized counts or FPKM first, and then run ComBat in your package?

paulgeeleher commented 6 years ago

Hi Lucas,

The GDSC cell line gene expression microarray data built into pRRophetic were RMA normalized. I believe one of the scripts included with the 2014 Genome Biology paper contains the code to do this, so there's no need to normalize the training data if you're using the GDSC data built into pRRophetic. If you are applying models to TCGA, we used log((TPM*1000000)+1) values (TPM values obtained from GDC), because these values were reasonably normally distributed. But there are many different possibilities for normalization and I think the correct choice will depend on your situation / objective.
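A minimal sketch of the transform quoted above, in Python for illustration only. It applies log((TPM*1000000)+1) exactly as written in the reply; the natural log is an assumption, since the base is not stated, and the function name and toy values are hypothetical.

```python
import math


def log_transform_tpm(tpm_values):
    """Apply log((TPM * 1000000) + 1) to each value, as written in the reply.

    Assumption: natural log; the thread does not state the base.
    """
    return [math.log((v * 1_000_000) + 1) for v in tpm_values]


# Toy values only; the scale of real GDC TPM values should be checked first.
toy_tpm = [0.0, 1.5e-6, 2.0e-4]
transformed = log_transform_tpm(toy_tpm)
print(transformed)
```

The added 1 keeps zero expression values finite (they map to 0), and the transform preserves the ordering of the input values.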

Best wishes,

Paul.