kevinblighe / RegParallel

Standard regression functions in R enabled for parallel processing over large data-frames.
37 stars 12 forks source link

Continuing Spot Check to AUC-ROC curves #9

Closed Jatbee32 closed 1 year ago

Jatbee32 commented 1 year ago

Hi Kevin,

I am looking to see if I can move forward with the spot check to get AUC-ROC curves for each gene pulled out of my glm. This might be too broad of a question and maybe more of a question for StackOverflow.

Using your program, get a total of 17 genes that are significant with great OR results (p < 0.0001 and OR > 1). I want to spot-check, which I do per your instructions, but when trying to apply it to obtain an AUC-ROC, I always return with an AUC of 1. It's just not adding up, as I can see the prediction values are not that great.

Mostly I want to know what you consider a best practice to obtain AUC-ROC and if you have any other suggestions for moving forward to obtain predictive values.

Here is my code:

Linear Regression

res2 <- RegParallel(
  data = rlddata,
  formula = '[*] ~ DS',
  FUN = function(formula, data)
  glm(formula = formula,
      data = data,
      method = 'glm.fit'),
  FUNtype = 'glm',
  variables = colnames(rlddata)[1:19336],
  p.adjust = "none")
##############################
#RegParallel
##############################

System is:
-- Darwin
Blocksize:
-- 500
Cores / Threads:
-- 3
Terms included in model:
-- DS
First 5 formulae:
-- A0A061ACU2 ~ DS
-- A0A067XMP1 ~ DS
-- A0A075F932 ~ DS
-- A0A078ISJ6 ~ DS
-- A0A087WPF7 ~ DS
Done!
subset_res2 <- as.data.frame(subset(res2, P<0.0001))
great_ods <- subset(subset_res2, OR > 1)
print(great_ods)
    Variable Term     Beta StandardError        t            P         OR   ORlower
11    B2RY56  DSE 4.127989     0.8817325 4.681679 3.957384e-05  62.052979 11.021244
15    O96006  DSE 1.976061     0.3479210 5.679625 1.863049e-06   7.214270  3.647888
22    P37889  DSE 3.934964     0.8961878 4.390781 9.497001e-05  51.160327  8.832769
24    P43351  DSE 3.057712     0.6361132 4.806867 2.706861e-05  21.278812  6.116272
25    P52551  DSE 1.625281     0.3557391 4.568746 5.566261e-05   5.079848  2.529560
36    Q08DZ2  DSE 2.241615     0.5043837 4.444265 8.092043e-05   9.408512  3.500973
56    Q5SVS4  DSE 2.739885     0.6099191 4.492210 7.007760e-05  15.485198  4.685465
59    Q5ZMD4  DSE 3.453173     0.7649351 4.514334 6.556956e-05  31.600495  7.056347
61    Q68EW0  DSE 3.599518     0.7975346 4.513307 6.577242e-05  36.580614  7.662814
69    Q6PGZ0  DSE 2.550278     0.5765542 4.423310 8.616299e-05  12.810667  4.138166
79    Q8CGF5  DSE 3.696832     0.8422798 4.389078 9.545471e-05  40.319364  7.736843
86    Q8VC88  DSE 4.665648     0.9314622 5.008951 1.461851e-05 106.234434 17.116039
91    Q944S2  DSE 4.501260     0.9755055 4.614285 4.851774e-05  90.130628 13.320503
94    Q96IV0  DSE 4.209848     0.9567025 4.400373 9.228488e-05  67.346289 10.326830
99    Q9D110  DSE 2.813713     0.6371566 4.416046 8.805714e-05  16.671706  4.782240
101   Q9D305  DSE 4.508783     0.9055879 4.978847 1.602693e-05  90.811276 15.392243
110   Q9Y3B9  DSE 3.904715     0.8797946 4.438212 8.240166e-05  49.635924  8.849396
      ORupper
11  349.37729
15   14.26735
22  296.32600
24   74.03003
25   10.20132
36   25.28443
56   51.17771
59  141.51675
61  174.62793
69   39.65844
79  210.11816
86  659.36721
91  609.85159
94  439.19796
99   58.12041
101 535.76906
110 278.40600

Spot testing

m <- glm(B2RY56 ~ DS, data=rlddata_training, family=gaussian)
summary(m)$coefficients
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 0.231903  0.4990214 0.4647156 6.458611e-01
DSE         5.280930  0.8957708 5.8954038 2.788298e-06
exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.90)))
           Odds ratio        5 %       95 %
(Intercept)   1.260997  0.5549295   2.865435
DSE         196.552674 45.0385436 857.775374
res2[which(res2$Variable == 'B2RY56'),]
  Variable Term     Beta StandardError        t            P       OR  ORlower  ORupper
1:   B2RY56  DSE 4.127989     0.8817325 4.681679 3.957384e-05 62.05298 11.02124 349.3773

AUC ROC from here

test_prob = predict(m, newdata = rlddata_testing, type = "response")
test_roc = roc(rlddata_testing$DS ~ test_prob, plot = TRUE, print.auc = TRUE)
as.numeric(test_roc$auc)
> [1] 1

Rplot

Jatbee32 commented 1 year ago

I figured out a loop and I used a different graphing option to figure it out.