haowulab / DSS

14 stars 13 forks source link

Question about paired experimental design with correction plus testing interaction effect? #41

Open KristenKrolick opened 1 year ago

KristenKrolick commented 1 year ago

@haowulab

Hello, I just performed a paired analysis. For my next analysis, I will need the methylation values per individual sample for each of my significant CpG sites in the paired analysis. Since I performed the DMLfit to correct for differences in cell types, I do not want to take the original methylation values (from my bismark cov files) for performing my next analysis. Is there a way to get the DSS corrected methylation values per individual sample back?

Thank you so much for your help and tool! Kristen

KristenKrolick commented 1 year ago

@haowulab Nevermind, I just read your paper and FAQ on the DSS manual page about how relative methylation values should not be extracted here.

Instead, Do you know of a way I can use the results from my paired analysis to perform a second statistical test? Basically, I am trying to find a work around here. I found those CpG sites that were significantly different between before and after surgery (after correcting for cell type fractions), and now I want to take those sites and see if they are associated with phenotypes of surgical complications...

Code: Treatment = factor(rep(c("Preop","Postop"),each = 51)) pair = factor(rep(1:51,2)) design = data.frame(Treatment, pair, covar)

DMLfit = DMLfit.multiFactor(chr2_bismarkBSseq, design, formula = ~Treatment + pair + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL) dmlTest = DMLtest.multiFactor(DMLfit, term="Treatment")

haowulab commented 1 year ago

Sorry for my tardy reply. You can certainly analyze paired data using DSS multiple factor function. But you need to be careful to setup the design. Please read this for RNA-seq analysis using DESeq2: https://support.bioconductor.org/p/84241/#84256. Basically, the effect you try to test should be an interaction term.

KristenKrolick commented 1 year ago

@haowulab Thank you so much for responding! I am reading the example you sent, and the experimental design is a very similar example to what I am doing, thank you. However, I am getting a little confused between the DESeq2 example you sent plus and the DSS guide for setting up paired design (see my question below):

Q1) Could you please help with how I setup (the paired design plus test for an interaction of chronic pain response) exactly using DSS multiple factor function for bsseq data? I have n = 51 preoperation and postoperation patients. Need to correct for different cell type fractions before surgery and after surgery. For those CpG sites differently methylated after surgery, need to test their association with chronic pain (so as you said setup as interaction effect somehow). I will call chronic pain the "response". Where would I put the response variable in my equation below? The DESeq2 example was showing their patient.id already setup as a sort of paired design and then the response variable being in the equation like so: ~Patient.ID + treatment * (status + Response), w/ status being what they wanted to correct for.

So far, mine looks like this:

DMLfit = DMLfit.multiFactor(BSobj, design, formula = ~ Treatment + pair + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL)

I had set it up using:

covar <- read.csv(#path to .csv containing my cell type fractions to correct for next to filename of each patient...

Treatment = factor(rep(c("Preop","Postop"),each = 51))
pair = factor(rep(1:51,2))
design = data.frame(Treatment, pair, covar)

And where my design dataframe ended up looking like this: design.csv

I know I need to add in a "response" column to my dataframe in which 1= pain and 0 = no pain; then somewhere in my equation add in the response. Also not sure but DESeq2 example there were talking about it being important that their paired design was set up 1,1,2,2,3,3,4,4,5,5,6,6 instead of 1,2,3,4,5,6,1,2,3,4,5,6 for some reason? Should I change mine?

Sincerely grateful for your help and time to allow us to perform this analysis, Kristen

haowulab commented 1 year ago

The paired design described in the DSS manual is a very simple one, where only treatment (surgery in your design) is tested. In your study, I believe you want to test the association between the change of methylation before/after surgery and chronic pain. So you want to study the change (with chronic pain) of change (with surgery), which is an interaction.

Do you have cell type proportions before and after surgery? If so, your formula should be

~ Treatment + pair + Response*( Treatment + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL)

Then you want to test the Response*Treatment term. For that you need to figure out where it is in the design matrix or the name for the term.

I wrote a simple simulation for you to play with. See below. You can study this to understand the concept.

### do a paried test simulation                                                                     
x.bef = rnorm(10)
x.aft = x.bef + 1 + rnorm(10,sd=0.5)
y.bef = rnorm(10, mean=1)
y.aft = y.bef - 1 + rnorm(10,sd=0.5)

boxplot(x.bef,x.aft, y.bef, y.aft)
boxplot(x.aft-x.bef, y.aft-y.bef)

## to test without considering pairing - not significant                                            
t.test(c(x.bef, y.bef), c(x.aft,y.aft))

## to test with considering pairing - significant                                                   
t.test(x.aft-x.bef, y.aft-y.bef)

## do a regression                                                                                  
dat = c(x.bef, y.bef, x.aft, y.aft)

condition = factor(rep(c("Preop","Postop"),each = 20))
pair = factor(rep(1:20,2))
Trt = factor( rep(rep(c("trt1","trt2"), each = 10), 2))
design = data.frame(condition, pair, Trt)

dd = data.frame(dat, design)
fit = lm(dat~pair + condition + Trt + condition:Trt)
summary(fit)
## note, the coefficient for the interaction term is the same as the t-test statistics              
KristenKrolick commented 1 year ago

@haowulab

Thank you Hao! Yes, I have cell type proportions both before and after surgery (please see design.csv). Went through the toy example you sent to try to understand it a little better, but mainly had to ask our collaborating Statistician for help.

The ~ Treatment + pair + Response*( Treatment + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL) you sent returned the error: Error in solve.default(XTVinv %*% X) : system is computationally singular.... Our collaborating Statistician said that would test pain specific cell type effect (not our question).

She told me to use the following formula instead : R DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response)). I just wanted to confirm with you that this formula will give us the pain-specific pre-post difference correcting for cell types? I am trying to understand which places of before and after tilde correspond to the x and y in "y=Bo + B1x"? Are we sure formula should not be ~pair instead of ~treatment? If I run this formula, and then run dmlTest = DMLtest.multiFactor(DMLfit, term="Treatment*response"), the following error is returned: Error in makeContrast(DMLfit, term) : Some term(s) to be tested can't be found in the formula.

Thank you for being so helpful to allow us to perform this analysis! Kristen

haowulab commented 1 year ago

The singularity problem is because the design matrix is not of full rank. Perhaps you have to remove one of the cell type proportions in the model, since they sum up to 1 they are not independent.

For the formula suggested by your statistician: you can use that formula too. But if the cell type proportions are independent with Response, it won't make much difference. However, if the proportions are associated with Response, the interpretation will be different. Even with the interaction, if you test Response:Treatment, it's not testing the "pain specific cell type effect". Such effect can be tested by testing things like Response:Mono_3, etc. You can play with the following codes:

x.bef = rnorm(10)
x.aft = x.bef + 1 + rnorm(10,sd=0.5)
y.bef = rnorm(10, mean=1)
y.aft = y.bef - 1 + rnorm(10,sd=0.5)

boxplot(x.bef,x.aft, y.bef, y.aft)
boxplot(x.aft-x.bef, y.aft-y.bef)

## to test without considering pairing - not significant
t.test(c(x.bef, y.bef), c(x.aft,y.aft))

## to test with considering pairing - significant
t.test(x.aft-x.bef, y.aft-y.bef)

## do a regression
x = c(x.bef, x.aft)
y = c(y.bef, y.aft)
dat = c(x.bef, y.bef, x.aft, y.aft)

pair = factor(rep(1:20,2))
Treatment = factor(rep(c("Preop","Postop"),each = 20))
Response = factor( rep(rep(c("Good","Bad"), each = 10), 2))
## randomly generate proportions, assuming two cell types
ct1 = runif(length(condition))
ct2 = 1-ct1

## now do regression

## without proportion
fit = lm(dat ~ pair + Treatment + Response + Response:Treatment)
summary(fit)
## same as t-test

## with proportion, but without proportion by Response interaction 
fit = lm(dat ~ pair + Treatment + Response + ct1 + ct2 + Response:Treatment)
summary(fit)
## a little different

## with proportion by Response interaction 
dd = data.frame(dat, design)
fit = lm(dat ~ pair + Treatment + Response*(Treatment+ct1+ct2))
summary(fit)
## slightly different 
haowulab commented 1 year ago

By the way, I'm curious how you get the cell type proportions. Did you estimate them from the data?

KristenKrolick commented 1 year ago

@haowulab

Hmm I am in the middle of studying your examples (and multiple regression in general), but I do not understand: Q1) The 'fit = lm()' function in R seems to be a little different from your 'DMLfit = DMLfit.multifactor function'-- as in the dependent variable (aka the y in y = B0 + B1x) usually goes to the left of the tilde. In your function are we assuming that to the left of the tilde would be the Bsobj, the methylation values? Q2) I am not understanding how we are testing the difference in methylation between preop and postop (after correcting for blood cell type fractions) from this formula? -- I suppose it's okay if I don't quite understand this one, as long as you and our Statistician confirm this for me. Q3) Okay so if I use the DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response)). I get the error, Error in makeContrast(DMLfit, term) : Some term(s) to be tested can't be found in the formula . How would I fix this?

For my blood fraction in silico estimation:

  1. I used the Reference provided by Salas et al. 2022 (https://www.nature.com/articles/s41467-021-27864-7), actual Ref available:(https://github.com/immunomethylomics/FlowSorted.BloodExtended.EPIC/tree/main).
  2. I converted this reference from the Illumina Array sites into UCSC chromosomal coordinates using the manifest provided by Illumina.
  3. I used Epidish to perform the deconvolution. All of this is outlined and the code I used here: https://github.com/KristenKrolick/Chidambaran_Lab_Manuscript/blob/main/n)%20Epidish p.s. We had samples that we had flowcyto data for as well as whole-genome meth seq, so I was able to perform a Pearson correlation, and (at least for those cells whose same cell markers were used in our flow data vs. salas ref cell markers he used to make his Ref), they correlated! So, I really liked his Reference.
KristenKrolick commented 1 year ago

Oh! And Q4) Your DSS manual outlines reasoning for why one would not need to subset their whole-genome bisulfite seq data by minimum coverage threshold: does the same apply for the general hypothesis testing design that I am using??

...I would think I would want to subset my data by at least mincov 4 or so, just to qualify that I am getting an accurate methylation call since my seq. data is from whole blood (mixed cells)...

Thank you!

haowulab commented 1 year ago

See my comments below.

On Nov 12, 2023, at 12:24 AM, KristenKrolick @.***> wrote:

@haowulab https://github.com/haowulab Hmm I am in the middle of studying your examples (and multiple regression in general), but I do not understand: Q1) The 'fit = lm()' function in R seems to be a little different from your 'DMLfit = DMLfit.multifactor function'-- as in the dependent variable (aka the y in y = B0 + B1x) usually goes to the left of the tilde. In your function are we assuming that to the left of the tilde would be the Bsobj, the methylation values?

Yes this is correct.

Q2) I am not understanding how we are testing the difference in methylation between preop and postop (after correcting for blood cell type fractions) from this formula? -- I suppose it's okay if I don't quite understand this one, as long as you and our Statistician confirm this for me.

Yes. consult your statistician.

Q3) Okay so if I use the DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response)). I get the error, Error in makeContrast(DMLfit, term) : Some term(s) to be tested can't be found in the formula . How would I fix this?

hmmm, this error should not appear in DMLfit.multiFactor function. It should be in DMLtest.multiFactor function, because you are trying to test something. What did you use for that? To choose the proper term to test is a bit complicated, especially in your design. Please read the manual, and consult your statistician.

For my blood fraction in silico estimation:

I used the Reference provided by Salas et al. 2022 (https://www.nature.com/articles/s41467-021-27864-7), actual Ref available:(https://github.com/immunomethylomics/FlowSorted.BloodExtended.EPIC/tree/main). I converted this reference from the Illumina Array sites into UCSC chromosomal coordinates using the manifest provided by Illumina. I used Epidish to perform the deconvolution. All of this is outlined and the code I used here: https://github.com/KristenKrolick/Chidambaran_Lab_Manuscript/blob/main/n)%20Epidish p.s. We had samples that we had flowcyto data for as well as whole-genome meth seq, so I was able to perform a Pearson correlation, and (at least for those cells whose same cell markers were used in our flow data vs. salas ref cell markers he used to make his Ref), they correlated! So, I really liked his Reference.

Okay thanks. I’m not sure if using epidish will be good for BS-seq. What’s the correlation?

— Reply to this email directly, view it on GitHub https://github.com/haowulab/DSS/issues/41#issuecomment-1806857609, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7H4N7OSVYKSJ2N5H3XTZLYD6RDJAVCNFSM6AAAAAA6472JXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBWHA2TONRQHE. You are receiving this because you were mentioned.

haowulab commented 1 year ago

You can certainly filter the sites by coverage, but it’s not necessary. DSS will consider the coverage in the model. If the coverage is very shallow, it’s very unlikely that the site will be called as DML. This said, it doesn’t hurt to filter.

On Nov 12, 2023, at 12:36 AM, KristenKrolick @.***> wrote:

Oh! And Q4) Your DSS manual outlines reasoning for why one would not need to subset their whole-genome bisulfite seq data by minimum coverage threshold: does the same apply for the general hypothesis testing design that I am using??

...I would think I would want to subset my data by at least mincov 4 or so, just to qualify that I am getting an accurate methylation call since my seq. data is from whole blood (mixed cells)...

Thank you!

— Reply to this email directly, view it on GitHub https://github.com/haowulab/DSS/issues/41#issuecomment-1806860364, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7H4NYJPWGOJBPVKEJFIL3YD6SPRAVCNFSM6AAAAAA6472JXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBWHA3DAMZWGQ. You are receiving this because you were mentioned.

KristenKrolick commented 1 year ago

@haowulab In response to Q3-- sorry it was the same question from my comment 4 days ago, and I did not write it out fully again:

So the design matrix is the same as above (4 days ago comment).

Here was my code:

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))
dmlTest = DMLtest.multiFactor(DMLfit, term="(Treatment*response)")
#Error in makeContrast(DMLfit, term) : Some term(s) to be tested can't be found in the formula.

treatment*response was clearly in the formula? Should I send all the beginning parts of my code again in me making the design dataframe?

KristenKrolick commented 1 year ago

...and If I look at the doublearry in DMLfit called "X". It is 102 rows by 60 columns, and looks like this: X.csv

retried the dmlTest like so: dmlTest = DMLtest.multiFactor(DMLfit, term="TreatmentPreop:response1") and same error was returned.

Oh I have to name it as coef if I use it that way, using: dmlTest = DMLtest.multiFactor(DMLfit, coef="TreatmentPreop:response1") and it is running now!

KristenKrolick commented 12 months ago

@haowulab I would like to point out that I had my design data.frame wrong because of the way I had input my pain values (they did not correspond to the same order as my file names & were hence random data). Here is me fixing this and now including the full code:

Basically, now when my design frame,design.csv, is fixed so that my "response" aka pain "yes" or "no" variables are in the correct location, DMLfit no longer works, returning collinearity error:


require(bsseq)
library(dmrseq)
files <-list.files("/data/path/bismarkcovfiles")
files
setwd("/data/path/bismarkcovfiles")
bismarkBSseq <- read.bismark(files,
                             loci = NULL,
                             colData = NULL,
                             rmZeroCov = FALSE,
                             strandCollapse = TRUE,
                             verbose = TRUE)
#covar dataframe consisting of columns of cell type estimates and rows consisting of each sample:
## (preop samples 1 through 51 as rows 1 through 51 and then postop samples as rows 52 - 102)
covar <- read.csv("/data/path/CellTypeCovar_same51Visit1_2.csv")

Treatment = factor(rep(c("Preop","Postop"),each = 51))
pair = factor(rep(1:51,2))
response <- factor(c("Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes"))
#also tried this guy immediately below just in case, but had same error type of collinearity after attempting DMLfit function:
#response = factor(c("1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1", "1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1"))
##and also tried this guy immediately below just in case, but had same error type of collinearity after attempting DMLfit function:
#response = c("1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1", "1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1")
design = data.frame(Treatment, pair, response, covar)

#need to filter bismarkBSseq by chromosome because my analysis is too large to run 
chr1_bismarkBSseq<- chrSelectBSseq(bismarkBSseq, seqnames = "chr1", order = TRUE)

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))
#Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.59732e-18"
##Hmmm above says it means I have a problem of collinearity, so I ran a Pearson Correlation and turns out CD8_total was related to pain, so getting rid of CD8:
DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))
#Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.61807e-18"

###trying it without NK and CD8 which were correlated > .1
DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + B_TOTAL + CD4_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))
#Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.62955e-18"

##trying it without NK and CD8, B, and Mono which were almost closer to correlated?
DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + CD4_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))
#Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.67338e-18"

#Also tried using Hao's formula: 
DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response*(Treatment + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL))
#Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.42241e-18"

#Trying Hao's formula without CD8:
DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response*(Treatment + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + GRANULOCYTES_TOTAL))
#Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.45254e-18"

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/R/4.2.1/lib64/R/lib/libRblas.so
LAPACK: /usr/local/R/4.2.1/lib64/R/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
[8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
  [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] dmrseq_1.18.1               DSS_2.46.0                  BiocParallel_1.32.4         bsseq_1.34.0                SummarizedExperiment_1.28.0 Biobase_2.58.0              MatrixGenerics_1.10.0      
[8] matrixStats_0.63.0          GenomicRanges_1.50.1        GenomeInfoDb_1.34.9         IRanges_2.32.0              S4Vectors_0.36.1            BiocGenerics_0.44.0        

loaded via a namespace (and not attached):
  [1] colorspace_2.1-0              rjson_0.2.21                  ellipsis_0.3.2                XVector_0.38.0                rstudioapi_0.14               bit64_4.0.5                  
[7] interactiveDisplayBase_1.36.0 AnnotationDbi_1.60.0          fansi_1.0.4                   xml2_1.3.3                    codetools_0.2-18              splines_4.2.1                
[13] R.methodsS3_1.8.2             sparseMatrixStats_1.10.0      cachem_1.0.6                  Rsamtools_2.14.0              dbplyr_2.2.1                  png_0.1-8                    
[19] R.oo_1.25.0                   shiny_1.7.3                   HDF5Array_1.26.0              BiocManager_1.30.19           readr_2.1.3                   compiler_4.2.1               
[25] httr_1.4.4                    assertthat_0.2.1              Matrix_1.5-3                  fastmap_1.1.0                 limma_3.54.0                  cli_3.6.1                    
[31] later_1.3.0                   htmltools_0.5.4               prettyunits_1.1.1             tools_4.2.1                   gtable_0.3.3                  glue_1.6.2                   
[37] GenomeInfoDbData_1.2.9        annotatr_1.24.0               reshape2_1.4.4                dplyr_1.1.1                   doRNG_1.8.2                   rappdirs_0.3.3               
[43] Rcpp_1.0.10                   bumphunter_1.40.0             vctrs_0.6.1                   Biostrings_2.66.0             rhdf5filters_1.10.0           nlme_3.1-160                 
[49] rtracklayer_1.58.0            iterators_1.0.14              DelayedMatrixStats_1.20.0     stringr_1.5.0                 mime_0.12                     lifecycle_1.0.3              
[55] restfulr_0.0.15               rngtools_1.5.2                gtools_3.9.4                  XML_3.99-0.13                 AnnotationHub_3.6.0           zlibbioc_1.44.0              
[61] scales_1.2.1                  BSgenome_1.66.2               hms_1.1.2                     promises_1.2.0.1              rhdf5_2.42.0                  RColorBrewer_1.1-3           
[67] yaml_2.3.6                    curl_4.3.3                    memoise_2.0.1                 ggplot2_3.4.2                 biomaRt_2.54.0                stringi_1.7.12               
[73] RSQLite_2.2.19                BiocVersion_3.16.0            BiocIO_1.8.0                  foreach_1.5.2                 permute_0.9-7                 GenomicFeatures_1.50.2       
[79] filelock_1.0.2                rlang_1.1.0                   pkgconfig_2.0.3               bitops_1.0-7                  lattice_0.20-45               Rhdf5lib_1.20.0              
[85] GenomicAlignments_1.34.0      bit_4.0.5                     tidyselect_1.2.0              plyr_1.8.8                    magrittr_2.0.3                R6_2.5.1                     
[91] generics_0.1.3                DelayedArray_0.24.0           DBI_1.1.3                     pillar_1.9.0                  KEGGREST_1.38.0               RCurl_1.98-1.9               
[97] tibble_3.2.1                  crayon_1.5.2                  utf8_1.2.3                    BiocFileCache_2.6.0           tzdb_0.3.0                    progress_1.2.2               
[103] locfit_1.5-9.6                grid_4.2.1                    data.table_1.14.8             blob_1.2.3                    digest_0.6.31                 xtable_1.8-4                 
[109] httpuv_1.6.6                  regioneR_1.30.0               outliers_0.15                 R.utils_2.12.2                munsell_0.5.0                
> ```
haowulab commented 12 months ago

There might be confounding among your variables. You can look at the design matrix to see what’s going on. Please consult your statistician to see what you can do.

发件人: @. @. 代表 KristenKrolick 发送时间: 2023年11月17日 3:49 收件人: haowulab/DSS 抄送: Hao Wu; Mention 主题: Re: [haowulab/DSS] Question about paired experimental design with correction plus testing interaction effect? (Issue #41)

@haowulab https://github.com/haowulab Sorry about this, I would just like to point out that I had my design data.frame wrong because of the way I had input my pain values (they did not correspond to the same order as my file names & were hence random data). Here is me fixing this and now including the full code:

Basically, now when my design frame,design.csv https://github.com/haowulab/DSS/files/13382740/design.csv , is fixed so that my "response" aka pain "yes" or "no" variables are in the correct location, DMLfit no longer works, returning collinearity error:

require(bsseq) library(dmrseq) files <-list.files("/data/path/bismarkcovfiles") files setwd("/data/path/bismarkcovfiles") bismarkBSseq <- read.bismark(files, loci = NULL, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE, verbose = TRUE)

covar dataframe consisting of columns of cell type estimates and rows consisting of each sample:

(preop samples 1 through 51 as rows 1 through 51 and then postop samples as rows 52 - 102)

covar <- read.csv("/data/path/CellTypeCovar_same51Visit1_2.csv")

Treatment = factor(rep(c("Preop","Postop"),each = 51)) pair = factor(rep(1:51,2)) response <- factor(c("Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes"))

also tried this guy immediately below just in case, but had same error type of collinearity after attempting DMLfit function:

response = factor(c("1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1", "1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1"))

and also tried this guy immediately below just in case, but had same error type of collinearity after attempting DMLfit function:

response = c("1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1", "1", "0", "1", "1", "1", "0", "0", "1", "0", "0", "0", "1", "1", "1", "0", "0", "0", "1", "0", "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1")

design = data.frame(Treatment, pair, response, covar)

need to filter bismarkBSseq by chromosome because my analysis is too large to run

chr1_bismarkBSseq<- chrSelectBSseq(bismarkBSseq, seqnames = "chr1", order = TRUE)

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))

Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.59732e-18"

Hmmm above says it means I have a problem of collinearity, so I ran a Pearson Correlation and turns out CD8_total was related to pain, so getting rid of CD8:

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))

Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.61807e-18"

trying it without NK and CD8 which were correlated > .1

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + Mono_3 + B_TOTAL + CD4_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))

Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.62955e-18"

trying it without NK and CD8, B, and Mono which were almost closer to correlated?

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response + CD4_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))

Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.67338e-18"

Also tried using Hao's formula:

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response*(Treatment + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL))

Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.42241e-18"

Trying Hao's formula without CD8:

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + response*(Treatment + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + GRANULOCYTES_TOTAL))

Error returned: "Fitting DML model for CpG site: Error in solve.default(XTVinv %*% X) : system is computationally singular: reciprocal condition number = 1.45254e-18"

sessionInfo() R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS: /usr/local/R/4.2.1/lib64/R/lib/libRblas.so LAPACK: /usr/local/R/4.2.1/lib64/R/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] dmrseq_1.18.1 DSS_2.46.0 BiocParallel_1.32.4 bsseq_1.34.0 SummarizedExperiment_1.28.0 Biobase_2.58.0 MatrixGenerics_1.10.0
[8] matrixStats_0.63.0 GenomicRanges_1.50.1 GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.1 BiocGenerics_0.44.0

loaded via a namespace (and not attached): [1] colorspace_2.1-0 rjson_0.2.21 ellipsis_0.3.2 XVector_0.38.0 rstudioapi_0.14 bit64_4.0.5
[7] interactiveDisplayBase_1.36.0 AnnotationDbi_1.60.0 fansi_1.0.4 xml2_1.3.3 codetools_0.2-18 splines_4.2.1
[13] R.methodsS3_1.8.2 sparseMatrixStats_1.10.0 cachem_1.0.6 Rsamtools_2.14.0 dbplyr_2.2.1 png_0.1-8
[19] R.oo_1.25.0 shiny_1.7.3 HDF5Array_1.26.0 BiocManager_1.30.19 readr_2.1.3 compiler_4.2.1
[25] httr_1.4.4 assertthat_0.2.1 Matrix_1.5-3 fastmap_1.1.0 limma_3.54.0 cli_3.6.1
[31] later_1.3.0 htmltools_0.5.4 prettyunits_1.1.1 tools_4.2.1 gtable_0.3.3 glue_1.6.2
[37] GenomeInfoDbData_1.2.9 annotatr_1.24.0 reshape2_1.4.4 dplyr_1.1.1 doRNG_1.8.2 rappdirs_0.3.3
[43] Rcpp_1.0.10 bumphunter_1.40.0 vctrs_0.6.1 Biostrings_2.66.0 rhdf5filters_1.10.0 nlme_3.1-160
[49] rtracklayer_1.58.0 iterators_1.0.14 DelayedMatrixStats_1.20.0 stringr_1.5.0 mime_0.12 lifecycle_1.0.3
[55] restfulr_0.0.15 rngtools_1.5.2 gtools_3.9.4 XML_3.99-0.13 AnnotationHub_3.6.0 zlibbioc_1.44.0
[61] scales_1.2.1 BSgenome_1.66.2 hms_1.1.2 promises_1.2.0.1 rhdf5_2.42.0 RColorBrewer_1.1-3
[67] yaml_2.3.6 curl_4.3.3 memoise_2.0.1 ggplot2_3.4.2 biomaRt_2.54.0 stringi_1.7.12
[73] RSQLite_2.2.19 BiocVersion_3.16.0 BiocIO_1.8.0 foreach_1.5.2 permute_0.9-7 GenomicFeatures_1.50.2
[79] filelock_1.0.2 rlang_1.1.0 pkgconfig_2.0.3 bitops_1.0-7 lattice_0.20-45 Rhdf5lib_1.20.0
[85] GenomicAlignments_1.34.0 bit_4.0.5 tidyselect_1.2.0 plyr_1.8.8 magrittr_2.0.3 R6_2.5.1
[91] generics_0.1.3 DelayedArray_0.24.0 DBI_1.1.3 pillar_1.9.0 KEGGREST_1.38.0 RCurl_1.98-1.9
[97] tibble_3.2.1 crayon_1.5.2 utf8_1.2.3 BiocFileCache_2.6.0 tzdb_0.3.0 progress_1.2.2
[103] locfit_1.5-9.6 grid_4.2.1 data.table_1.14.8 blob_1.2.3 digest_0.6.31 xtable_1.8-4
[109] httpuv_1.6.6 regioneR_1.30.0 outliers_0.15 R.utils_2.12.2 munsell_0.5.0

— Reply to this email directly, view it on GitHub https://github.com/haowulab/DSS/issues/41#issuecomment-1815209862 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7H4N5RNWUKQ3PIWSOPULLYEZUZPAVCNFSM6AAAAAA6472JXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJVGIYDSOBWGI . You are receiving this because you were mentioned. https://github.com/notifications/beacon/AD7H4N4LHOUZJLQIGHZUGDLYEZUZPA5CNFSM6AAAAAA6472JXKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTTMGHTYM.gif Message ID: @. @.> >

KristenKrolick commented 11 months ago

@haowulab , Question about your coverage reply:

You can certainly filter the sites by coverage, but it’s not necessary. DSS will consider the coverage in the model. If the coverage is very shallow, it’s very unlikely that the site will be called as DML. This said, it doesn’t hurt to filter. On Nov 12, 2023, at 12:36 AM, KristenKrolick @.***> wrote: Oh! And Q4) Your DSS manual outlines reasoning for why one would not need to subset their whole-genome bisulfite seq data by minimum coverage threshold: does the same apply for the general hypothesis testing design that I am using?? ...I would think I would want to subset my data by at least mincov 4 or so, just to qualify that I am getting an accurate methylation call since my seq. data is from whole blood (mixed cells)... Thank you! — Reply to this email directly, view it on GitHub <#41 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7H4NYJPWGOJBPVKEJFIL3YD6SPRAVCNFSM6AAAAAA6472JXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBWHA3DAMZWGQ. You are receiving this because you were mentioned.

I performed below just to get those sites sig after surgery (not associated with pain yet). But I am getting significant loci being returned that are lower coverage. I was trying to find more detailed info for how you take coverage into consideration during the stats, can you please provide? Also can you please check my code below to make sure I am not missing a step?-- like for example, I do not have to perform the smoothing step or anything?? Thank you! Kristen

require(bsseq)
library(dmrseq)
files <-list.files("/data/path/bismarkcovfiles")
files
setwd("/data/path/bismarkcovfiles")
bismarkBSseq <- read.bismark(files,
                             loci = NULL,
                             colData = NULL,
                             rmZeroCov = FALSE,
                             strandCollapse = TRUE,
                             verbose = TRUE)
#covar dataframe consisting of columns of cell type estimates and rows consisting of each sample:
## (preop samples 1 through 51 as rows 1 through 51 and then postop samples as rows 52 - 102)
covar <- read.csv("/data/path/CellTypeCovar_same51Visit1_2.csv")

Treatment = factor(rep(c("Preop","Postop"),each = 51))
pair = factor(rep(1:51,2))
design = data.frame(Treatment, pair, covar)

#need to filter bismarkBSseq by chromosome because my analysis is too large to run 
chr1_bismarkBSseq<- chrSelectBSseq(bismarkBSseq, seqnames = "chr1", order = TRUE)

DMLfit = DMLfit.multiFactor(chr1_bismarkBSseq, design, formula = ~Treatment + pair + Mono_3 + NK_3 + B_TOTAL + CD4_TOTAL + CD8_TOTAL + GRANULOCYTES_TOTAL + (Treatment*response))
dmlTest = DMLtest.multiFactor(DMLfit, term = "Treatment")