kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License

How to correctly set my testCovariate? #31

Closed MohamedRefaat92 closed 4 years ago

MohamedRefaat92 commented 4 years ago

Hi,

Below is part of the sample table from my experiment. I used `time_point` as the testCovariate and it worked pretty well, but when I added `case_id` as an adjustCovariate it gave me the following error: `error code 1 from Lapack routine 'dgesdd'`. In addition, there was this warning message: `stop worker failed: attempt to select less than one element in OneIndex`.

What I'm trying to do is compare the two samples (`*_a`, `*_b`) that belong to the same `case_id` at different time points. Is this attainable using dmrseq?

Best, Mohamed

| time_point | case_id | sample_id |
| --- | --- | --- |
| t1 | 1 | 1_a |
| t2 | 1 | 1_b |
| t1 | 2 | 2_1 |
| t2 | 2 | 2_b |
| t1 | 3 | 3_a |
| t2 | 3 | 3_b |
kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

Thank you for reporting this issue.

Indeed, it should be possible to test for time point and adjust for case id as you describe. I'm not sure what is causing this error - can you give me a bit more detail?

Does this table include all of your samples? How is case_id encoded - as a numeric or factor? How many cores are you using? Do you get the same error if you try running on a very small subset of the data (e.g. 10K sites)? If the answer to the last question is yes, it would be great if you could send me a small example subset so that I may try and reproduce the error on my end.
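If it helps, taking a quick test subset could look something like this (an illustrative sketch, not from the thread; it assumes your BSseq object is named `bs` and that the column names match the table you posted):

```r
# Illustrative only: subset to the first 10,000 CpG sites and re-run dmrseq
# to see whether the error still appears on a small subset.
bs_small <- sort(bs)[1:10000, ]
testRegions <- dmrseq(bs = bs_small,
                      testCovariate = "time_point",
                      adjustCovariate = "case_id")
```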

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer,

Thanks for your prompt reply. Here are the answers to your questions.

1. The table I've shared with you doesn't include all the samples; the original table includes 24 samples (12 in each class of the testCovariate).
2. The column `case_id` is encoded as numeric. Does dmrseq treat the adjustCovariate the same way it treats the testCovariate, converting it to a factor?
3. For this analysis I used only 2 cores, registered with `register(MulticoreParam(2))`.
4. I haven't tried to down-sample the data before running dmrseq, but I will give it a try and let you know.

Best, Mohamed

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer ,

I've run the analysis on a very small subset (the first 10k loci from chr1), keeping everything else the same. This time I didn't get the error message!

Hope you find this helpful.

Best, Mohamed

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

Regarding (2), testCovariate and adjustCovariate will only be converted to a factor if they are character-encoded. Additionally, testCovariate is converted to a factor if only two numeric values are present (a message is output when this happens).

It sounds like you intend for the adjustCovariate in your case to be a factor. Can you try making that change and see if the same error arises?
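For reference, that change might look like this (a minimal sketch; it assumes your BSseq object is called `bs` and the column is named `case_id` as in your table):

```r
# Re-encode the adjustment covariate as a factor instead of a numeric
pData(bs)$case_id <- factor(pData(bs)$case_id)
```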

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer,

I have tried the suggested change and converted the adjustCovariate into a factor before running the analysis. I didn't get an error this time, but I got the message shown in the image below for all chromosomes. I tried googling the message but couldn't work out what is causing it.

I hope you can help me interpret this message.

Best, Mohamed

[screenshot: dmrseq output showing the message "Not enough degree of freedom to fit the model. Drop some terms in formula"]

kdkorthauer commented 4 years ago

Thanks for providing the additional information. It seems that the model you are trying to fit contains more parameters than samples. Can you also provide the output of running pData(bs), where bs is your BSseq object?

Best, Keegan

MohamedRefaat92 commented 4 years ago

Dear @kdkorthauer,

I've provided the required table below. This is also the code I use to run dmrseq:

```r
dmrseq_regions <- dmrseq(bs = sort(bs),
                         cutoff = 0.05,
                         testCovariate = time_point,
                         adjustCovariate = case_id)
```

Best, Mohamed

| time_point | case_id | sample_id |
| --- | --- | --- |
| t1 | 1 | t1_1 |
| t2 | 1 | t2_1 |
| t1 | 2 | t1_2 |
| t2 | 2 | t2_2 |
| t1 | 3 | t1_3 |
| t2 | 3 | t2_3 |
| t1 | 4 | t1_4 |
| t2 | 4 | t2_4 |
| t1 | 5 | t1_5 |
| t2 | 5 | t2_5 |
| t1 | 6 | t1_6 |
| t2 | 6 | t2_6 |
| t1 | 7 | t1_7 |
| t2 | 7 | t2_7 |
| t1 | 8 | t1_8 |
| t2 | 8 | t2_8 |
| t1 | 9 | t1_9 |
| t2 | 9 | t2_9 |
| t1 | 10 | t1_10 |
| t2 | 10 | t2_10 |
| t1 | 11 | t1_11 |
| t2 | 11 | t2_11 |
| t1 | 12 | t1_12 |
| t2 | 12 | t2_12 |
kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

Thanks for sending the sample table. I still don't see any issues that would cause the degrees of freedom error. Would you mind sending a small subset of chromosome 1 that throws the error (e.g. first 10K sites)? You can upload it via Dropbox here, or send it to me via email at keegan@stat.ubc.ca. This will help me diagnose the issue and provide a fix.

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer,

I only get the error message `Not enough degree of freedom to fit the model. Drop some terms in formula` when I use all the data. For a small subset (the first 10k loci), I don't get any errors.

One thing I might try is to increase the size of the subset above 10k and see when I start to get this error. This is not an efficient way to find what causes the error, though, so please let me know if you have a better idea.

Best, Mohamed

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

I would try a few increments (e.g. the first 100K loci, then the first 1M loci). Based on your previous error messages, the error should occur even if you run on chr1 alone. It would be helpful, though not strictly necessary, if you could find a smaller subset to send for diagnostic purposes.

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer,

I found that the error occurs with a subset as small as the first 100k loci (shown in the image below). At this stage, would it help if I sent you the first 100k loci?

Best, Mohamed

[screenshot: dmrseq output showing the degrees-of-freedom error on the 100k subset]

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

That would be great if you could send the object with the first 100K loci that throw the error. You can upload it here. Thank you!

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer ,

The rds object containing the first 100k loci is too big (36 GB) for Dropbox, which only accepts uploads under 2 GB. Is there any other means of sending it to you?

Best, Mohamed

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

Can you try recreating the object and saving it again to see if that reduces the file size? This is a problem we've run into before, and that's how I was able to solve it (previously you sent a 19 GB object and resaving it reduced it to 56 MB).

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer ,

I've downloaded the latest version of bsseq, recreated the object by loading and resaving it in the same way that you did last time, and it worked. The new, smaller object has just been uploaded.

On a side note, I still find it confusing that creating the object from the coverage and methylation matrices produces a significantly larger object:

```r
bs <- BSseq(chr = chr,
            pos = pos,
            M = as.matrix(M_b), Cov = as.matrix(Cov_b),
            sampleNames = sampleNames)
pryr::object_size(bs)
#> 13.6 GB
```

or even trying to use the bs object created in the previous code chunk in the following way:

```r
bs_new <- BSseq(M = getCoverage(bs, type = "M"),
                Cov = getCoverage(bs, type = "Cov"),
                pos = start(bs),
                chr = as.character(seqnames(bs)))
pData(bs_new) <- pData(bs)
pryr::object_size(bs_new)
#> 13.3 GB
```

compared with saving the same `bs` object, reading it back in, and doing the same thing:

```r
bs <- read_rds("bs_100k.rds")
bs_new <- BSseq(M = getCoverage(bs, type = "M"),
                Cov = getCoverage(bs, type = "Cov"),
                pos = start(bs),
                chr = as.character(seqnames(bs)))
pData(bs_new) <- pData(bs)
pryr::object_size(bs_new)
#> 48.7 MB
```

Best, Mohamed

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

Thank you for sending the smaller object. I've just checked, and unfortunately the pData slot is empty, so I don't have the metadata. Can you add that and resend?

Re: the differences in object sizes, this is a side effect of the subsetted object keeping pointers back to the original one. You can also try using the `realize` function, as discussed in this issue.
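For illustration, realizing the delayed assay matrices in memory before saving could look something like this (a sketch, not from this thread; it assumes bsseq is loaded and that the object has the standard `M` and `Cov` assays of a BSseq object):

```r
library(DelayedArray)

# Replace the delayed, pointer-backed assays with ordinary in-memory matrices
assays(bs)[["M"]]   <- realize(assays(bs)[["M"]])
assays(bs)[["Cov"]] <- realize(assays(bs)[["Cov"]])

# Saving the realized object should no longer drag along the parent data
saveRDS(bs, "bs_subset.rds")  # file name is just a placeholder
```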

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer ,

I've re-uploaded the object and made sure that all the associated data are there.

Best, Mohamed

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

Thank you for updating the object with the metadata. I was able to reproduce the error you were getting and tracked it down to the adjustCovariate having unused factor levels. If you look at the `patient_id` column in the `pData()` slot of the object you sent, it has 24 values, 12 of which are unique, but it also has 25 factor levels. Perhaps this object is a subset of 12 patient ids from an original set of 25? In any case, I have added a fix that drops unused factor levels. I just pushed the change; it is available on GitHub immediately and will be in Bioc devel within the next day or two.
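For anyone hitting this before the fix propagates, a workaround on the user side is to drop the unused levels before calling dmrseq (a minimal sketch, using the `patient_id` column name from the object above):

```r
# Drop factor levels that no longer correspond to any sample in the subset
pData(bs)$patient_id <- droplevels(pData(bs)$patient_id)
```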

With the new version (dmrseq_1.5.13), the following code runs without error on the object:

```r
library(dmrseq)

bs <- readRDS("Mohamed Refaat - 2019-10-12 01-46-37 - bs_100k.rds")

dmrseq_regions <- dmrseq(bs = sort(bs),
                         cutoff = 0.05,
                         testCovariate = "time_point",
                         adjustCovariate = "patient_id")
```

Please try it out on your full data and let me know if that fixes the issue for you. Thanks again for reporting this, and sorry it took so long to track it down.

Best, Keegan

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer,

I think it's finally working! I've been running dmrseq over the whole data without getting any errors. Thanks a lot for your effort in maintaining the package.

Best, Mohamed

MohamedRefaat92 commented 4 years ago

Hi @kdkorthauer,

I'm sorry to report this, but the warning message regarding the model degrees of freedom is still there.

I might have mistakenly tested the fix on a smaller subset instead of the whole data, or, because not all chromosomes show the warning message, I may have assumed it was working when it wasn't.

The following is the output of running dmrseq on a 100k dataset.

```
> dmrseq_regions <- dmrseq(bs=sort(bs[1:100000]),
+                          cutoff = 0.05,
+                          testCovariate=testCovariate,
+                          adjustCovariate = adjustCovariate)
Assuming the test covariate time_point is a factor.
Condition: dx vs d33
Adjusting for covariate(s): patient_id
Parallelizing using 2 workers/cores (backend: BiocParallel:MulticoreParam).
Computing on 1 chromosome(s) at a time.

Detecting candidate regions with coefficient larger than 0.05 in magnitude.
...Chromosome chr1: Smoothed (0.04 min). 1394 regions scored (0.64 min).
* 1394 candidates detected
Performing unrestricted permutation of covariate of interest across samples to generate a null distribution of region test statistics

Beginning permutation 1
...Chromosome chr1: Smoothed (0.04 min). 5 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop some terms in formula
1568 regions scored (0.79 min).
* 1 out of 10 permutations completed (1568 null candidates)

Beginning permutation 2
...Chromosome chr1: Smoothed (0.04 min). 23 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop some terms in formula
1535 regions scored (0.73 min).
* 2 out of 10 permutations completed (1535 null candidates)

Beginning permutation 3
...Chromosome chr1: Smoothed (0.04 min). 20 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop some terms in formula
Not enough degree of freedom to fit the model. Drop some terms in formula
1582 regions scored (0.68 min).
* 3 out of 10 permutations completed (1582 null candidates)

Beginning permutation 4
...Chromosome chr1: Smoothed (0.05 min). 26 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop some terms in formula
1609 regions scored (0.86 min).
* 4 out of 10 permutations completed (1609 null candidates)

Beginning permutation 5
...Chromosome chr1: Smoothed (0.04 min). 8 CpG(s) excluded due to zero coverage. 1703 regions scored (0.77 min).
* 5 out of 10 permutations completed (1703 null candidates)

Beginning permutation 6
...Chromosome chr1: Smoothed (0.04 min). 19 CpG(s) excluded due to zero coverage. 1739 regions scored (0.78 min).
* 6 out of 10 permutations completed (1739 null candidates)

Beginning permutation 7
...Chromosome chr1: Smoothed (0.04 min). 16 CpG(s) excluded due to zero coverage. 1397 regions scored (0.65 min).
* 7 out of 10 permutations completed (1397 null candidates)

Beginning permutation 8
...Chromosome chr1: Smoothed (0.04 min). 14 CpG(s) excluded due to zero coverage. 1345 regions scored (0.66 min).
* 8 out of 10 permutations completed (1345 null candidates)

Beginning permutation 9
...Chromosome chr1: Smoothed (0.04 min). 14 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop some terms in formula
1467 regions scored (0.83 min).
* 9 out of 10 permutations completed (1467 null candidates)

Beginning permutation 10
...Chromosome chr1: Smoothed (0.04 min). 23 CpG(s) excluded due to zero coverage. 1628 regions scored (0.66 min).
* 10 out of 10 permutations completed (1628 null candidates)
> packageVersion("dmrseq")
[1] ‘1.6.0’
```

Best, Mohamed

kdkorthauer commented 4 years ago

Hi @MohamedRefaat92,

This warning message shows up in the permutation calculations when the null model can't be fit because too many CpGs have zero coverage in one or the other permuted sample group. When this happens, those regions are ignored for the purposes of calculating p-values. As long as the message only appears during the permutations and not when detecting the candidate regions (as in your output above), you can safely ignore it. If you want to reduce how often this happens, I recommend more stringent filtering of CpGs prior to running dmrseq (require more than just one sample in each group to have coverage at each CpG).
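For example, stricter filtering might look something like the sketch below (not from this thread; the threshold of 10 covered samples per group and the `t1`/`t2` labels are assumptions based on the sample table earlier in this issue):

```r
covg <- getCoverage(bs, type = "Cov")
grp  <- pData(bs)$time_point

# Keep only CpGs covered in at least 10 samples per time point,
# rather than requiring just one covered sample per group
keep <- rowSums(covg[, grp == "t1"] > 0) >= 10 &
        rowSums(covg[, grp == "t2"] > 0) >= 10
bs_filtered <- bs[keep, ]
```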

I will work on including more informative warning messages and documentation in the vignette so that this is more clear. Thanks for bringing this up!

Best, Keegan