Closed MohamedRefaat92 closed 4 years ago
Hi @MohamedRefaat92,
Thank you for reporting this issue.
Indeed, it should be possible to test for time point and adjust for case id as you describe. I'm not sure what is causing this error - can you give me a bit more detail?
Does this table include all of your samples? How is case_id
encoded - as a numeric or factor? How many cores are you using? Do you get the same error if you try running on a very small subset of the data (e.g. 10K sites)? If the answer to the last question is yes, it would be great if you could send me a small example subset so that I may try and reproduce the error on my end.
Best, Keegan
Hi @kdkorthauer,
Thanks for your prompt reply. The following answers to your questions.
1-The table I've shared with you doesn't include all the samples. The original table includes 24 samples(12 in each class of the testCovariate).
2-The column case_id
is encoded as numeric. Does dmrseq treat the adjustCovariate as it does with the testCovariate by converting it to a factor?
3-For this analysis, I used only 2 cores using the following code register(MulticoreParam(2))
4-I haven't tried to down-sample the data before running dmrseq. But I will give it a try and let you know.
Best, Mohamed
Hi @kdkorthauer ,
I've used a very small subset for my analysis ( first 10k loci from chr1) keeping everything else the same. This time I didn't get the error message!
Hope you find this helpful.
Best, Mohamed
Hi @MohamedRefaat92,
Regarding (2), testCovariate
and adjustCovariate
will only be converted to a factor if they are character-encoded. Additionally, testCovariate
is converted to a factor if only two numeric values are present (a message is output when this happens).
It sounds like you intend for the adjustCovariate
in your case to be a factor. Can you try making that change and see if the same error arises?
Best, Keegan
Hi @kdkorthauer,
I have tried the suggested change and converted the adjustCovariate into a factor before running the analysis. I didn't get an error this time, but I get the message shown in the image below for all chromosomes. I tried to google the message but failed to understand the problem behind it.
I hope you can help me interpret this message.
Best, Mohamed
Thanks for providing the additional information. It seems that the model you are trying to fit contains more parameters than samples. Can you also provide the output of running pData(bs)
, where bs
is your BSseq
object?
Best, Keegan
Dear @kdkorthauer,
I've provided you with the required table below. Also, this is the code I use to run dmrseq
:
dmrseq_regions <- dmrseq(bs=sort(bs), cutoff = 0.05, testCovariate=time_point, adjustCovariate = case_id)
Best,
Mohamed
time_point | case_id | sample_id |
---|---|---|
t1 | 1 | t1_1 |
t2 | 1 | t2_1 |
t1 | 2 | t1_2 |
t2 | 2 | t2_2 |
t1 | 3 | t1_3 |
t2 | 3 | t2_3 |
t1 | 4 | t1_4 |
t2 | 4 | t2_4 |
t1 | 5 | t1_5 |
t2 | 5 | t2_5 |
t1 | 6 | t1_6 |
t2 | 6 | t2_6 |
t1 | 7 | t1_7 |
t2 | 7 | t2_7 |
t1 | 8 | t1_8 |
t2 | 8 | t2_8 |
t1 | 9 | t1_9 |
t2 | 9 | t2_9 |
t1 | 10 | t1_10 |
t2 | 10 | t2_10 |
t1 | 11 | t1_11 |
t2 | 11 | t2_11 |
t1 | 12 | t1_12 |
t2 | 12 | t2_12 |
Hi @MohamedRefaat92,
Thanks for sending the sample table. I still don't see any issues that would cause the degrees of freedom error. Would you mind sending a small subset of chromosome 1 that throws the error (e.g. first 10K sites)? You can upload it via Dropbox here, or send it to me via email at keegan@stat.ubc.ca. This will help me diagnose the issue and provide a fix.
Best, Keegan
Hi @kdkorthauer,
I only get the error message Not enough degree of freedom to fit the model. Drop some terms in formula
when I use all the data. For a small subset (first 10k loci), I don't get any errors.
I think that one thing I might try is to increment the size of the subset above 10k and see when do I start to get this error. This is not an efficient way to find what causes the error, so please let me know if you have a better idea.
Best, Mohamed
Hi @MohamedRefaat92,
I would try a few increments (e.g. first 100K loci, first 1M loci). Based on the output of your previous error message, the error should occur if you run only on chr1. It would be helpful, but not necessary, if you found a smaller subset to send for diagnostic purposes.
Best, Keegan
Hi @kdkorthauer,
I found out that the error occurs at a subset as small as the first 100k (shown in the image below). At this stage, would it help if I sent you the first 100k loci?
Best,
Mohamed
Hi @MohamedRefaat92,
That would be great if you could send the object with the first 100K loci that throw the error. You can upload it here. Thank you!
Best, Keegan
Hi @kdkorthauer ,
The size of the rds object containing the first 100k loci is too big (36 GB) for dropbox, which asks to only upload files under 2GBs. Is there any other mean of sending it to you?
Best, Mohamed
Hi @MohamedRefaat92,
Can you try recreating the object and saving to see if that reduces the file size. This is a problem we've had before, and that's how I was able to solve it (previously you sent a 19Gb object and resaving it reduced it to 56Mb).
Best, Keegan
Hi @kdkorthauer ,
I've downloaded the latest version of bsseq
, recreated the object by loading and resaving it again in the same way that you did last time, and it worked. The new -smaller- object has been just uploaded.
On a side note, i still find it confusing that creating the object from the Coverage and Methylation matrices produces an object of a significantly larger size:
bs <- BSseq(chr = chr,
pos = pos,
M = as.matrix(M_b), Cov = as.matrix(Cov_b),
sampleNames = sampleNames)
pryr::object_size(bs)
> 13.6 GB
or even trying to use the bs
object created in the previous code chunk in the following way:
bs_new <- BSseq(M = getCoverage(bs, type = "M"),
Cov = getCoverage(bs, type = "Cov"),
pos = start(bs),
chr = as.character(seqnames(bs)))
pData(bs_new) <- pData(bs)
pryr::object_size(bs_new)
> 13.3 GB
compared to saving the same bs
object, read it again, and do the following:
bs <- read_rds("bs_100k.rds")
bs_new <- BSseq(M = getCoverage(bs, type = "M"),
Cov = getCoverage(bs, type = "Cov"),
pos = start(bs),
chr = as.character(seqnames(bs)))
pData(bs_new) <- pData(bs)
pryr::object_size(bs_new)
> 48.7 MB
Best, Mohamed
Hi @MohamedRefaat92,
Thank you for sending the smaller object. I've just checked and unfortunately the pData
slot is empty, so I don't have the meta data. Can you add that and resend?
Re: the differences in object sizes, this a side effect of the subsetted object having pointers back to the original one. You can also try using the realize
function as discussed in this issue.
Best, Keegan
Hi @kdkorthauer ,
I've re-uploaded the object and made sure that all the associated data are there.
Best, Mohamed
Hi @MohamedRefaat92,
Thank you for updating the object with the metadata. I was able to reproduce the error that you were getting and tracked it down to the adjustCovariate
having unused factor levels. If you look at the patient_id
column in the pData()
slot of the object you sent, it has 24 values, 12 of which are unique, but it also has 25 factor levels. Perhaps this object is a subset of 12 patient ids, where there were originally 25? In any case, I have added a fix to deal this situation and drop unused factor levels. I just pushed the change, which is available on GitHub immediately, and Bioc Devel within the next day or two.
With the new version (dmrseq_1.5.13), the following code runs without error on the object:
library(dmrseq)
bs <- readRDS("Mohamed Refaat - 2019-10-12 01-46-37 - bs_100k.rds")
dmrseq_regions <- dmrseq(bs=sort(bs),
cutoff = 0.05,
testCovariate="time_point",
adjustCovariate = "patient_id")
Please try it out on your full data and let me know if that fixes the issue for you. Thanks again for reporting this, and sorry it took so long to track it down.
Best, Keegan
Hi @kdkorthauer,
I think it's finally working! I've been running dmrseq
over the whole data without getting any errors.
Thanks a lot for your effort in maintaining the package.
Best, Mohamed
Hi @kdkorthauer,
I'm sorry to report this, but the warning message regarding the model degrees of freedom is still there.
I might have either mistakenly tested the fix on a smaller subset instead of the whole data, or because of the fact that not all chromosomes show the warning message, I thought that it was working just fine, or maybe something else.
The following is the output of running dmrseq on a 100k dataset.
> dmrseq_regions <- dmrseq(bs=sort(bs[1:100000]),
+ cutoff = 0.05,
+ testCovariate=testCovariate,
+ adjustCovariate = adjustCovariate)
Assuming the test covariate time_point is a factor.
Condition: dx vs d33
Adjusting for covariate (s): patient_id
Parallelizing using 2 workers/cores (backend: BiocParallel:MulticoreParam).
Computing on 1 chromosome(s) at a time.
Detecting candidate regions with coefficient larger than 0.05 in magnitude.
...Chromosome chr1: Smoothed (0.04 min). 1394 regions scored (0.64 min).
* 1394 candidates detected
Performing unrestricted permutation of covariate of interest across samples to generate a null distribution of region test statis
tics
Beginning permutation 1
...Chromosome chr1: Smoothed (0.04 min). 5 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. D
rop some terms in formula
1568 regions scored (0.79 min).
* 1 out of 10 permutations completed (1568 null candidates)
Beginning permutation 2
...Chromosome chr1: Smoothed (0.04 min). 23 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop some terms in formula
some terms in formula
1535 regions scored (0.73 min).
* 2 out of 10 permutations completed (1535 null candidates)
Beginning permutation 3 some terms in formula
...Chromosome chr1: Smoothed (0.04 min). 20 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop
some terms in formula
Not enough degree of freedom to fit the model. Drop some terms in formula
1582 regions scored (0.68 min).
* 3 out of 10 permutations completed (1582 null candidates)
some terms in formula
Beginning permutation 4
...Chromosome chr1: Smoothed (0.05 min). 26 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop
some terms in formula
1609 regions scored (0.86 min).
* 4 out of 10 permutations completed (1609 null candidates)
Beginning permutation 5
...Chromosome chr1: Smoothed (0.04 min). 8 CpG(s) excluded due to zero coverage. 1703 regions scored (0.77 min).
* 5 out of 10 permutations completed (1703 null candidates)
Beginning permutation 6
...Chromosome chr1: Smoothed (0.04 min). 19 CpG(s) excluded due to zero coverage. 1739 regions scored (0.78 min).
* 6 out of 10 permutations completed (1739 null candidates)
Beginning permutation 7
...Chromosome chr1: Smoothed (0.04 min). 16 CpG(s) excluded due to zero coverage. 1397 regions scored (0.65 min).
* 7 out of 10 permutations completed (1397 null candidates)
Beginning permutation 8
...Chromosome chr1: Smoothed (0.04 min). 14 CpG(s) excluded due to zero coverage. 1345 regions scored (0.66 min).
* 8 out of 10 permutations completed (1345 null candidates) some terms in formula
Beginning permutation 9
...Chromosome chr1: Smoothed (0.04 min). 14 CpG(s) excluded due to zero coverage. Not enough degree of freedom to fit the model. Drop
some terms in formula
1467 regions scored (0.83 min).
* 9 out of 10 permutations completed (1467 null candidates)
Beginning permutation 10
...Chromosome chr1: Smoothed (0.04 min). 23 CpG(s) excluded due to zero coverage. 1628 regions scored (0.66 min).
* 10 out of 10 permutations completed (1628 null candidates)
> packageVersion("dmrseq")
[1] ‘1.6.0’
Best, Mohamed
Hi @MohamedRefaat92,
This warning message shows up in the permutation calculations when the null model can't be fit due to too many missing (zero coverage) CpGs in one or the other permuted sample groups. When this happens, these regions are ignored for the purposes of calculating p-values. As long as you don't see the message when detecting the candidate regions and only in the permutations (as you show above), you can safely ignore the warning. If you want to reduce the occurrence of this issue, I recommend more stringent filtering of CpGs prior to running dmrseq (require more than just one sample in each group with coverage at each CpG).
I will work on including more informative warning messages and documentation in the vignette so that this is more clear. Thanks for bringing this up!
Best, Keegan
Hi, Below is part of the table of my experiment. I used _timepoint as a testCovariate and it worked pretty well. But when I added _caseid as an adjustCovariate it gave me the following error.
first error: error code 1 from Lapack routine 'dgesdd'
Additionally: Warning message:
stop worker failed:
attempt to select less than one element in OneIndex
What I'm trying to do is to compare the two samples (*_a,*_b) that belong to the same case_id at different time points. Is this attainable using dmrseq?
Best, Mohamed