Closed SowmyaPulapet closed 1 year ago
Hi @SowmyaPulapet could you show me the file annotation.gtf?
Also, does the error occur without using the annotation?
Hi @hhg7 ,
The error doesn't occur when running without annotation but I am not getting a proper result either. A file called case1_vs_case2_c10_CpN5_d1_p0.05_P10.tsv but it only have header inside. I am attaching the log her. I couldn't attach the annotation.gtf (44.2 mb), let me know if there is any other way I can send this.
Thank you.
Hi @SowmyaPulapet
there may be 0 DMRs because your parameters are too selective, you can try something like:
defiant -CpN 1,5,1 -c 2,10,2 -P 5,10,1 -L case1,case2 -l SK2_vs_SK6 -a annotation.gtf -cpu 6 -fdr bh -i case1.CpG_report.txt case2.CpG_report.txt
which will give you a better impression of what parameters are most appropriate for your data. Different parameters are appropriate for different data, and this should show you how many DMRs you get depending on the parameters.
The above code will try minimum coverage of 2,4,6,8,10 with minimum percents of 5,6,7,8,9,10, with minimum CpN of 1,2,3,4,5. That command will take a while to run, but at 6 CPU it shouldn't take too long.
With regard to the annotation, could you provide a short example of it, or even the file itself? I think that one of the strings may be too long.
Hi @hhg7 ,
So I ran the command you gave without providing the annotation.gtf (segmentation fault when using annotation.gtf) but it died at fdr.h line 408 and the output files created only have header in it.
For some reason I am not being able to attach annotation.gtf, so I will add few lines here:
Pn1 EVM transcript 10842 11267 . + . transcript_id "evm.model.Pn1.1"; gene_id "evm.model.Pn1.1"; Pn1 EVM exon 10842 11267 . + . transcript_id "evm.model.Pn1.1"; Pn1 EVM CDS 10842 11267 . + 0 transcript_id "evm.model.Pn1.1"; Pn1 EVM transcript 19952 20188 . + . transcript_id "evm.model.Pn1.2"; gene_id "evm.model.Pn1.2"; Pn1 EVM exon 19952 20188 . + . transcript_id "evm.model.Pn1.2"; Pn1 EVM CDS 19952 20188 . + 0 transcript_id "evm.model.Pn1.2"; Pn1 EVM transcript 67496 68035 . - . transcript_id "evm.model.Pn1.3"; gene_id "evm.model.Pn1.3"; Pn1 EVM exon 67496 68035 . - . transcript_id "evm.model.Pn1.3"; Pn1 EVM CDS 67496 68035 . - 0 transcript_id "evm.model.Pn1.3"; Pn1 EVM transcript 68921 69345 . - . transcript_id "evm.model.Pn1.4"; gene_id "evm.model.Pn1.4"; Pn1 EVM exon 68921 69087 . - . transcript_id "evm.model.Pn1.4"; Pn1 EVM exon 69249 69345 . - . transcript_id "evm.model.Pn1.4"; Pn1 EVM CDS 68921 69087 . - 2 transcript_id "evm.model.Pn1.4"; Pn1 EVM CDS 69249 69345 . - 0 transcript_id "evm.model.Pn1.4"; Pn1 EVM transcript 73564 74589 . - . transcript_id "evm.model.Pn1.5"; gene_id "evm.model.Pn1.5"; Pn1 EVM exon 73564 74589 . - . transcript_id "evm.model.Pn1.5"; Pn1 EVM CDS 73564 74589 . - 0 transcript_id "evm.model.Pn1.5"; Pn1 EVM transcript 76279 78348 . - . transcript_id "evm.model.Pn1.6"; gene_id "evm.model.Pn1.6";
I am attaching the complete log for the recent run here.
Thank you.
Hi @SowmyaPulapet
I don't think that this gtf is formatted well, the wide variety of formats available for GTF was difficult to program.
What does each column mean? "Pn1" and "EVM" could mean either chromosome or gene name, I'm not sure which.
As for debugging, could you send your data? Truncated data should work too. I would be able to help better.
A script like this could help translate your gtf to refFlat format, but as I'm not sure which column is which in your file, I cannot guarantee that this will work
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use autodie ':all';
use Scalar::Util 'looks_like_number';
open my $gtf, '>', $ARGV[0];
open my $refFlat, '>', "$ARGV[0].refFlat";
while (<$gtf>) {
next if /^#/; # skip comment lines
my @l = split /\t/;
my $chr = $l[0];
my $start = $l[3];
my $end = $l[4];
foreach my $n ($start, $end) {
if (not looks_like_number($n)) {
die "$n isn't a number.";
}
}
my $sense = $l[6];
if ($sense !~ m/^[+-]$/) {
die "$sense is nother \"+\" nor \"-\".";
}
my $gene_id;
if ($l[8] =~ m/gene_id "([^"]+)/) {
$gene_id = $1
} else {
die "Can't find a gene in $_";
}
my $name;
if ($l[8] =~ m/gene_name "([^"]+)/) {
$name = $1
} else {
$name = $gene_id
}
# say $l[8]; exit;
print $refFlat join ("\t", $name, $gene_id, $chr, $sense, $start, $end) . "\n";
}
Hi @hhg7,
I understand that the reference is not in a standard fast. Unfortunately it is the only reference available currently so no other choice. I got it from here. 'Pn1' refers to the chromosome number but am not sure about 'EVM'.
I am also attaching a small part from my data here.
Hi @SowmyaPulapet
could I try the full data set?
Also, try without "fdr" I put that there only for the reviewers, it's not mathematically valid.
Hi @hhg7,
I ran again removing the "fdr" and I got results with -CpN 4 like this:
Chromosome | Start | End | #mCpN | #Diff.CpN | Mean_Difference | case1 | case2 Pn1 | 70823 | 71225 | 9 | 4 | -20.3 | [35.1] | [14.8] Pn1 | 1430913 | 1430957 | 5 | 3 | 21.6 | [68.8] | [90.4] Pn1 | 1981948 | 1982542 | 5 | 2 | 19.7 | [20.3] | [40.0] Pn1 | 2306572 | 2306691 | 5 | 2 | -27.1 | [28.6] | [1.5] Pn1 | 2725270 | 2725441 | 8 | 6 | -30.7 | [43.9] | [13.3] Pn1 | 2728632 | 2728775 | 4 | 3 | -34.7 | [78.3] | [43.7] Pn1 | 2892715 | 2893040 | 4 | 3 | -31.6 | [85.2] | [53.6] Pn1 | 3369529 | 3369585 | 7 | 3 | -32.2 | [35.6] | [3.4] Pn1 | 3843196 | 3843235 | 4 | 2 | -16.9 | [18.2] | [1.3]
I guess this is the result I should be getting. I got around ~2500 DMR's, I don't know whether this is the ideal count or is it too low or high. Also with -CpN 5 I got ~1600 DMR's. Which one should I consider ?
Thank you.
Hi @SowmyaPulapet
the number of DMRs is up to you, the choice is between sensitivity and selectivity.
I personally would use a longer CpN number to be more conservative, and sure that the results that you're reporting are correct.
Regarding the false discovery rate, think about the p-value cutoff of N CpG, if you have a minimum of 5 CpG in a row, the chance that all 5 sequentially would be < 0.05 is about 1 in 3,200,000 (1/20^5), which should be more than sufficient.
Please close the issue if you are satisfied with my responses, otherwise I'm happy to answer any questions that you may have.
Hi @hhg7,
Yes my issue is solved. I would like to know about the -heatmap option provided. What values are used to plot the heat map and on what basis is the gene names (number 0,1,2,3 etc) is given ? How can I know which DMR is in the plot ?
hi @SowmyaPulapet
the values used to make the heatmap are the % methylation for each replicate in your system.
The gene numbers are written as each DMR is found. That is, the first DMR found is the first row. Each row in the DMR plot eps file will be written in the same order that the output file writes the DMRs.
I'm not an R expert, but when you use the "-heatmap" option, you can alter the heatmap tsv file and R script so that the gene names appear in the output image.
Does this make sense?
Hi @hhg7 ,
Yes, that makes sense. However, the eps format is making it a bit a difficult to convert it to .pdf or .jpeg. I will find an alternate way to do that.
Thank you so much.
Hi,
I am using defiant to do DMR analysis for my two WGBS samples. I used Bismark Coverage2cytosine output as input for defiant. The command used to run defiant is:
defiant -L case1,case2 -l SK2_vs_SK6 -a annotation.gtf -cpu 6 -fdr bh -i case1.CpG_report.txt case2.CpG_report.txt
But I am getting the following error upon running:
I have sufficient memory in my server (256 GB). Please Help me to solve this error.
Thank you.