hhg7 / defiant

Differential methylation: Easy, Fast, Identification and ANnoTation
GNU General Public License v2.0
17 stars 9 forks source link

Error: Segmentation fault (core dumped) #13

Closed SowmyaPulapet closed 1 year ago

SowmyaPulapet commented 1 year ago

Hi,

I am using defiant to do DMR analysis for my two WGBS samples. I used Bismark Coverage2cytosine output as input for defiant. The command used to run defiant is:

defiant -L case1,case2 -l SK2_vs_SK6 -a annotation.gtf -cpu 6 -fdr bh -i case1.CpG_report.txt case2.CpG_report.txt

But I am getting the following error upon running:

script.sh: line 2: 159754 Segmentation fault (core dumped)

I have sufficient memory in my server (256 GB). Please Help me to solve this error.

Thank you.

hhg7 commented 1 year ago

Hi @SowmyaPulapet could you show me the file annotation.gtf?

Also, does the error occur without using the annotation?

SowmyaPulapet commented 1 year ago

Hi @hhg7 ,

The error doesn't occur when running without annotation but I am not getting a proper result either. A file called case1_vs_case2_c10_CpN5_d1_p0.05_P10.tsv but it only have header inside. I am attaching the log her. I couldn't attach the annotation.gtf (44.2 mb), let me know if there is any other way I can send this.

Thank you.

run_info.txt

hhg7 commented 1 year ago

Hi @SowmyaPulapet

there may be 0 DMRs because your parameters are too selective, you can try something like:

defiant -CpN 1,5,1 -c 2,10,2 -P 5,10,1 -L case1,case2 -l SK2_vs_SK6 -a annotation.gtf -cpu 6 -fdr bh -i case1.CpG_report.txt case2.CpG_report.txt

which will give you a better impression of what parameters are most appropriate for your data. Different parameters are appropriate for different data, and this should show you how many DMRs you get depending on the parameters.

The above code will try minimum coverage of 2,4,6,8,10 with minimum percents of 5,6,7,8,9,10, with minimum CpN of 1,2,3,4,5. That command will take a while to run, but at 6 CPU it shouldn't take too long.

With regard to the annotation, could you provide a short example of it, or even the file itself? I think that one of the strings may be too long.

SowmyaPulapet commented 1 year ago

Hi @hhg7 ,

So I ran the command you gave without providing the annotation.gtf (segmentation fault when using annotation.gtf) but it died at fdr.h line 408 and the output files created only have header in it.

For some reason I am not being able to attach annotation.gtf, so I will add few lines here:

Pn1   EVM transcript  10842   11267   .   +   .   transcript_id "evm.model.Pn1.1"; gene_id "evm.model.Pn1.1";
Pn1   EVM exon    10842   11267   .   +   .   transcript_id "evm.model.Pn1.1";
Pn1   EVM CDS 10842   11267   .   +   0   transcript_id "evm.model.Pn1.1";
Pn1   EVM transcript  19952   20188   .   +   .   transcript_id "evm.model.Pn1.2"; gene_id "evm.model.Pn1.2";
Pn1   EVM exon    19952   20188   .   +   .   transcript_id "evm.model.Pn1.2";
Pn1   EVM CDS 19952   20188   .   +   0   transcript_id "evm.model.Pn1.2";
Pn1   EVM transcript  67496   68035   .   -   .   transcript_id "evm.model.Pn1.3"; gene_id "evm.model.Pn1.3";
Pn1   EVM exon    67496   68035   .   -   .   transcript_id "evm.model.Pn1.3";
Pn1   EVM CDS 67496   68035   .   -   0   transcript_id "evm.model.Pn1.3";
Pn1   EVM transcript  68921   69345   .   -   .   transcript_id "evm.model.Pn1.4"; gene_id "evm.model.Pn1.4";
Pn1   EVM exon    68921   69087   .   -   .   transcript_id "evm.model.Pn1.4";
Pn1   EVM exon    69249   69345   .   -   .   transcript_id "evm.model.Pn1.4";
Pn1   EVM CDS 68921   69087   .   -   2   transcript_id "evm.model.Pn1.4";
Pn1   EVM CDS 69249   69345   .   -   0   transcript_id "evm.model.Pn1.4";
Pn1   EVM transcript  73564   74589   .   -   .   transcript_id "evm.model.Pn1.5"; gene_id "evm.model.Pn1.5";
Pn1   EVM exon    73564   74589   .   -   .   transcript_id "evm.model.Pn1.5";
Pn1   EVM CDS 73564   74589   .   -   0   transcript_id "evm.model.Pn1.5";
Pn1   EVM transcript  76279   78348   .   -   .   transcript_id "evm.model.Pn1.6"; gene_id "evm.model.Pn1.6";

I am attaching the complete log for the recent run here.

run_log.txt

Thank you.

hhg7 commented 1 year ago

Hi @SowmyaPulapet

I don't think that this gtf is formatted well, the wide variety of formats available for GTF was difficult to program.

What does each column mean? "Pn1" and "EVM" could mean either chromosome or gene name, I'm not sure which.

As for debugging, could you send your data? Truncated data should work too. I would be able to help better.

A script like this could help translate your gtf to refFlat format, but as I'm not sure which column is which in your file, I cannot guarantee that this will work

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use autodie ':all';
use Scalar::Util 'looks_like_number';

open my $gtf, '>', $ARGV[0];
open my $refFlat, '>', "$ARGV[0].refFlat";
while (<$gtf>) {
    next if /^#/; # skip comment lines
    my @l = split /\t/;
    my $chr = $l[0];
    my $start = $l[3];
    my $end = $l[4];
    foreach my $n ($start, $end) {
        if (not looks_like_number($n)) {
            die "$n isn't a number.";
        }
    }
    my $sense = $l[6];
    if ($sense !~ m/^[+-]$/) {
        die "$sense is nother \"+\" nor \"-\".";
    }

    my $gene_id;
    if ($l[8] =~ m/gene_id "([^"]+)/) {
        $gene_id = $1
    } else {
        die "Can't find a gene in $_";
    }
    my $name;
    if ($l[8] =~ m/gene_name "([^"]+)/) {
        $name = $1
    } else {
        $name = $gene_id
    }
#   say $l[8]; exit;
    print $refFlat join ("\t", $name, $gene_id, $chr, $sense, $start, $end) . "\n";
}
SowmyaPulapet commented 1 year ago

Hi @hhg7,

I understand that the reference is not in a standard fast. Unfortunately it is the only reference available currently so no other choice. I got it from here. 'Pn1' refers to the chromosome number but am not sure about 'EVM'.

I am also attaching a small part from my data here.

case1.CpG_report.txt case2.CpG_report.txt

hhg7 commented 1 year ago

Hi @SowmyaPulapet

could I try the full data set?

Also, try without "fdr" I put that there only for the reviewers, it's not mathematically valid.

SowmyaPulapet commented 1 year ago

Hi @hhg7,

I ran again removing the "fdr" and I got results with -CpN 4 like this:

Chromosome | Start | End | #mCpN | #Diff.CpN | Mean_Difference | case1 | case2 Pn1 | 70823 | 71225 | 9 | 4 | -20.3 | [35.1] | [14.8] Pn1 | 1430913 | 1430957 | 5 | 3 | 21.6 | [68.8] | [90.4] Pn1 | 1981948 | 1982542 | 5 | 2 | 19.7 | [20.3] | [40.0] Pn1 | 2306572 | 2306691 | 5 | 2 | -27.1 | [28.6] | [1.5] Pn1 | 2725270 | 2725441 | 8 | 6 | -30.7 | [43.9] | [13.3] Pn1 | 2728632 | 2728775 | 4 | 3 | -34.7 | [78.3] | [43.7] Pn1 | 2892715 | 2893040 | 4 | 3 | -31.6 | [85.2] | [53.6] Pn1 | 3369529 | 3369585 | 7 | 3 | -32.2 | [35.6] | [3.4] Pn1 | 3843196 | 3843235 | 4 | 2 | -16.9 | [18.2] | [1.3]

I guess this is the result I should be getting. I got around ~2500 DMR's, I don't know whether this is the ideal count or is it too low or high. Also with -CpN 5 I got ~1600 DMR's. Which one should I consider ?

Thank you.

hhg7 commented 1 year ago

Hi @SowmyaPulapet

the number of DMRs is up to you, the choice is between sensitivity and selectivity.

I personally would use a longer CpN number to be more conservative, and sure that the results that you're reporting are correct.

Regarding the false discovery rate, think about the p-value cutoff of N CpG, if you have a minimum of 5 CpG in a row, the chance that all 5 sequentially would be < 0.05 is about 1 in 3,200,000 (1/20^5), which should be more than sufficient.

Please close the issue if you are satisfied with my responses, otherwise I'm happy to answer any questions that you may have.

SowmyaPulapet commented 1 year ago

Hi @hhg7,

Yes my issue is solved. I would like to know about the -heatmap option provided. What values are used to plot the heat map and on what basis is the gene names (number 0,1,2,3 etc) is given ? How can I know which DMR is in the plot ?

hhg7 commented 1 year ago

hi @SowmyaPulapet

the values used to make the heatmap are the % methylation for each replicate in your system.

The gene numbers are written as each DMR is found. That is, the first DMR found is the first row. Each row in the DMR plot eps file will be written in the same order that the output file writes the DMRs.

I'm not an R expert, but when you use the "-heatmap" option, you can alter the heatmap tsv file and R script so that the gene names appear in the output image.

Does this make sense?

SowmyaPulapet commented 1 year ago

Hi @hhg7 ,

Yes, that makes sense. However, the eps format is making it a bit a difficult to convert it to .pdf or .jpeg. I will find an alternate way to do that.

Thank you so much.