statgen / EPACTS

GNU General Public License v3.0
34 stars 20 forks source link

EPACTS tool support of GRCh38 #20

Open ana-stankovic opened 4 years ago

ana-stankovic commented 4 years ago

Hello, I have files run on both GRCh37 and 38 reference and would like to use EPACTS tools to process them.

In the post on the EPACTS Google Group ( here ) the question of tool support for GRCh38 was asked and the answer was given to change the EPACTS/scripts/epacts.pm file. But modifying the file by hard coding for one or the other is not something that is compatible with my implementation. Is there another way to achieve compatibility for both references?

Thank you.

jonathonl commented 4 years ago

Can you try using the --ref option? See epacts <command> --help for more information.

ana-stankovic commented 4 years ago

What I found in the code from EPACTS/scripts/epacts.pm file, you can see that the tool is hard-coded for GRCh37 - chromosome names and lengths that correspond to 37 reference, but not 38.

BEGIN {
## Variables below are hard-coded based on GRCh37
    @chrs = (1..22,"X","Y","MT");
    @szchrs = qw(249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 16569);
    %ichrs = ();

    @cumszchrsMb = (0);
    for(my $i=0; $i < @chrs; ++$i) {
    push(@cumszchrsMb,$szchrs[$i]/1e6+$cumszchrsMb[$i]);
    $ichrs{$chrs[$i]} = $i;
    }
}

With that in mind, are you sure that giving -ref as GRCh38 would result in correct values?

jonathonl commented 4 years ago

It probably hasn't made it to the master branch yet, but this was fixed with https://github.com/statgen/EPACTS/commit/60fe5db5b1f57bfae3e57330bdb7f6729ac0f1e0. I would recommend using the latest pre-release (https://github.com/statgen/EPACTS/releases).

ana-stankovic commented 4 years ago

Hello, thank you for your response. Per your recommendation, I have installed the pre-release version and run with the 38 reference file. However, in the logs I see that it is still using files for hg19: Load gene file /usr/local/share/EPACTS/hg19_gencodeV14.txt.gz...

With that in mind, and the parts of code that are hardcoded (like the different chromosome lengths that correspond to 37 reference that I mentioned above), how does this affect the outputs? Do you have any experience with this? Thank you

jonathonl commented 4 years ago

Which analysis are you trying to run? Can you attach the entire log file?

ana-stankovic commented 4 years ago

For now I am trying a test run for EPACTS Single Variant Test. This is the command line epacts single --vcf true_1000G_exome_chr20_example_softFiltered.calls.vcf_GRCh38_liftover.vcf.gz --ped 1000G_dummy_pheno.ped --test b.score --pheno DISEASE --cov AGE --cov SEX --min-maf 0.001 --anno --ref GRCh38ERCC.ensembl95.fasta --out test --run 2 And the log file is attached here job.err.log For now I am just testing the implementation with Single Variant Test, but I plan to use Create Marker Group File and Create Kinship Matrix, Groupwise Burden Test, Zoom Plot and Plot, so I am interested in the effect of this on them as well.

jonathonl commented 4 years ago

I see. It's because you have the --anno option enabled. We will need to add a parameter in epacts single for specifying the gene file. In the meantime, you can remove --anno and then manually run epacts anno on the resulting *.epacts file, using --genef to specify the appropriate gene file. That is if you desire gene annotations in your results.

ana-stankovic commented 4 years ago

Thank you, this solves the problem of the gene anottation. How does the hardcoded part of the EPACTS/scripts/epacts.pm script affect results? As I mentioned, this part of the script is hardcoded:

## Variables below are hard-coded based on GRCh37
    @chrs = (1..22,"X","Y","MT");
    @szchrs = qw(249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 16569);
    %ichrs = ();

    @cumszchrsMb = (0);
    for(my $i=0; $i < @chrs; ++$i) {
    push(@cumszchrsMb,$szchrs[$i]/1e6+$cumszchrsMb[$i]);
    $ichrs{$chrs[$i]} = $i;
    }
}

For this to be compatible for 38, do we need to change this, or is there a workaround?

jonathonl commented 4 years ago

It doesn't. The initRef function is called when an alternative reference is provided, which overrides the initial values. See https://github.com/statgen/EPACTS/blob/develop/scripts/epacts.pm#L576-L648.

ana-stankovic commented 4 years ago

Hello, thank you for all the help so far. I am having trouble finding a corresponding version of the Gene Map file (the genetic_map_GRCh37_wgs.txt.gz file from the data package) for the GRCh38. I was not able to locate it in the UCSC Genome Browser or Downloads. Thank you.

jonathonl commented 4 years ago

You can find a build 38 map file at http://csg.sph.umich.edu/locuszoom/download/recomb-hg38.tar.gz