konradjk / loftee

MIT License
172 stars 55 forks source link

Discrepancies between output of the plugin and LOF annoations in Gnomad #50

Open dolevrahat opened 5 years ago

dolevrahat commented 5 years ago

Hello

I analysed some variants using the LOFTEE plugin, and have also extracted the LOFTEE annotations for the same variants from the Gnomad API, and I found discrepancies in the calls between the LOFTEE plugin and Gnomad on some variants, to the extant that some variants are classified by Gnomad as high confidence LOF, but are classified by the plugin as low confidence LOF.

Many of these discrepant are flagged by the plugin (but not by Gnomad), as END_TRUNC, and are assigned by the plugin GERP_DIST 0, whereas the GERP_DIST in the Gnomad annotations is non-zero.

For example, for the variant 1-152081684-TTCTG-T I got from the plugin the following LOF info: GERP_DIST:0,BP_DIST:1824,PERCENTILE:0.687242798353909,DIST_FROM_LAST_EXON:-3869,50_BP_RULE:FAIL,PHYLOCSF_TOO_SHORT1-152081684-TTCTG-T

And this from Gnomad: GERP_DIST:208.392210000001,BP_DIST:1824,PERCENTILE:0.687242798353909,DIST_FROM_LAST_EXON:-3869,50_BP_RULE:FAIL,PHYLOCSF_TOO_SHORT

When inspecting the log file, I see repeated warnings such as:

 WARNING: 33436 : Use of uninitialized value in split at human-vep-v14/resources/vep_plugins/Plugins/LoF.pm line 573, <__ANONIO__> line 283.
Use of uninitialized value $number_of_exons in subtraction (-) at human-vep-v14/resources/vep_plugins/Plugins/LoF.pm line 586, <__ANONIO__> line 283.
Use of uninitialized value in split at human-vep-v14/resources/vep_plugins/Plugins/LoF.pm line 573, <__ANONIO__> line 283

Can you please help me understand these discrepancies?

Best wishes Dolev

konradjk commented 5 years ago

Can you share the command call you used to run VEP? I feel like I've seen this before with a missing file dependency, but not sure.

dolevrahat commented 5 years ago

Sure

ensembl-vep/vep --force_overwrite -i $infile -o  $outfile \
 --fork=4 --cache --offline --dir_cache . --everything --no_stats
 -dir_plugins human-vep-v14/resources/vep_plugins/Plugins/ \
 --plugin LoF,human_ancestor_fa:human_ancestor.fa.gz,loftee_path:loftee,conservation_file:phylocsf_gerp.sql \
--plugin SpliceDist --plugin PeptideLength

Thank you!

konradjk commented 5 years ago

Whoops, my apologies, this appears to be undocumented on the GRCh37 version. You'll need to download the GERP_scores.final.sorted.txt.gz and corresponding .tbi file from: https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/ and then point loftee to them using ...,conservation_file:phylocsf_gerp.sql,gerp_file:/path/to/GERP_scores.final.sorted.txt.gz

dolevrahat commented 5 years ago

Thank you! I will repeat the analysis with the GERP score file and will let you know if it solved the issue. It may take me a few days to get around to this.

dolevrahat commented 5 years ago

Thank you! I will repeat the analysis with the GERP score file and will update if it solved the issue. It may take me a few days to get around to this,

adobbyn commented 5 years ago

Hello Konrad,

I'm running loftee using GRCh38, and I'm seeing the same warning:

Use of uninitialized value in split at /hpc/packages/minerva-centos7/vep/96/Plugins/LoF.pm line 573, <__ANONIO__> line 43002.
Use of uninitialized value $number_of_exons in subtraction (-) at /hpc/packages/minerva-centos7/vep/96/Plugins/LoF.pm line 586, <__ANONIO__> line 43002.

My command is:

/hpc/packages/minerva-centos7/vep/96/vep-release/vep \
    -i /sc/hydra/projects/GENECAD/dobbya01/hkop/data/SINAI_Freeze_Two.GL.pVCF.PASS.chr22.sites.vcf \
    --format vcf \
    --cache \
    --dir_cache /sc/hydra/projects/GENECAD/dobbya01/software/vep \
    --assembly GRCh38 \
    --dir_plugins /hpc/packages/minerva-centos7/vep/96/Plugins \
    --plugin LoF,loftee_path:/hpc/packages/minerva-centos7/vep/96/loftee,human_ancestor_fa:/sc/hydra/projects/GENECAD/dobbya01/software/vep/human_ancestor/human_ancestor.fa.gz,conservation_file:/sc/hydra/projects/GENECAD/dobbya01/software/vep/conservation/phylocsf_gerp.sql \
    --output_file /sc/hydra/projects/GENECAD/dobbya01/hkop/data/SINAI_Freeze_Two.GL.pVCF.PASS.chr22.sites.vep.loftee.pickallele.everything.vcf \
    --force_overwrite \
    --offline \
    --pick_allele \
    --canonical \
    --everything \
    --vcf \
    --buffer_size 1000

I also tried specifying what I thought might be the corresponding GRCh38 GERP file as gerp_file:/sc/hydra/projects/GENECAD/dobbya01/software/vep/GERP/gerp_conservation_scores.homo_sapiens.GRCh38.bw but that didn't resolve the issue. Please let me know if there's something I'm missing.

Thanks in advance, Amanda

oleraj commented 4 years ago

Hi @konradjk ,

I have the same error message and narrowed it down to a single repeatable variant. It seems that sometimes VEP doesn't report the exon number, which is not accounted for in your code. For example, this variant:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  DS-101668
22  51042513    rs140511    C   CG  45139.1 PASS    AC=238;AF=1;AN=238;DB;DP=1154;ExcessHet=3.0103;FS=0;InbreedingCoeff=-0.0275;MLEAC=238;MLEAF=1;MQ=60.05;POSITIVE_TRAIN_SITE;QD=27.36;SOR=2.421;VQSLOD=2.96;culprit=FGT:AD:DP:GQ:PGT:PID:PL   1/1:0,2:2:6:.:.:76,6,0

When I run VEP (vep96) with this command:

$EBROOTVEP/ensembl-vep/vep --cache --dir $EBROOTVEP/.vep  --assembly GRCh37 --offline --force_overwrite --vcf --fasta $EBROOTVEP/.vep/homo_sapiens/96_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz -i $vcf -o $out

I get this output (I separated out each CSQ transcript into a separate line for easy readability):

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
22  51042513    rs140511    C   CG  45139.1 PASS    AC=238;AF=1;AN=238;DB;DP=1154;ExcessHet=3.0103;FS=0;InbreedingCoeff=-0.0275;MLEAC=238;MLEAF=1;MQ=60.05;POSITIVE_TRAIN_SITE;QD=27.36;SOR=2.421;VQSLOD=2.96;culprit=FS;
    CSQ=
        G|intron_variant|MODIFIER|MAPK8IP2|ENSG00000008735|Transcript|ENST00000008876|protein_coding||1/11||||||||||1||HGNC|6883,
        G|frameshift_variant&splice_region_variant&intron_variant|HIGH|MAPK8IP2|ENSG00000008735|Transcript|ENST00000329492|protein_coding|||||900-901|783-784|261-262|-/X|-/G|||1||HGNC|6883,
        G|intron_variant|MODIFIER|MAPK8IP2|ENSG00000008735|Transcript|ENST00000341339|protein_coding||3/10||||||||||1||HGNC|6883,
        G|5_prime_UTR_variant|MODIFIER|MAPK8IP2|ENSG00000008735|Transcript|ENST00000399908|protein_coding|3/10||||704-705|||||||1||HGNC|6883,
        G|5_prime_UTR_variant|MODIFIER|MAPK8IP2|ENSG00000008735|Transcript|ENST00000399912|protein_coding|5/12||||919-920|||||||1||HGNC|6883,
        G|frameshift_variant&splice_region_variant&intron_variant|HIGH|MAPK8IP2|ENSG00000008735|Transcript|ENST00000442429|protein_coding|||||900-901|783-784|261-262|-/X|-/G|||1||HGNC|6883,
        G|upstream_gene_variant|MODIFIER|CHKB|ENSG00000100288|Transcript|ENST00000463053|processed_transcript|||||||||||2629|-1||HGNC|1938  GT:AD:DP:GQ:PGT:PID:PL  1/1:0,2:2:6:.:.:76,6,0

None of these have an exon number listed (first field after BIOTYPE), and some of them have an intron number listed (second field after BIOTYPE).

Here's what this variant looks like in the browser: image

It falls into one of these weird 2nt "introns" that is probably an error in the assembly. Any way we can handle this properly with LOFTEE? Maybe if a variant has consequence frameshift_variant but also intron_variant it should be ignored? Or maybe skip the ones with an empty EXON field?

Thanks!

Andrew

jhchung commented 4 years ago

@konradjk, @dolevrahat

Were you able to resolve this issue? I also have a question regarding the END_TRUNC filter and differences between the gnomAD website and my own LOFTEE runs.

One of my variants of interest is in the last exon of a transcript with two exons.

Are there some special parameters used for gnomAD annotations?

Using Docker vep version 98.3 and the most recent master of LOFTEE:


vep \
-i /mnt/{input.vcf} \
-o /mnt/{output.vep} \
--json \
--compress_output bgzip \
--format vcf \
--force_overwrite \
--fork {threads} \
--fasta /mnt/.vep/fasta/ensembl_grch37/Homo_sapiens.GRCh37.dna.fa.gz \
--assembly {params.assembly} \
--species {params.species} \
--cache \
--offline \
--dir_cache /mnt/.vep \
--cache_version {params.cache_version} \
--everything \
--dir_plugins /mnt/.vep/Plugins \
--plugin LoF,loftee_path:/mnt/loftee,human_ancestor_fa:/mnt/.vep/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/mnt/.vep/loftee_data/phylocsf_gerp.sql,gerp_file:/mnt/.vep/loftee_data/GERP_scores.final.sorted.txt.gz```
jhchung commented 4 years ago

Quick update. Removing filter_position:0.05 option seems to make the LOFTEE output match the gnomAD results.

Is this no longer a recommended option? I also saw several references to filter_position were commented out in the master.

konradjk commented 4 years ago

It falls into one of these weird 2nt "introns" that is probably an error in the assembly. Any way we can handle this properly with LOFTEE? Maybe if a variant has consequence frameshift_variant but also intron_variant it should be ignored? Or maybe skip the ones with an empty EXON field?

Hmm, yeah I feel like we used to have an "exon_number is defined" filter and we must have removed it when we started adding intronic (predicted splice LoFs). That's definitely not ideal - unfortunately I don't have the bandwidth right now to fix loftee, but PRs welcome!

konradjk commented 4 years ago

Quick update. Removing filter_position:0.05 option seems to make the LOFTEE output match the gnomAD results.

Is this no longer a recommended option? I also saw several references to filter_position were commented out in the master.

Yes, we've switched to a "GERP-weighted number of bases" scheme for those that fail the 50 bp rule. There's a description of this decision in the supplement of the gnomAD preprint (page 34).

vserret commented 4 years ago

Hi,

I feel I have the same issue as @jhchung but I didn't manage to remove the filter_position option as he did. I tried using filter_position:0 in my cmd (below) but LOFTEE still annotates:

And if the "GERP-weighted number of bases" rule is not available in LOFTEE how could I use the LoF details in my output (below) to match gnomAD? (I am not sure I understand correctly the threshold used to annotate variants as HC in the flagship paper)

Here is the command I've used (using grch38 branch):

vep 
-i input_file.vcf
-o output_file.vcf
--cache
--fa Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--plugin LoF,loftee_path:loftee,human_ancestor_fa:human_ancestor.fa.rz,\
filter_position:0,conservation_file:gerp_conservation_scores.homo_sapiens.GRCh38.bw

Here are the LoF annotations of my two variants in the canonical isoforms:

rs601338 = LC|END_TRUNC||PERCENTILE:0.446705426356589&GERP_DIST:-217.357113099098&BP_DIST:571&DIST_FROM_LAST_EXON
rs11310407 = LC|END_TRUNC|SINGLE_EXON|PERCENTILE:0.128042328042328&GERP_DIST:-704.410578775406&BP_DIST:824&DIST_FROM_LAST_EXON:-211&50_BP_RULE:FAIL&ANN_ORF:1584.85&MAX_ORF:1584.85

Thanks in advance