monarch-initiative / Squirls

Interpretable prioritization of splice variants in diagnostic next-generation sequencing
https://squirls.readthedocs.io/en/master
Other
15 stars 4 forks source link

different nucleotide change in transcript output and sequence trekker #50

Closed noorsmal closed 1 year ago

noorsmal commented 1 year ago

Hi,

I'm running squirls v2.0.0 with the 2203 | hg38/GRCh38 database files. I run annotate-csv on my list of variants with the standard output etc.

In the html file, I see a difference between the predicted effect on transcript and what is shown in the sequence trekker. For example, a variant in CAMK2D for which the input line is: chr4,V4,113609151,C,A

The transcript table gives me several transcripts with the top one saying this genomic variant leads to c.275+1G>T, a splice_donor_variant with a SQUIRLS score of 0.350.

The sequence trekker however, shows a change from G>A.

This happens for all variants where the genomic nucleotide change is the reverse complement of the nuceotide change in the transcript.

I notice you also do not check if the variant submitted is correct, for example I can also give chr4,V4,113609151,G,T to see the "correct" sequence trekker, even though there is no G in that position in chr4.

ielis commented 1 year ago

Hi @noorsmal thank you very much for reporting this bug. I reproduced it and I will deploy a fix soon. I will keep you posted.

You are right about not checking the input. The reasons behind are to simplify the code base and to improve runtime performance. However, there are other tools for validating variant alleles in bioinformatics pipelines, for instance GATK's ValidateVariants can be run if integrity of a VCF file is in question.

I am not aware of a tool for checking the CSV file. However, the CSV file format is designed for annotation of a few variants (as mentioned in the docs). I see that the handcrafted files are prone to errors. To ensure the positions are OK, I would recommend using a service such as variant validator.

ielis commented 1 year ago

This is a dev release that will produce correct trekker. Please try it out if you can and let me know if it does not work OK.

Thank you again for reporting the bug!

squirls-cli-2.0.1-SNAPSHOT-distribution.zip

noorsmal commented 1 year ago

Hi,

I tried the 2.0.1 version, but for me it produces the same error.

my command: java -jar squirls-cli-2.0.1-SNAPSHOT/squirls-cli-2.0.1-SNAPSHOT.jar annotate-csv -d dbSquirls/ --out-dir mydata mydata/EPID_burden_denovo.csv EPID_burden_denovo_SQUIRLS2.0.1

ielis commented 1 year ago

Hi @noorsmal thanks a lot for looking into this again.

I think you are running SQUIRLS in a correct way but from some reason you are not getting the fixed results.

When I use a file example.hg38.csv with the hg38 coordinates for the variant you posted above:

CHROM,ID,POS,REF,ALT
chr4,CAMK2D,113609151,C,A

to run annotate-csv

java -jar squirls-cli-2.0.1.jar annotate-csv \
  --data-directory path/to/2203_hg38 \
  example.hg38.csv \
  squirls-strand-bug

I receive the following image

The variant seems to be OK - G>T instead of G>A.

Can you please verify that this is not happening on your side?

noorsmal commented 1 year ago

Hi,

It's fixed now. When I got version 2.0.1 from the link you sent in your comment (https://github.com/TheJacksonLaboratory/Squirls/files/11408039/squirls-cli-2.0.1-SNAPSHOT-distribution.zip), it did not work correctly. However, with the version from here: https://github.com/TheJacksonLaboratory/Squirls/releases/download/v2.0.1/squirls-cli-2.0.1-distribution.zip I did get the correct trekker sequence.

Thank you!

ielis commented 1 year ago

Hi, yes, you're right. The previous version did not work OK, sorry for that. Thank you!