liguowang / CrossMap

CrossMap is a python program to lift over genome coordinates from one genome version to another.
https://crossmap.readthedocs.io/en/latest/
Other
64 stars 23 forks source link

Unexpected mapping where a deletion turns into a SNP #47

Closed ivan-mh closed 2 years ago

ivan-mh commented 2 years ago

Dear Authors,

Thank you again for this great tool! I have been testing CrossMap with DBSNP v151 data and found one type of conversion for 715 DBSNP variants that doesn't look correct. Below, you can find one example:

The input VCF variant, GRCh37 coordinates: chr6 73311117 rs768155704 TA T . . RS=768155704

The output VCF variant in the CrossMap output, GRCh38 coordinates: chr6 72601390 rs768155704 A T . . RS=768155704

Expected VCF variant, GRCh38 coordinates: chr6 72601389 rs768155704 TA T . . RS=768155704

From the chain file, one can see that position chr6:73311117 (1-based) is not covered, whereas position chr6:73311118 (1-based) is covered and can be mapped. The Chain file contains: 6 73310721 73311116 + 6 72600994 72601389 + 6 73311117 73374388 + 6 72601389 72664660 +

Here an IGV screenshot depicting the situation:

image

For all those 715 DBSNP variants, CrossMap produces a SNP on GRCh38 for a deletion on GRCh37, which looks like a bug, would there be a plausible explanation for this behavior?

I used CrossMap (v0.6.1) and chain file: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz

Many Thanks, Ivan

liguowang commented 2 years ago

I think it makes sense to change the REF allele from "TA" (hg19) to "A" (hg38), as "T" was removed from hg38. please check v0.6.3

ivan-mh commented 2 years ago

Thank you for your reply. To my understanding, the variant above is a deletion of nucleotide A at position chr6:73311118 in GRCh37. According to the chain file, this position corresponds to chr6:72601390 in GRCh38. I would thus expect to also see a deletion on GRCh38, this is also expected according to the DBSNP database (https://www.ncbi.nlm.nih.gov/snp/?term=rs768155704). This is a corresponding IGV screenshot of GRCh38:

image

In the VCF variant representation of an indel, the first nucleotide is unchanged, i.e. T in REF and ALT corresponds to the nucleotide in the GRCh37 before the deleted position chr6:73311118. On GRCh38, I would expect the same, i.e. that the first nucleotide in REF and ALT corresponds to the nucleotide before the deleted position chr6:72601390 on GRCh38.

I have also tested it with v0.6.3 and the behavior is the same, although it resolved my previous issue (#44), for which I am thankful!

Many Thanks, Ivan

ivan-mh commented 2 years ago

Dear Authors,

Sorry for yet another message. I now run the full test with more than 600 million DBSNP v151 variants and extracted 715 variants that are in my opinion not mapped correctly, see below. Would it be possible to fix this issue?

Attached is a file containing all the 715 problematic variants, where INFO.GRCh37 is the input GRCh37 variant in format CHROM:POS:REF:ALT. Fields CHROM, POS, REF, ALT contain the CrossMap v0.6.3 output on GRCh38. The ID contains the expected mapping according to DBSNP on GRCh38 in format: s;RS;GENE;CHROM:POS:REF:ALT, where the ALT can contain multiple values for one RS identifier. I am also aware that not all expected DBSNP variant mappings are correct according to the chain file I am using, where I am using: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz

Best wishes, Ivan CrossMap_v063_issue_47.txt

liguowang commented 1 year ago

@ivan-mh , Sorry for the long delay. The issue (#47) is a little bit complicated. Could you please check if this beta version fix the problem? I hope this change will not mess up other variants (It does work for rs768155704). Thanks very much.

https://drive.google.com/file/d/16HsEz-CcxOgjuzN5-9Utx_Txn8D_vOAr/view?usp=sharing