liguowang / CrossMap

CrossMap is a python program to lift over genome coordinates from one genome version to another.
https://crossmap.readthedocs.io/en/latest/
Other
64 stars 23 forks source link

Does not convert ALT field in structural variant VCFs #25

Open ManavalanG opened 3 years ago

ManavalanG commented 3 years ago

Because Crossmap supports INFO/END values in SV VCFs since 0.4.3, it gives the impression that it supports lifting over of structural variants as well. However it currently doesn't convert values in ALT field for BND calls.

Here I use manta's test dataset for demo purposes:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HCC1954_BL  HCC1954
8   106641183   MantaBND:0:0:2:0:0:0:1  T   ]11:94987872]T  .   PASS    SVTYPE=BND;MATEID=MantaBND:0:0:2:0:0:0:0;SOMATIC;SOMATICSCORE=35;BND_DEPTH=26;MATE_BND_DEPTH=39 PR:SR   32,0:28,0   489,4:520,19
8   106641290   MantaBND:0:0:1:0:0:0:1  G   G]11:94975749]  .   PASS    SVTYPE=BND;MATEID=MantaBND:0:0:1:0:0:0:0;CIPOS=0,2;HOMLEN=2;HOMSEQ=AA;SOMATIC;SOMATICSCORE=36;BND_DEPTH=32;MATE_BND_DEPTH=32    PR:SR   43,0:38,0   722,9:463,15
11  95242583    MantaBND:0:0:1:0:0:0:0  G   G]8:107653520]  .   PASS    SVTYPE=BND;MATEID=MantaBND:0:0:1:0:0:0:1;CIPOS=0,2;HOMLEN=2;HOMSEQ=TT;SOMATIC;SOMATICSCORE=36;BND_DEPTH=32;MATE_BND_DEPTH=32    PR:SR   43,0:38,0   722,9:463,15
11  95242589    MantaBND:0:1:2:0:0:0:0  T   T]11:94987865]  .   PASS    SVTYPE=BND;MATEID=MantaBND:0:1:2:0:0:0:1;IMPRECISE;CIPOS=-156,156;SOMATIC;SOMATICSCORE=41;BND_DEPTH=32;MATE_BND_DEPTH=39    PR  38,0    161,13
11  95254701    MantaBND:0:1:2:0:0:0:1  A   A]11:94975753]  .   PASS    SVTYPE=BND;MATEID=MantaBND:0:1:2:0:0:0:0;IMPRECISE;CIPOS=-150,150;SOMATIC;SOMATICSCORE=41;BND_DEPTH=39;MATE_BND_DEPTH=32    PR  38,0    161,13
11  95254708    MantaBND:0:0:2:0:0:0:0  T   T[8:107653411[  .   PASS    SVTYPE=BND;MATEID=MantaBND:0:0:2:0:0:0:1;SOMATIC;SOMATICSCORE=35;BND_DEPTH=39;MATE_BND_DEPTH=26 PR:SR   32,0:28,0   489,4:520,19

Notice that coordinates in ALT column remain the same after lifting-over.

liguowang commented 3 years ago

You are right. Current Crossmap does not support SV format. We will consider adding this function in the future.

kojix2 commented 1 year ago

Hi! @liguowang Thanks for creating a useful tool. I found this tool in the book Bioinformatics Data Skills.

But, I also noticed that it crashes when converting VCF files.

A simplified example to make it easier to reproduce.

test.vcf

##fileformat=VCFv4.2
##FILTER=<ID=LOW_QUAL,Description="Low quality call">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  foo.bam
chr16   25925125    id2 N   N[chr22:20000000[   100.00  PASS    SVTYPE=BND  GT  ./.
chr16   21925125    id1 N   N[chr22:20000000[   100.00  PASS    SVTYPE=BND  GT  ./.

Convert T2T-CHM13 -> GRCh38

chain: chm13v2-grch38.chain target: GRCh38.p13.genome.fa

Command to execute

CrossMap.py vcf chm13v2-grch38.chain test.vcf GRCh38.p13.genome.fa out.vcf

Error messages.

2022-12-02 01:15:19 [INFO]  Read the chain file "chm13v2-grch38.chain" 
2022-12-02 01:15:21 [INFO]  Filter out variants [reference_allele == alternative_allele] ...
2022-12-02 01:15:21 [INFO]  Updating contig field ... 
2022-12-02 01:15:21 [INFO]  Lifting over ... 
Traceback (most recent call last):
  File "/home/kojix2/miniconda3/bin/CrossMap.py", line 319, in <module>
    crossmap_vcf_file(mapping = mapTree, infile= in_file, outfile = out_file, liftoverfile = chain_file, refgenome = genome_file, noCompAllele = args.no_comp_alleles, compress = args.compression, cstyle = args.cstyle)
  File "/home/kojix2/miniconda3/lib/python3.10/site-packages/cmmodule/mapvcf.py", line 183, in crossmap_vcf_file
    fields[4] = revcomp_DNA(fields[4], True)
  File "/home/kojix2/miniconda3/lib/python3.10/site-packages/cmmodule/utils.py", line 104, in revcomp_DNA
    return ''.join([complement[base] for base in reversed(seq)])
  File "/home/kojix2/miniconda3/lib/python3.10/site-packages/cmmodule/utils.py", line 104, in <listcomp>
    return ''.join([complement[base] for base in reversed(seq)])
KeyError: '['

The output file shows that the former record does not require revcomp_DNA() and is therefore output as is. The latter is shown to have failed during revcomp_DNA().

tail out.vcf
##contig=<ID=chrM,length=16569,assembly=GRCh38.p13.genome.fa>
##contig=<ID=chrX,length=156040895,assembly=GRCh38.p13.genome.fa>
##contig=<ID=chrY,length=57227415,assembly=GRCh38.p13.genome.fa>
##liftOverProgram=CrossMap,version=0.6.4,website=https://crossmap.readthedocs.io/en/latest/
##liftOverChainFile=chm13v2-grch38.chain
##originalFile=test.vcf
##targetRefGenome=GRCh38.p13.genome.fa
##liftOverDate=December02,2022
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  foo.bam
chr16   25648607    id2 A   N[chr22:20000000[   100.00  PASS    SVTYPE=BND  GT  ./.

I know it is a real hard task to maintain such a tool continuously for years. So I am not asking you to address this issue. I have tried the new release 0.6.5 and confirmed that this issue reproduces itself. Just reporting. Thanks.

liguowang commented 1 year ago

Please check the ALT field. It is not valid seeing from the example you provided: chr16 25648607 id2 T N[chr22:100000[ 100.00 PASS SVTYPE=BND GT ./.

kojix2 commented 1 year ago

Indeed, chm13v2 was the target and the entire command was wrong, so I have corrected it. Also, in GRCh38, chr22:1000000 is undeciphered, so I changed it to chr22:20000000 .

liguowang commented 1 year ago

CrossMap can only handle these characters in the ALT field (i.e., the 5th column): dict_keys(['A', 'C', 'G', 'T', 'Y', 'R', 'S', 'W', 'K', 'M', 'B', 'V', 'D', 'H', 'N', '.', '*'])

kojix2 commented 1 year ago

Yes. That's right. However, it should be possible to convert the ALT field for BNB. I was thinking about this the day before yesterday, but it seems difficult for my brain. It is especially complicated because it may also include an inserted sequence.

https://samtools.github.io/hts-specs/VCFv4.2.pdf

image image