ohsu-cedar-comp-hub / WGS-nextflow-workflow

Apache License 2.0
3 stars 1 forks source link

test files: chr19 and chr20 truth vcf liftover #35

Closed rlancaster96 closed 5 months ago

rlancaster96 commented 5 months ago

Liftover fails for truth VCF file for only chromosome 19 and 20 from test files available on Synapse (https://www.synapse.org/#!Synapse:syn2335184)

Command:

gatk LiftoverVcf \
     I=/home/groups/CEDAR/lancasru/WGS_COH_NF/make_test_truth/synapse_files/chr19.truth.vcf \
     O=chr19.truth.lifted_over.vcf \
     CHAIN=GRCh37_to_GRCh38.chain.gz \
     REJECT=rejected_variants_chr19.vcf \
     R=/home/groups/CEDAR/goldmael/projects/wgs_test_files/references/GRCh38.d1.vd1.fa

Expected behavior: Liftover variant information from hg19 reference to hg38 reference.

Actual behavior: No variants lifted over.

INFO    2024-04-25 13:59:01 LiftoverVcf Loading up the target reference genome.
INFO    2024-04-25 13:59:15 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
INFO    2024-04-25 13:59:15 LiftoverVcf Processed 1739 variants.
INFO    2024-04-25 13:59:15 LiftoverVcf 1739 variants failed to liftover.
INFO    2024-04-25 13:59:15 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2024-04-25 13:59:15 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.
INFO    2024-04-25 13:59:15 LiftoverVcf liftover success by source contig:
INFO    2024-04-25 13:59:15 LiftoverVcf 19: 0 / 1739 (0.0000%)
INFO    2024-04-25 13:59:15 LiftoverVcf lifted variants by target contig:
INFO    2024-04-25 13:59:15 LiftoverVcf no successfully lifted variants
WARNING 2024-04-25 13:59:15 LiftoverVcf 0 variants with a swapped REF/ALT were identified, but were not recovered.  See RECOVER_SWAPPED_REF_ALT and associated caveats.

Issue posted online? A similar issue was posted on a gatk forum. The solution to use the GRch37 to 38 chain produced the following:

INFO    2024-04-25 14:13:24 LiftoverVcf Loading up the target reference genome.
INFO    2024-04-25 14:13:36 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
ERROR   2024-04-25 14:13:36 LiftoverVcf Encountered a contig, 19 that is not part of the target reference.

Environment The Genome Analysis Toolkit (GATK) v4.3.0.0 HTSJDK Version: 3.0.1 Picard Version: 2.27.5

* Edit: I went back and realized that liftover failed for chr20 as well, so the failure isn't because of a difference in files.

elisabethgoldman commented 5 months ago

Did you end up using a GATK tool to convert the coordinates instead of the UCSC's liftOver tool? Seems the latter has some known issues when there are multiple patches between major genome updates.

rlancaster96 commented 5 months ago

Did you end up using a GATK tool to convert the coordinates instead of the UCSC's liftOver tool? Seems the latter has some known issues when there are multiple patches between major genome updates.

yes! that makes sense. I used gatk's version since I couldn't figure out how to get UCSC liftOver to work with VCFs, and I was having trouble converting VCF to BED. I can try again though. I also saw this note from UCSC Note: It is not recommeneded to use LiftOver to convert SNPs between assemblies, and more information about how to convert SNPs between assemblies can be found on the following FAQ entry.

elisabethgoldman commented 5 months ago

Did you end up using a GATK tool to convert the coordinates instead of the UCSC's liftOver tool? Seems the latter has some known issues when there are multiple patches between major genome updates.

yes! that makes sense. I used gatk's version since I couldn't figure out how to get UCSC liftOver to work with VCFs, and I was having trouble converting VCF to BED. I can try again though. I also saw this note from UCSC Note: It is not recommeneded to use LiftOver to convert SNPs between assemblies, and more information about how to convert SNPs between assemblies can be found on the following FAQ entry.

GATK's version sounds like the right choice. Will check with KE on the necessity of specifically liftingOver and converting VCF to bed this way and report back.

elisabethgoldman commented 5 months ago

Did you end up using a GATK tool to convert the coordinates instead of the UCSC's liftOver tool? Seems the latter has some known issues when there are multiple patches between major genome updates.

yes! that makes sense. I used gatk's version since I couldn't figure out how to get UCSC liftOver to work with VCFs, and I was having trouble converting VCF to BED. I can try again though. I also saw this note from UCSC Note: It is not recommeneded to use LiftOver to convert SNPs between assemblies, and more information about how to convert SNPs between assemblies can be found on the following FAQ entry.

GATK's version sounds like the right choice. Will check with KE on the necessity of specifically liftingOver and converting VCF to bed this way and report back.

It seems we'll have the same SNP-related issue that UCSC explicitly references, regardless of which program is used. @rlancaster96 can you try the UCSC liftOver tool again and see if you can get it to work with VCFs? There are also a few other tools mentioned in this recent paper that could be alternate options: https://www.mdpi.com/2073-4425/14/10/1875 (@kellrott if you have thoughts for/against other tools like those in the paper, please share)

rlancaster96 commented 5 months ago

@elisabethgoldman I tried again with UCSC liftOver, and unfortunately it does not take VCFs

> ./liftOver chr19.truth.vcf hg19ToHg38.over.chain.gz chr19_liftoverhg38.truth.vcf unMapped
Reading liftover chains
Mapping coordinates
Expecting integer field 3 line 106 of /home/groups/CEDAR/lancasru/WGS_COH_NF/make_test_truth/synapse_files/chr19.truth.vcf, got .

I can look into tools to convert VCF to BED while preserving all of the information in the VCF, and I'll also look into the tools used in the paper!

rlancaster96 commented 5 months ago

I successfully created the lifted over VCFs using Ensembl's CrossMap (at least, it produced files with entries as expected). Not sure how to test how accurate it is, or if it works for our needs. chr19 - 2 unmapped after liftover (1738/1740 variants mapped) chr20 - 0 unmapped after liftover (1528/1528 variants mapped)

If needed I could still look into converting the VCF files to BED format to use UCSC's liftover, and for that I would have to either: 1) explore bedops bed to vcf as an option or 2) manually convert vcf to bed using a python script, since we have indels that require some end position adjustment. Those are both totally options and if still needed I can look into!