mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

Altered genotype?/handling translocations #17

Closed mariesaitou closed 3 years ago

mariesaitou commented 3 years ago

Hi, thank you very much for developing such a nice tool. I have two questions regarding Jasmine. I used BAM -> Sniffles -> Jasmine to obtain a master VCF file with multiple individuals.

(1) Altered genotype? In the Jasmine output VCF, this locus appears heterozygous in all six individuals, but in IGV the BAM files look like del/del in one individual and ref/ref in another. How should we interpret this? Am I misinterpreting the VCF result? I first asked the Sniffles group, but they answered that Jasmine can alter genotypes. Do you have any idea how this happens?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  LLsal   Barry   tanner  Bond    Klopp   Brian
ssa01   139041651   0_24650 N   <DEL>   .   PASS    PRECISE;SVMETHOD=JASMINE;CHR2=ssa01;END=139043334;
0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2
#ssa01  139041651   0_24650 N   <DEL>   .   PASS    PRECISE;SVMETHOD=JASMINE;CHR2=ssa01;END=139043334;STD_quant_start=3.33542;STD_quant_stop=8.2991;Kurtosis_quant_start=1.31966;Kurtosis_quant_stop=2.30177;SVTYPE=DEL;RNAMES=4250ef10-dff4-4c5c-9c75-8ad085dcf7a9,43cd9070-cfef-48e3-af31-9ab39d2b93e7,49e571b9-1c82-405c-907e-f019f88f37de,8ba13242-89a8-46b0-8bbe-9ded59f71356,9674455c-0317-4798-9419-cb9e3c6db356,969e1d1e-d696-4d44-8c07-892fbfa14bd7,f6d1babd-83a1-4876-859b-0bff91cad0ce,f7a73e18-1dd5-4032-ac7e-55f61225e5c8;SUPTYPE=SR;SVLEN=-1683;STRANDS=+-;RE=8;REF_strand=1,1;AF=0.8;CONFLICT=0;OLDTYPE=DEL;IS_SPECIFIC=0;STARTVARIANCE=-4.000000;ENDVARIANCE=0.000000;AVG_LEN=-1683.000000;AVG_START=139041651.000000;AVG_END=139043334.000000;SUPP_VEC_EXT=111111;IDLIST_EXT=24650,24650,24650,24650,24650,24650;SUPP_EXT=6;SUPP_VEC=111111;SUPP=6;IDLIST=24650,24650,24650,24650,24650,24650;REFINEDALT=. GT:IS:OT:OS:DV:DR   0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2 0|1:0:DEL:.:8:2


(2) Translocations

Translocations cannot be indexed by tabix, because the reported "end position" is smaller than the start position. Currently, I remove translocations from the VCF file. If you have a good way of handling translocations with tabix, please let me know.

(base) [mariesai@cn-1 Jasmine]$ tabix -p vcf  jasmine_six_phased.head.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 1803791 < begin 28704346
ssa01   28704346    0_512026    .   <TRA>   .SVMETHOD=JASMINE;SVTYPE=TRA;CHR2=ssa29;END=1803791;

Thank you very much for your help!

mkirsche commented 3 years ago

Hi,

As for your first question, Jasmine does not alter genotypes; it just copies them from the input VCFs. What does the genotype look like in the individual sample's VCF file? If it's different than what Jasmine is reporting, that would indicate a bug in Jasmine, but if it is also heterozygous that indicates Sniffles is mis-genotyping it.

For your second question, I have not found a way to handle translocations when using tabix. I typically remove them as well.
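The removal step mentioned here could be sketched as follows. This is a minimal illustration, not part of Jasmine: the function names are hypothetical, and it assumes translocations are marked either by a `<TRA>` ALT allele or by `SVTYPE=TRA` in the INFO column, as in the record quoted above.

```python
def is_translocation(vcf_line: str) -> bool:
    """Return True if a VCF data line looks like a translocation record.

    Assumption: translocations carry ALT == "<TRA>" or SVTYPE=TRA in INFO.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False  # not a complete VCF data line
    alt, info = fields[4], fields[7]
    svtype = None
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "SVTYPE":
            svtype = value
    return alt == "<TRA>" or svtype == "TRA"


def drop_translocations(lines):
    """Yield header lines unchanged and data lines that are not translocations."""
    for line in lines:
        if line.startswith("#") or not is_translocation(line):
            yield line
```

The filtered lines can then be written to a new file, compressed with bgzip, and indexed with `tabix -p vcf` as in the command shown earlier.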

Best, Melanie

mariesaitou commented 3 years ago

Hi, thank you very much for your comment!

Attached are the original and merged VCF files (GitHub rejected the VCF files, so I converted them to txt).

sixindivi.head.txt Bond.head.txt Brian.head.txt Klopp.head.txt tanner.head.txt Barry.head.txt LLsal.head.txt

For example, at ssa01:132539, the genotype is 1/1 in the original tanner file, but the same locus is 1|0 in the merged VCF. Perhaps Jasmine thought 1|0 was more plausible? Please correct me if I am wrong; it is likely that I am misinterpreting something or making a stupid mistake.

The command I ran was:

jasmine --output_genotypes --normalize_type --dup_to_ins --run_iris iris_args=--keep_long_variants --default_zero_genotype

So, "--ignore_strand" may be related. I think I used the Jasmine version from April or May, so it is possibly slightly older than the current version.

Now I am running:

jasmine --output_genotypes --normalize_type --dup_to_ins --run_iris iris_args=--keep_long_variants --default_zero_genotype --ignore_strand

with Jasmine v1.11.

mkirsche commented 3 years ago

Hi,

Based on the IDLIST field, it looks like the merged entry you are looking at includes the entries with ID 7 from all six input VCFs. This doesn't really make sense based on what those entries look like, so I'm thinking that the merged VCF is somehow the Brian VCF merged with itself six times, since it's the only sample with a heterozygous duplication with an ID of 7 at or near that position. What are the contents of the file you are passing to Jasmine as the file_list parameter?

Relatedly, when merging SVs, the merged variant's position is taken from the first input file, and does not necessarily match the positions in the other entries that got merged with it. The IDLIST INFO field can help with identifying which entries were merged to form each output variant.
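As an illustration of reading IDLIST, the following sketch pairs each supporting sample with the ID of the input entry that was merged. The helper names are hypothetical, and it assumes that the IDs in IDLIST are listed in the same order as the "1" bits in SUPP_VEC, and that the sample order matches the order of the files in file_list.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a key -> value dict (flags map to "")."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        out[key] = value
    return out


def merged_sources(info: str, sample_names) -> dict:
    """Map each supporting sample to the input variant ID merged into this record.

    Assumption: IDLIST order follows the set bits of SUPP_VEC.
    """
    fields = parse_info(info)
    supp_vec = fields.get("SUPP_VEC", "")
    idlist = fields.get("IDLIST", "")
    ids = idlist.split(",") if idlist else []
    supporting = [name for name, bit in zip(sample_names, supp_vec) if bit == "1"]
    return dict(zip(supporting, ids))
```

Applied to the record quoted at the top of the thread, with the samples LLsal, Barry, tanner, Bond, Klopp, and Brian, every sample maps to ID 24650. Identical IDs paired with identical genotypes across all samples can be a hint that the inputs were not actually distinct.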

Melanie

mariesaitou commented 3 years ago

Oh, well... it is possible. Let me see...

The input files are

/mnt/SCRATCH/princesstest/result420_to_simon/LLsal/result/minimap.SVs.phased.vcf
/mnt/SCRATCH/princesstest/result420_to_simon/Barry/result/minimap.SVs.phased.vcf
/mnt/SCRATCH/princesstest/result420_to_simon/tanner/result/minimap.SVs.phased.vcf
/mnt/SCRATCH/princesstest/result420_to_simon/Bond/result/minimap.SVs.phased.vcf
/mnt/SCRATCH/princesstest/result420_to_simon/Klopp/result/minimap.SVs.phased.vcf
/mnt/SCRATCH/princesstest/result420_to_simon/Brian/result/minimap.SVs.phased.vcf

This is the head of output file:

Number of duplications converted to insertions: 5041 out of 506922 total variants
Number of duplications converted to insertions: 2912 out of 360764 total variants
Number of duplications converted to insertions: 2972 out of 475576 total variants
Number of duplications converted to insertions: 5370 out of 555795 total variants
Number of duplications converted to insertions: 3576 out of 600744 total variants
Number of duplications converted to insertions: 4728 out of 528227 total variants
/net/cn-1/mnt/SCRATCH/princesstest/result420/Jasmine/output/minimap.SVs.phased_dupToIns_markedSpec.vcf has 528227 variants
/net/cn-1/mnt/SCRATCH/princesstest/result420/Jasmine/output/minimap.SVs.phased_dupToIns_markedSpec.vcf has 528227 variants
/net/cn-1/mnt/SCRATCH/princesstest/result420/Jasmine/output/minimap.SVs.phased_dupToIns_markedSpec.vcf has 528227 variants
/net/cn-1/mnt/SCRATCH/princesstest/result420/Jasmine/output/minimap.SVs.phased_dupToIns_markedSpec.vcf has 528227 variants
/net/cn-1/mnt/SCRATCH/princesstest/result420/Jasmine/output/minimap.SVs.phased_dupToIns_markedSpec.vcf has 528227 variants
/net/cn-1/mnt/SCRATCH/princesstest/result420/Jasmine/output/minimap.SVs.phased_dupToIns_markedSpec.vcf has 528227 variants
...
Number of sets with multiple variants: 528227
Number of insertions converted back to duplications: 4909 out of 528227 total variants

So, seemingly, the first six lines of the output imply that Jasmine read six different files. But, just after that, did Jasmine start only analyzing Brian (the last sample)?

mkirsche commented 3 years ago

Hi,

Thanks for the detailed information! It is indeed a bug in Jasmine - in the preprocessing step of converting duplications to insertions, it writes the updated files to its output directory but preserves the old filenames. Since all of the base filenames are the same, it writes them all to the same file and they overwrite each other. I'll work on a fix for it, but in the meantime, it should work correctly if you rename the input files to be different from each other (e.g., LLsal_minimap.SVs.phased.vcf and so on).
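A quick pre-flight check for this situation could look like the sketch below. It is not part of Jasmine; the function names are hypothetical, and the renaming suggestion assumes the directory layout from this thread, where the sample name is the directory two levels above the VCF file.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_basename_collisions(paths):
    """Group input paths by base filename and return only the groups with clashes."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).name].append(path)
    return {name: group for name, group in groups.items() if len(group) > 1}


def suggest_unique_names(paths, levels_up=2):
    """Suggest '<sample>_<basename>' for each path.

    Assumption: the sample name is the directory `levels_up` levels above
    the file, e.g. .../LLsal/result/minimap.SVs.phased.vcf -> LLsal.
    """
    suggestions = {}
    for path in paths:
        p = PurePosixPath(path)
        sample = p.parents[levels_up - 1].name
        suggestions[path] = f"{sample}_{p.name}"
    return suggestions
```

Running `find_basename_collisions` on the six paths listed above would report a single clash on minimap.SVs.phased.vcf, and `suggest_unique_names` would propose names such as LLsal_minimap.SVs.phased.vcf, matching the workaround described here.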

Sorry for the trouble, and I hope that fixes it! Melanie

mariesaitou commented 3 years ago

Oh, that makes sense!!! Thank you very much!

mkirsche commented 3 years ago

The issue has been fixed as of the latest commit: https://github.com/mkirsche/Jasmine/commit/b0ca6a3feb82993cc4128efe477cf71bc47b1885

It will probably take a few days for the new release v1.1.1 to make it to bioconda though, so for now you could either rename the files as I mentioned before, or download Jasmine from source.

Thanks again for pointing this out! Melanie

mariesaitou commented 3 years ago

Sorry again. It looks like the new Jasmine does not recognize Samtools (Samtools is at least installed in my base environment). Is this a new phenomenon, or is it specific to my environment?

(jasmine) [mariesai@login jasmine629]$ jasmine file_list=vcf.list out_file=/jasmine11.629.vcf genome_file=CHR_selected.fa bam_list=bam.list threads=60 --output_genotypes --normalize_type --dup_to_ins --run_iris iris_args=--keep_long_variants --default_zero_genotype --ignore_strand

Exception in thread "main" java.io.IOException: Cannot run program "samtools": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at java.base/java.lang.Runtime.exec(Runtime.java:592)
    at java.base/java.lang.Runtime.exec(Runtime.java:416)
    at java.base/java.lang.Runtime.exec(Runtime.java:313)
    at GenomeQuery.testSamtoolsInstalled(GenomeQuery.java:32)
    at GenomeQuery.<init>(GenomeQuery.java:17)
    at DuplicationsToInsertions.convertFile(DuplicationsToInsertions.java:40)
    at PipelineManager.convertDuplicationsToInsertions(PipelineManager.java:41)
    at Main.preprocess(Main.java:35)
    at Main.main(Main.java:17)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 10 more

mkirsche commented 3 years ago

Thank you for bringing this to my attention! It looks like I hadn't added samtools or Iris as dependencies for Jasmine in bioconda. That's in the process of being added now, but in the meantime running conda install irissv while in that environment should install everything you need.
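The stack trace above shows the `samtools` executable not being found when Jasmine tries to launch it, so a tool installed in a different conda environment is not necessarily visible from the active one. A minimal sketch for checking this from inside the active environment, using only the Python standard library (the function name is hypothetical, and the tool list is an assumption based on this thread):

```python
import shutil


def missing_tools(tools=("samtools",), which=shutil.which):
    """Return the names in `tools` that cannot be found on the current PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If `samtools` is reported as missing while the jasmine environment is active, installing the dependencies into that environment, as suggested here, should resolve the error.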

Melanie