mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

Jasmine Merging of Samples called by Manta and Smoove followed by multi-sample merging. #61

Open Lukecassar21 opened 3 months ago

Lukecassar21 commented 3 months ago

I am currently trying to call a cohort of a bit over 1000 samples using Manta and Smoove as my SV-callers. I wish to know what the best approach would be to merge my structural variant calls using Jasmine.

With SURVIVOR, the merging is relatively straightforward. First, merge the different caller outputs per sample; the merged VCF carries each caller's genotype for a variant, showing whether that caller detected it and, if so, what genotype it assigned. Then merge these caller-merged VCF files across samples. The result is a VCF file with one genotype per sample (usually the "best" genotype from the callers in the previous VCF files, with 1/1 preferred, followed by 0/1).
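To make the selection rule concrete, here is a minimal Python sketch of what I mean by "best" genotype. It is only an illustration of the preference described above, not SURVIVOR's actual implementation; the function name `best_genotype` is made up, and the ranking of 0/0 above a missing call is my own assumption.

```python
# Illustrative ranking only: 1/1 is preferred, then 0/1, as described above.
# Placing 0/0 above a missing call (./.) is an assumption.
GENOTYPE_RANK = {"1/1": 3, "0/1": 2, "1/0": 2, "0/0": 1, "./.": 0}


def best_genotype(genotypes):
    """Return the highest-ranked genotype among the per-caller calls."""
    # Treat phased (|) and unphased (/) genotypes the same way.
    normalized = [gt.replace("|", "/") for gt in genotypes]
    # Unknown genotype strings get the lowest rank; an empty list yields "./.".
    return max(normalized, key=lambda gt: GENOTYPE_RANK.get(gt, 0), default="./.")


# Example: one caller reports 0/1, the other 1/1.
print(best_genotype(["0/1", "1/1"]))
```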

In Jasmine, I've been trying to replicate this approach by first merging the caller outputs per sample with --allow_intrasample, so that overlapping calls from different callers can be merged, and without --output_genotypes, because enabling it at this stage causes problems later on, as I explain below.

Following the intra-sample merging, I merge across samples with --allow_intrasample disabled and --output_genotypes enabled. The result looks similar to what I would expect: one entry per sample, where each sample's genotype is taken from the single genotype present in its intra-sample merged VCF file. I disabled --output_genotypes in the intra-sample merge because, whenever I enabled it and proceeded to the inter-sample merge, each sample appeared several times in the output VCF as separate samples (for example 0_Sample1, 1_Sample1, 0_Sample2, 1_Sample2), which muddies my data.
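For reference, the two stages I describe can be sketched as the following Python helper, which only builds the command lines and does not run them. The file names and the `merge/` directory are hypothetical, and I am assuming Jasmine's usual `file_list=` and `out_file=` arguments, where each file list is a text file with one VCF path per line; the two flags are the ones named above.

```python
from pathlib import Path


def jasmine_cmd(file_list, out_file, extra_flags=()):
    """Build a Jasmine command as an argument list (it is not executed here)."""
    return [
        "jasmine",
        f"file_list={Path(file_list).as_posix()}",
        f"out_file={Path(out_file).as_posix()}",
        *extra_flags,
    ]


def two_stage_commands(samples, workdir="merge"):
    """Stage 1: one intra-sample merge per sample. Stage 2: one cohort merge."""
    work = Path(workdir)
    commands = []
    for sample in samples:
        # Hypothetical per-sample list holding the Manta and Smoove VCF paths.
        commands.append(jasmine_cmd(
            work / f"{sample}.callers.txt",
            work / f"{sample}.intra.vcf",
            ["--allow_intrasample"],  # no --output_genotypes at this stage
        ))
    # Hypothetical cohort list holding every per-sample intra-merged VCF path.
    commands.append(jasmine_cmd(
        work / "cohort.txt",
        work / "cohort.vcf",
        ["--output_genotypes"],  # --allow_intrasample is left off here
    ))
    return commands


for cmd in two_stage_commands(["Sample1", "Sample2"]):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once the file lists exist; whether this two-stage layout is the right way to use Jasmine is exactly what I am asking.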

I also tried merging everything in a single step, intra- and inter-sample at once, and got a similar result. I understand this was recommended in #16, but I want my final merged VCF file to contain each sample only once, with the best genotype selected for that sample from among the caller outputs. This is to make downstream analysis easier, as duplicated sample names may complicate things down the line.
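As a rough illustration of the output I am after, the sketch below collapses prefixed sample columns such as 0_Sample1 and 1_Sample1 into a single Sample1 entry for one variant, keeping the preferred genotype (1/1, then 0/1). This is a hypothetical post-processing step written in plain Python, not a Jasmine feature; the function name, the numeric-prefix pattern, and the ranking of 0/0 above a missing call are all assumptions.

```python
import re
from collections import defaultdict

# Illustrative ranking: 1/1 preferred, then 0/1; 0/0 above missing is an assumption.
GENOTYPE_RANK = {"1/1": 3, "0/1": 2, "1/0": 2, "0/0": 1, "./.": 0}

# Assumed column pattern: a numeric prefix and underscore before the sample name.
# A sample name that itself starts with digits and "_" would be stripped too.
PREFIXED = re.compile(r"^\d+_(.+)$")


def collapse_samples(genotypes_by_column):
    """Map {column name: genotype} for one variant to {sample name: best genotype}."""
    grouped = defaultdict(list)
    for column, genotype in genotypes_by_column.items():
        match = PREFIXED.match(column)
        sample = match.group(1) if match else column
        grouped[sample].append(genotype)
    return {
        sample: max(gts, key=lambda gt: GENOTYPE_RANK.get(gt, 0))
        for sample, gts in grouped.items()
    }


row = {"0_Sample1": "0/1", "1_Sample1": "1/1", "0_Sample2": "./.", "1_Sample2": "0/0"}
print(collapse_samples(row))
```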

So I'm not sure what the best way to go about this is. I'd like to use Jasmine rather than SURVIVOR because it preserves the variant information from the original caller VCF files (such as the alternate allele sequence given by Manta) and can also preserve 0/0 calls, unlike SURVIVOR.