mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
175 stars 16 forks

Genotype columns #27

Closed: carlschreck closed this issue 2 years ago

carlschreck commented 2 years ago

Hello,

We are analyzing PacBio VCF files merged with Jasmine and were wondering how the genotype fields (columns 9 and 10, i.e. FORMAT and the sample column) are determined from the original files. Are the genotype columns taken from one of the original VCF files, or is there special handling, such as averaging the values?

I am asking because PacBio pbsv VCFs have SVTYPE=cnv entries, and we are interested in the copy number information in the genotype columns. From spot-checking a couple of examples, it seems that these columns are taken from the first matching VCF. For example, the following calls from two VCFs

Call from first VCF: chr9 95697579 pbsv.CNV.1043 G . PASS SVTYPE=cnv;END=95703612;SVLEN=6033 CN 7

Call from second VCF: chr9 95697579 pbsv.CNV.1128 G . PASS SVTYPE=cnv;END=95703634;SVLEN=6055 CN 4

are merged by Jasmine (along with 500 VCFs without matching calls) into a call whose copy number is identical to that of the first VCF (CN=7):

chr9 95697579 106_pbsv.CNV.1044 G . PASS SVTYPE=cnv;END=95703612;SVLEN=6033;STARTVARIANCE=0.000000;ENDVARIANCE=122.000000;AVG_LEN=6044.000000;AVG_START=95697579.000000;AVG_END=95703623.000000;SUPP_VEC_EXT=0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000;IDLIST_EXT=pbsv.CNV.1044,pbsv.CNV.1128;SUPP_EXT=2;SUPP_VEC=0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000;SUPP=2;SVMETHOD=JASMINE;IDLIST=pbsv.CNV.1044,pbsv.CNV.1128 CN 7
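A spot check like this can be scripted. The following is only a minimal sketch: the helper name `sample_value` is made up for illustration, it assumes single-sample records whose last two whitespace-separated fields are FORMAT and the sample column (as in the calls quoted above), and the INFO field of the merged call is shortened for readability.

```python
def sample_value(vcf_line, key):
    """Return the value of one FORMAT key (e.g. CN) from a single-sample VCF record.

    Assumes the FORMAT column and the sample column are the last two
    whitespace-separated fields of the line.
    """
    fields = vcf_line.split()
    format_keys = fields[-2].split(":")    # e.g. ["CN"] or ["GT", "AD", "DP"]
    sample_values = fields[-1].split(":")  # values in the same order as the keys
    return dict(zip(format_keys, sample_values)).get(key)


first = "chr9 95697579 pbsv.CNV.1043 G . PASS SVTYPE=cnv;END=95703612;SVLEN=6033 CN 7"
second = "chr9 95697579 pbsv.CNV.1128 G . PASS SVTYPE=cnv;END=95703634;SVLEN=6055 CN 4"
# INFO shortened here; the full merged record is quoted above.
merged = ("chr9 95697579 106_pbsv.CNV.1044 G . PASS "
          "SVTYPE=cnv;END=95703612;SVLEN=6033;SUPP=2;"
          "IDLIST=pbsv.CNV.1044,pbsv.CNV.1128 CN 7")

copy_numbers = {
    "first": sample_value(first, "CN"),
    "second": sample_value(second, "CN"),
    "merged": sample_value(merged, "CN"),
}
print(copy_numbers)

# The merged call carries the CN of the first input, not the second.
print(copy_numbers["merged"] == copy_numbers["first"])
```

Values are returned as strings, so they would need converting with `int()` or `float()` before any arithmetic on copy numbers.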

Thanks in advance for your response!

Best, Carl

mkirsche commented 2 years ago

Hi Carl,

Thank you for your interest in using Jasmine! As you noted, by default Jasmine simply copies the genotype columns from the first matching input VCF (i.e., no averaging or any other processing is done). However, if you would like to output genotypes for all samples in which a variant is present (again, just copied from the input VCFs), you can do so with the flag --output_genotypes.
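To make the difference concrete, here is a toy model of the two behaviours described above. It is only an illustration, not Jasmine's actual code: the function `merged_genotypes`, its `output_all` parameter, and the file names are hypothetical, and it says nothing about how the extra genotype columns are laid out in the real output.

```python
def merged_genotypes(matching_calls, output_all=False):
    """Toy model of which genotype values end up on a merged call.

    matching_calls: list of (input_name, genotype_value) pairs, in input
    order, for the inputs that contain the variant.
    output_all=False mimics the default (copy from the first match);
    output_all=True mimics reporting every matching sample's genotype.
    """
    if not matching_calls:
        raise ValueError("a merged call needs at least one supporting input")
    if output_all:
        return [genotype for _, genotype in matching_calls]
    return [matching_calls[0][1]]


# Hypothetical input names; the CN values are those from the example in this thread.
calls = [("first.vcf", "7"), ("second.vcf", "4")]

print(merged_genotypes(calls))                   # ['7']
print(merged_genotypes(calls, output_all=True))  # ['7', '4']
```

In both cases the values are copied unchanged from the inputs; nothing is averaged or recomputed.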

Hope that helps! Melanie

carlschreck commented 2 years ago

Melanie,

Thanks so much for your quick reply, this is very helpful! That's good to know about the --output_genotypes flag.

Best, Carl