mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
175 stars 16 forks source link

Keep singletons in the merged file #14

Closed mariesaitou closed 3 years ago

mariesaitou commented 3 years ago

Hi. I want to have a VCF file with multiple individuals from individually genotyped each sample. However, as I genotyped them separately, loci with 0|0 genotypes (in reality) in an individual are not recorded in the original vcf file.

Can I incorporate singletons, polymorphic regions with 0|0 genotypes in some individuals? (not all individuals, of course). I guess 0|0 locus (in reality) could be recorded as ./. (in merged VCF), but I want to convert (or re-genotype, restore) ./. into 0|0 if they are likely to be 0|0, not low-quality locus (not genotyped) in that individual? The individual vcf files have quality information (as below). I also have BAM files and ref.fa.

I thought "--mark_specific" does what I want to do, but the following command did not work as I imagined.

The command I used is: jasmine file_list=$vcf_list out_file=./jasmine_six.vcf genome_file=$genome_file bam_list=$bam_list threads=15 --output_genotypes --normalize_type --dup_to_ins --mark_specific vcftools --vcf jasmine_six.vcf --singletons --out test

It gave me an empty file. CHROM POS SINGLETON/DOUBLETON ALLELE INDV

I did not see any unique variants in my manual inspection as well, though I see some loci with "0/0" in every individual.

The output file is:

ssa01 181295 0_64 N TCCTGCTACTATAAATATCATAGCTGGTATAATAGCCGCT . PASS IMPRECISE;SVMETHOD=JASMINE;CHR2=ssa01;END=181295;STD_quant_start=13.509256;STD_quant_stop=12.509996;Kurtosis_quant_start=-1.994528;Kurtosis_quant_stop=-1.993620;SVTYPE=INS;RNAMES=c24ce49b-c47c-46a0-b195-7a5507b2a4f3,cf4506b6-c1c0-401f-8acd-e22bee0c5659;SUPTYPE=AL;SVLEN=39;STRANDS=+-;RE=2;REF_strand=1,0;AF=0.666667;CONFLICT=1;OLDTYPE=INS;IS_SPECIFIC=0;STARTVARIANCE=0.000000;ENDVARIANCE=0.000000;AVG_LEN=39.000000;AVG_START=181295.000000;AVG_END=181334.000000;SUPP_VEC_EXT=111111;IDLIST_EXT=64,64,64,64,64,64;SUPP_EXT=6;SUPP_VEC=111111;SUPP=6;IDLIST=64,64,64,64,64,64;REFINEDALT=. GT:IS:OT:OS:DV:DR 0/1:0:INS:.:2:1 0/1:0:INS:.:2:1 0/1:0:INS:.:2:1 0/1:0:INS:.:2:1 0/1:0:INS:.:2:1 0/1:0:INS:.:2:1 ssa01 181496 0_65 N . PASS IMPRECISE;SVMETHOD=JASMINE;CHR2=ssa01;END=216640;STD_quant_start=9.513149;STD_quant_stop=60.502066;Kurtosis_quant_start=-1.988981;Kurtosis_quant_stop=-1.999727;SVTYPE=INV;RNAMES=a09a4a83-767d-4474-8930-5f7893a57e1b,f653e6a1-65f2-48c6-8c6e-72dc654e9adf;SUPTYPE=SR;SVLEN=35144;STRANDS=--;RE=2;REF_strand=27,27;AF=0.0357143;OLDTYPE=INV;IS_SPECIFIC=0;STARTVARIANCE=0.000000;ENDVARIANCE=0.000000;AVG_LEN=35144.000000;AVG_START=181496.000000;AVG_END=216640.000000;SUPP_VEC_EXT=111111;IDLIST_EXT=65,65,65,65,65,65;SUPP_EXT=6;SUPP_VEC=111111;SUPP=6;IDLIST=65,65,65,65,65,65;REFINEDALT=. GT:IS:OT:OS:DV:DR 0/0:0:INV:.:2:54 0/0:0:INV:.:2:54 0/0:0:INV:.:2:54 0/0:0:INV:.:2:54 0/0:0:INV:.:2:54 0/0:0:INV:.:2:54

Thank you very much!

mkirsche commented 3 years ago

Hi,

Thanks for interest in using Jasmine! I have added an option --default_zero_genotype which will cause samples with no variant calls to be marked as 0|0 by default instead of ./. It has been added to the Gtithub and will make it into conda during the next release today or tomorrow. Hopefully that helps it to be more compatible with the vcftools singleton option.

As for your other question, unfortunately Jasmine does not support genotyping or making changes to existing genotypes, Its purpose is more to match up the existing variant calls to account for differences in breakpoints, and so it works on solely a presence/absence basis and relies on the SV caller's genotype annotations.

The mark_specific flag you mentioned is used for double thresholding, where we call SVs with a lenient threshold (in terms of the length and the number of reads) to get a sensitive callset and then annotate variants that meet a stricter threshold. The main purpose of this is to allow us to keep low-confidence calls in places where we have high-confidence calls in at least one other sample.

I hope that helps answer your questions!

Melanie

mariesaitou commented 3 years ago

Fantastic! Thank you very much!

mariesaitou commented 3 years ago

Hi, I tried to run --default_zero_genotype but could not find it in the manual. How can I add the option? I ran jasmine file_list=$vcf_list out_file=./jasmine_ten.vcf genome_file=$genome_file bam_list=$bam_list threads=15 --output_genotypes --normalize_type --run_iris --dup_to_ins --default_zero_genotype and got

Exception in thread "main" java.lang.Exception: samtools faidx did not produce an output: samtools faidx selected.fa chr1:132539-133059
    at GenomeQuery.genomeSubstring(GenomeQuery.java:60)
    at DuplicationsToInsertions.convertFile(DuplicationsToInsertions.java:76)
    at PipelineManager.convertDuplicationsToInsertions(PipelineManager.java:32)
    at Main.preprocess(Main.java:35)
    at Main.main(Main.java:17)

It can be an issue in my environment.
I appreciate your expertise.

mkirsche commented 3 years ago

Hi,

You are using the --default_zero_genotype flag correctly - sorry that it hasn't been added to the manual yet! In the meantime, the most updated list of parameter options can be viewed by running Jasmine with no parameters.

As for the error you're getting, it seems to be an issue when running samtools to convert duplications to insertions as part of the pre-processing. Could you please let me know what you get when you run samtools faidx selected.fa chr1:132539-133059 on its own outside of Jasmine? It's possible that the variant coordinates are outside of the genome bounds somehow, but it could also be a problem with the samtools installation.

Thanks! Melanie

mariesaitou commented 3 years ago

Thank you very much; I will play around with it. It worked in the previous version (no --genotype... option)

samtools faidx fasta.fa chr1:132539-133059

chr1:132539-133059 GGATTGATGCATCAGTACGTGA........

There seemed to be a sequence, but I also remember a similar incident in the downstream analysis; Inter-chromosomal "Translocations" could not be indexed by Tabix, because occasionally "Ending position is smaller than starting position"

I'll investigate it more. Thank you very much.