documentation help - Githubissues

brentp commented 3 years ago

Hi, thanks for developing this tool. I have several issues which I think may just be documentation issues:

It seems the tool does not accept (b)gzipped VCF's. This is unusual, so it would be nice if the error message was clearer (or if gzip was supported.
It's unclear how to use the output. from #7, it seems maybe one should specify: --output_genotypes ? I would like to merge SVs, then re-genotype all samples at the merged SVs. do I need --output_genotypes for that? Or should I just sort the output from jasmin without --output_genotypes ?
(this seems to be a real bug, not doc issue). the output includes a PRECISE flag that is not included in the header (though IMPRECISE is). This causes problems with BCF output.

thanks for any help. -Brent

mkirsche commented 3 years ago

Hi Brent,

Thanks for your interest in Jasmine!

For (1), Jasmine does not currently accept zipped VCFs. I have updated some of the error messages and the usage menu to help make this more clear.

For (2), there are two main ways to use the output:

The SUPP_VEC and IDLIST fields give the presence/absence of each merged variant across all samples, and give the IDs in the original VCFs of the variants corresponding to that output. This is useful if you want to access the original VCF entries, but does not give genotype information without tracing back to the original files.
If you are interested in the genotype in each individual, you can use --output_genotypes, which simply copies the genotypes from the original unmerged variant calls and writes them to the merged file. It does not change the genotype information or attempt any kind of re-genotyping.

For (3), that is indeed a bug and causes issues when the input VCFs don't already include those flags. I have updated the code to add them to the output VCF header in that case. This and the other change are in release 1.1.2.

Please let me know if you have any more questions! Melanie

brentp commented 3 years ago

Excellent! Thanks for rapid response.

brentp commented 3 years ago

Another nitpick is that it would be nice if AVG_START et al were Float instead of String and had a bit less precision.

brentp commented 3 years ago

Also, the new message is a warning and the Jasmine terminates, but with a 0 exit code. Would be nice if it had non-zero exit code so pipelines can detect the error sooner.

mkirsche commented 3 years ago

Thanks, I really appreciate the suggestions and completely agree! I'll incorporate those into the next release in the coming weeks!

Melanie

jejonesu commented 3 years ago

Hi Melanie,

I had a question regarding —output_genotypes. I would like to look at the read depth for the area of the variant calls to ensure that the samples that don’t have variants called are actually just reference calls. Currently those values are NA, since —output_genotypes is just copying down the genotypes from the original vcfs. Do you have any plans to implement a way to reference the bam files for the samples to be able to see those read depths and differentiate between no calls and reference calls?

Thank you! Jennifer

mkirsche commented 3 years ago

Hi Jennifer,

There are not currently any plans for Jasmine to extract read information from BAM files or attempt any kind of re-genotyping.

However, we do have a strategy, double thresholding, which helps in the case you are describing where you want to distinguish absent calls as either reference calls or not-enough-information:

Call variants with pretty lenient parameters (when calling SVs with sniffles, we use a minimum read support of 2 and a minimum length of 20). The goal is to capture essentially all true SVs with high sensitivity at the risk of making some false positive calls.
Pass the --mark_specific, spec_reads, and spec_len parameters to Jasmine, which causes it to label high-confidence calls which fit a stricter set of requirements (we typically use a length threshold of at least 30 and a read support threshold of either 10 or 25% of the average coverage available, whichever is smaller). The variants which get marked with the INFO field IS_SPECIFIC=1 represent a set of confident SV calls with a focus on high specificity.
After merging these marked calls, each variant in Jasmine's merged population will be marked with IS_SPECIFIC=1 if at least one of the samples had a high-confidence call there, and marked with IS_SPECIFIC=0 otherwise.
Filter out any calls marked with IS_SPECIFIC=0 from the final callset because that means there was no high-confidence support for that call in any sample, indicating a likely false positive.
For any remaining calls, if they are labelled as absent from a given sample, there wasn't even weaker support for the variant being present, so that gives you more confidence that it is a reference call.

However, this approach is largely specialized for variants calls from Sniffles, and that's because that caller provides the number of reads supporting each variant call as an INFO field in the produced VCF file. I imagine a similar strategy would work for other callers as well depending on what information they report, but it may require writing a script on your end to add the IS_SPECIFIC INFO fields to the per-sample VCFs prior to running Jasmine.

I hope that helps, and please let me know if you have any further questions! Melanie

mkirsche / Jasmine

documentation help #19