mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
174 stars 16 forks source link

documentation help #19

Closed brentp closed 3 years ago

brentp commented 3 years ago

Hi, thanks for developing this tool. I have several issues which I think may just be documentation issues:

  1. It seems the tool does not accept (b)gzipped VCF's. This is unusual, so it would be nice if the error message was clearer (or if gzip was supported.

  2. It's unclear how to use the output. from #7, it seems maybe one should specify: --output_genotypes ? I would like to merge SVs, then re-genotype all samples at the merged SVs. do I need --output_genotypes for that? Or should I just sort the output from jasmin without --output_genotypes ?

  3. (this seems to be a real bug, not doc issue). the output includes a PRECISE flag that is not included in the header (though IMPRECISE is). This causes problems with BCF output.

thanks for any help. -Brent

mkirsche commented 3 years ago

Hi Brent,

Thanks for your interest in Jasmine!

For (1), Jasmine does not currently accept zipped VCFs. I have updated some of the error messages and the usage menu to help make this more clear.

For (2), there are two main ways to use the output:

For (3), that is indeed a bug and causes issues when the input VCFs don't already include those flags. I have updated the code to add them to the output VCF header in that case. This and the other change are in release 1.1.2.

Please let me know if you have any more questions! Melanie

brentp commented 3 years ago

Excellent! Thanks for rapid response.

brentp commented 3 years ago

Another nitpick is that it would be nice if AVG_START et al were Float instead of String and had a bit less precision.

brentp commented 3 years ago

Also, the new message is a warning and the Jasmine terminates, but with a 0 exit code. Would be nice if it had non-zero exit code so pipelines can detect the error sooner.

mkirsche commented 3 years ago

Thanks, I really appreciate the suggestions and completely agree! I'll incorporate those into the next release in the coming weeks!

Melanie

jejonesu commented 3 years ago

Hi Melanie,

I had a question regarding —output_genotypes. I would like to look at the read depth for the area of the variant calls to ensure that the samples that don’t have variants called are actually just reference calls. Currently those values are NA, since —output_genotypes is just copying down the genotypes from the original vcfs. Do you have any plans to implement a way to reference the bam files for the samples to be able to see those read depths and differentiate between no calls and reference calls?

Thank you! Jennifer

mkirsche commented 3 years ago

Hi Jennifer,

There are not currently any plans for Jasmine to extract read information from BAM files or attempt any kind of re-genotyping.

However, we do have a strategy, double thresholding, which helps in the case you are describing where you want to distinguish absent calls as either reference calls or not-enough-information:

  1. Call variants with pretty lenient parameters (when calling SVs with sniffles, we use a minimum read support of 2 and a minimum length of 20). The goal is to capture essentially all true SVs with high sensitivity at the risk of making some false positive calls.
  2. Pass the --mark_specific, spec_reads, and spec_len parameters to Jasmine, which causes it to label high-confidence calls which fit a stricter set of requirements (we typically use a length threshold of at least 30 and a read support threshold of either 10 or 25% of the average coverage available, whichever is smaller). The variants which get marked with the INFO field IS_SPECIFIC=1 represent a set of confident SV calls with a focus on high specificity.
  3. After merging these marked calls, each variant in Jasmine's merged population will be marked with IS_SPECIFIC=1 if at least one of the samples had a high-confidence call there, and marked with IS_SPECIFIC=0 otherwise.
  4. Filter out any calls marked with IS_SPECIFIC=0 from the final callset because that means there was no high-confidence support for that call in any sample, indicating a likely false positive.
  5. For any remaining calls, if they are labelled as absent from a given sample, there wasn't even weaker support for the variant being present, so that gives you more confidence that it is a reference call.

However, this approach is largely specialized for variants calls from Sniffles, and that's because that caller provides the number of reads supporting each variant call as an INFO field in the produced VCF file. I imagine a similar strategy would work for other callers as well depending on what information they report, but it may require writing a script on your end to add the IS_SPECIFIC INFO fields to the per-sample VCFs prior to running Jasmine.

I hope that helps, and please let me know if you have any further questions! Melanie