malariagen / ag1000g-bakeoff

Variant calling challenges using sequence data from the Anopheles gambiae 1000 Genomes Project (Ag1000G).
MIT License

DeepVariant challenge 1 submission #2

Open alimanfoo opened 5 years ago

alimanfoo commented 5 years ago

Submission of a DeepVariant callset for challenge 1.

alimanfoo commented 5 years ago

Hi @pichuan, @rpoplin, @cmclean, this issue is intended for sharing information and discussion of the DeepVariant callset for the single cross (29-2).

We're setting up this repo to hold information about the data available from Ag1000G for benchmarking and comparing variant callers. We're planning to structure it around several "challenges", the first being to generate a SNP and/or small indel callset for cross 29-2. We'll put some documentation up shortly, and ultimately we'd like to have a structured way to submit one or more callsets for challenge 1, but for now we can just discuss and share links and information via this issue. Although we're going to describe it in terms of "challenges", I think we'd like to keep it as informal as possible, and aim to provide a resource which teams like yourselves can use to evaluate, explore and compare methodologies, rather than being focused on finding a "winner". I.e., the aim is to learn and share knowledge, rather than compete.

For now, if you have one (or more) multi-sample VCFs and you'd like to work together on some exploratory analysis, please feel free to post links here to file locations in GCS.

I'll also copy below some of the email discussion regarding options for merging gVCFs to surface that discussion.

Hope that all sounds good.

cc @hardingnj, @tnguyensanger, @podpearson, @roamato

alimanfoo commented 5 years ago

Some discussion regarding ways to merge gVCFs...

@pichuan:

I will have a multi-sample VCF in addition to per-sample gVCFs. I'm still not sure what the best way to merge is. So far I've been using GLnexus to do the merging, but I'm still not sure what the best GLnexus configuration is for DeepVariant running on these samples. Here are a few configurations I've tried first:

[image: GLnexus configurations tried]

Follow-up comment from me:

FWIW I would have thought that a decent baseline for how to merge would just be to take the union of all variant alleles discovered in any individual. The most important thing would be to merge the QUAL values in a sensible way, so that we can analyse how well calibrated QUAL appears to be as a predictor of Mendelian error.
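
To make the baseline idea concrete, here is a minimal sketch in Python of merging calls at a single site by taking the union of ALT alleles, plus a simple Mendelian error check for a trio. Everything here is illustrative and assumed rather than taken from GLnexus or DeepVariant: the record layout (dicts with `ref`, `alts`, `qual`), the function names `merge_site` and `is_mendelian_error`, and in particular the QUAL rule (taking the maximum QUAL among samples that carry a variant allele), which is just one plausible choice among several.

```python
def merge_site(records):
    """Merge per-sample calls at one site (hypothetical record layout).

    Each record is a dict with keys:
      'ref'  : reference allele (str)
      'alts' : list of ALT alleles called in that sample (may be empty)
      'qual' : Phred-scaled QUAL for that sample's call (float)
    """
    refs = {r["ref"] for r in records}
    if len(refs) != 1:
        # Real gVCF merging would need to normalise alleles first;
        # this sketch simply refuses mismatched REF alleles.
        raise ValueError("records must share the same REF allele")

    # Union of ALT alleles, kept in order of first appearance.
    alts = []
    for r in records:
        for allele in r["alts"]:
            if allele not in alts:
                alts.append(allele)

    # Assumed QUAL rule: maximum QUAL among samples with a variant call.
    # Other rules (e.g. summing Phred scores under independence) are possible.
    qual = max((r["qual"] for r in records if r["alts"]), default=0.0)

    return {"ref": refs.pop(), "alts": alts, "qual": qual}


def is_mendelian_error(mother, father, child):
    """Return True if the child's diploid genotype cannot be inherited.

    Genotypes are tuples of allele indices, e.g. (0, 1) for a het call.
    """
    possible = {tuple(sorted((m, f))) for m in mother for f in father}
    return tuple(sorted(child)) not in possible


merged = merge_site([
    {"ref": "A", "alts": ["T"], "qual": 30.0},
    {"ref": "A", "alts": ["T", "G"], "qual": 45.0},
    {"ref": "A", "alts": [], "qual": 10.0},
])
```

With a merged QUAL per site and a Mendelian error flag per progeny genotype, one could then bin sites by QUAL and look at the error rate in each bin to see how well calibrated QUAL is.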