googlegenomics / gcp-variant-transforms

Support merging from different variant callers #113

Open arostamianfar opened 6 years ago

arostamianfar commented 6 years ago

While debugging a user issue (see thread), we came across non-standard fields from VarScan2 (e.g. the "AD" field is split into "AD" and "RD" fields, each having Number=1 instead of a single field with Number=R as specified in the VCF spec).

While the ideal solution is to change VarScan2 to output the AD field according to the VCF spec, that may not be practical, and it would not help with existing VCF files.

Idea: provide a "transformation plugin" for converting output from non-standard variant callers (e.g. in the VarScan2 case, it would merge AD and RD fields into a single field with Number=R). We can write a few common plugins ourselves, but it should be easy enough for users to write their own plugins as well (either in code or through a config file), since we may not be able to cover all corner cases.
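To make the idea concrete, here is a minimal sketch of what the VarScan2 transformation could look like for a single sample's FORMAT fields. It is an illustration only, not code from Variant Transforms: the function name `merge_rd_into_ad` and the plain-dict representation of a call are assumptions. It follows the VCF spec convention that a Number=R field lists the reference value first, followed by one value per alternate allele.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """Wrap a scalar in a list; copy an existing list or tuple."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def merge_rd_into_ad(call_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one sample's FORMAT fields with RD folded into AD.

    Hypothetical helper: assumes VarScan2-style input where RD holds the
    reference depth and AD holds the alternate depth, each as Number=1.
    The result has a single AD field laid out as [ref, alt, ...].
    """
    # Leave the call untouched unless both VarScan2-style fields are present.
    if "RD" not in call_fields or "AD" not in call_fields:
        return dict(call_fields)

    merged = {key: value for key, value in call_fields.items() if key != "RD"}
    merged["AD"] = _as_list(call_fields["RD"]) + _as_list(call_fields["AD"])
    return merged


example = {"GT": "0/1", "RD": 12, "AD": 7}
print(merge_rd_into_ad(example))  # {'GT': '0/1', 'AD': [12, 7]}
```

A real plugin would also need to rewrite the header so that AD is declared with Number=R and the RD definition is dropped; that part is not shown here.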

slagelwa commented 6 years ago

Just encountered this ourselves.

arostamianfar commented 6 years ago

Ah, thanks for the feedback! Is your case the same as VarScan2, or is it from a different variant caller?

slagelwa commented 6 years ago

sigh...varscan2

slagelwa commented 6 years ago

And to add: the VCF files we are trying to load contain calls from multiple variant callers, merged using bcftools.

slagelwa commented 6 years ago

While it may be an artifact of our VCF files, I observed records where the "RD" field also contained multiple values. Considering RD is supposed to be the depth of reference-supporting bases, and we only have one reference, I'm not certain why we'd encounter this.