precisely / web

Variant Call validation on `end` fails #354

Closed gcv closed 5 years ago

gcv commented 5 years ago

Failing input:

  {
    "data": {
      "filter": ".",
      "start": 28564346,
      "imputed": false,
      "altBases": [
        "A"
      ],
      "refBases": "T",
      "genotype": [],
      "refName": "chr17",
      "refVersion": "37p13",
      "sampleSource": "23andme",
      "userId": "auth0-5be62c27ec312320f5625d11",
      "sampleId": "bc28802f9fafc0cec0f457dd87834ec5f8e69ba4728100690802b23c5b451e23"
    }
  }

Error:

ValidationError: child \"end\" fails because [\"end\" contains an invalid value] on cv-variant-call

This doesn't make sense: `variant-call/models.ts` does not mark `end` as required, and the failing input passes no `end` at all. The schema only says `end: Joi.number().min(Joi.ref('start'))`, which suggests that `end`, when present, just needs to be at least as large as `start`.

This variant is then imputed, which seems to validate just fine:

  {
    "data": {
      "filter": "PASS",
      "start": 28564346,
      "imputed": true,
      "altBases": [
        "C"
      ],
      "genotypeLikelihood": [
        1,
        0,
        0
      ],
      "refBases": "T",
      "genotype": [
        0,
        0
      ],
      "refName": "chr17",
      "refVersion": "37p13",
      "sampleSource": "23andme",
      "userId": "auth0-5be62c27ec312320f5625d11",
      "sampleId": "bc28802f9fafc0cec0f457dd87834ec5f8e69ba4728100690802b23c5b451e23",
      "createdAt": "2018-11-10T13:48:07.967Z",
      "end": 28564347,
      "variantId": "chr17:37p13:28564346:28564347:23andme:bc28802f9fafc0cec0f457dd87834ec5f8e69ba4728100690802b23c5b451e23",
      "zygosity": "homozygous",
      "accession": "NC_000017.10"
    }
  }

gcv commented 5 years ago

For @aneilbaboo: Change the validation so that end is optional if readFail (a new data point) is true (it should default to false).

For @gcv: Change the bioinformatics validation to look for `filter` set to `.` and, when that happens, set `readFail` to true.
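
A rough sketch of the two tasks in this comment, written in plain TypeScript rather than against the actual Joi schema. The names `RawCall`, `withReadFail` and `checkEnd` are hypothetical and do not come from the code base; only the fields relevant to this issue are modelled.

```typescript
// Hypothetical shape: only the fields discussed in this issue.
interface RawCall {
  filter: string;
  start: number;
  end?: number;
  readFail?: boolean;
}

// Bioinformatics side: a filter of "." marks the call as a read failure.
function withReadFail(call: RawCall): RawCall {
  return { ...call, readFail: call.filter === "." };
}

// Validation side: end may be omitted only when readFail is true.
// When end is present it must be at least as large as start.
// Returns null when the call is acceptable, otherwise an error message.
function checkEnd(call: RawCall): string | null {
  const readFail = call.readFail === true; // a missing readFail counts as false
  if (call.end === undefined) {
    return readFail ? null : '"end" is required unless readFail is true';
  }
  return call.end >= call.start ? null : '"end" must be at least "start"';
}
```

With this, the failing input above (filter `.`, no `end`) would pass once `withReadFail` has been applied, while a call with filter `PASS` and no `end` would still be rejected.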

aneilbaboo commented 5 years ago

I removed end from VariantCall.

I looked more closely at VCF and determined that "end" is not required and the way we were using it is ambiguous. As it turns out, the userId plus chromosome:version:start:sampleType:sampleId is sufficient to uniquely identify a variantCall. I'm dropping "end".

The reason for this is that VCF represents actual "calls" of variants, not raw potentially overlapping sequence data. So the software that produces the VCF has resolved overlaps and presents a single call at a position.
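
For illustration, an identifier without `end` might be built roughly like this. It is a sketch only: the field names are taken from the JSON payloads earlier in this issue, the `:` separator mirrors the old `variantId`, and `CallKey` and `variantCallKey` are hypothetical names. It also assumes that "sampleType" above refers to the `sampleSource` field.

```typescript
// Hypothetical key shape, using field names from the payloads in this issue.
interface CallKey {
  userId: string;
  refName: string;      // chromosome, e.g. "chr17"
  refVersion: string;   // e.g. "37p13"
  start: number;
  sampleSource: string; // assumed to be what the comment calls sampleType
  sampleId: string;
}

// Builds chromosome:version:start:sampleSource:sampleId, kept alongside userId.
function variantCallKey(c: CallKey): { userId: string; variantId: string } {
  const variantId = [c.refName, c.refVersion, c.start, c.sampleSource, c.sampleId].join(":");
  return { userId: c.userId, variantId };
}
```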

aneilbaboo commented 5 years ago

@gcv - my changes require a small change in the bioinformatics code base.

The genotypeLikelihood param is now pluralized: genotypeLikelihoods. It is an array of numbers of a defined length. The length, it turns out, depends on the number of altBases. I wrote up some documentation about it in a code comment in a PR I am about to push. We can discuss.

Apart from changing genotypeLikelihood => genotypeLikelihoods, you shouldn't have to do anything else.
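
The PR's documentation is not reproduced in this thread. As a sketch, the usual VCF convention for a diploid sample is one likelihood per possible genotype, i.e. n(n+1)/2 values where n is the number of alleles (the reference plus each entry in altBases). The code below assumes diploid calls and that the PR follows this convention; `expectedLikelihoodCount` and `hasValidLikelihoods` are hypothetical names.

```typescript
// Number of genotype likelihoods expected for a diploid call.
// Alleles = 1 reference allele + one per alt base.
function expectedLikelihoodCount(altBases: string[]): number {
  const alleles = altBases.length + 1;
  return (alleles * (alleles + 1)) / 2;
}

// True when the array has exactly the expected number of entries.
function hasValidLikelihoods(altBases: string[], genotypeLikelihoods: number[]): boolean {
  return genotypeLikelihoods.length === expectedLikelihoodCount(altBases);
}
```

Under that assumption, a single alt base gives three values, which matches the imputed example above (`altBases` of `["C"]` with likelihoods `[1, 0, 0]`).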

aneilbaboo commented 5 years ago

Excuse me - there are three changes you'll need to make, the others being injecting a couple of booleans: readFail and imputed.

I'll put these in a separate issue.