nmdp-bioinformatics / phycus

Service used for curation of Haplotype Frequency
GNU Lesser General Public License v3.0

Quality Tags #79

Open mmaiers-nmdp opened 6 years ago

mmaiers-nmdp commented 6 years ago

Categorize the list (HH2016) further. Some are just "descriptive statistics" or "features". Some are indicators of how "good" the data is.

At DaSH8 we should implement a few of these using "AWS Lambda"

Simple examples:

  - RES_MISS_LOCI: depends on GT
  - Wn Statistic: global 2-locus pairwise LD (depends on GT)
  - DIV_50_REL: depends on HT only

fscheel commented 6 years ago

The following mechanisms must be part of the implementation:

  1. have a central place to plug in further metrics calculations during upload
  2. compute new metrics on existing datasets without service disruption
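
One way to picture both mechanisms is a small registry that the upload path and a backfill job share. This is only a minimal sketch, not how PHYCUS is implemented: the `METRICS` registry, the `metric` decorator, `compute_missing`, and the dictionary layout of a dataset (`sample_size`, `qualities`) are all hypothetical names chosen for illustration, and the `SAM_SIZE` calculation is a placeholder.

```python
from typing import Callable, Dict

# Hypothetical central registry: metric name -> function that computes it.
METRICS: Dict[str, Callable[[dict], float]] = {}


def metric(name: str):
    """Register a metric calculation under the given name."""
    def register(func: Callable[[dict], float]) -> Callable[[dict], float]:
        METRICS[name] = func
        return func
    return register


@metric("SAM_SIZE")
def sample_size(dataset: dict) -> float:
    # Placeholder: assumes the dataset carries a "sample_size" field.
    return float(dataset["sample_size"])


def compute_missing(dataset: dict) -> Dict[str, float]:
    """Compute every registered metric that the dataset does not have yet.

    Already stored values are left untouched, so the same function can run
    during upload and later as a backfill over existing datasets.
    """
    qualities = dataset.setdefault("qualities", {})
    for name, func in METRICS.items():
        if name not in qualities:
            qualities[name] = func(dataset)
    return qualities
```

Adding a metric then means adding one decorated function; running `compute_missing` over stored datasets fills in only the new value, which can happen in the background while the service keeps serving requests.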

fscheel commented 6 years ago

This is the list of defined "quality" metrics in PHYCUS right now:

          - DIV_LAMBDA
          - DIV_50
          - DIV_50_REL
          - SAM_SIZE
          - SAM_POP
          - DIV_PGD
          - DIV_HEAVY_TAIL
          - RES_TRS_COUNT
          - RES_TRS
          - RES_SHARE_AMBIG
          - RES_MISS_LOCI
          - DEV_HWE
          - ERR_STD
          - ERR_SAMP_80_100
          - SUM_FREQ_GAP
          - ERR_OFFSET
          - LD_MEASURE
          - KFOLD_IMPUTE
          - KFOLD_PRED_ACTUAL
          - KFOLD_N

It seems that we can calculate some of these ourselves. But right now we also accept all of these values from the client. How do we handle values that we receive but also calculate ourselves? Some options:

hpeberhard commented 6 years ago

I talked with Florian about these today and I would like to share a few thoughts with you.

  1. I believe it is important to get to a minimum valuable product / minimum loveable product soon.
  2. Loveable includes not perfect ;-).
  3. Hackers at hackathons should have something to do that has the potential to lead to a minimum * product.
  4. Take one metric of each kind to start: no GT needed, sample size needed, GT needed, e.g. DIV_50(_REL), DIV_PGD, RES_MISS_LOCI.
  5. I would opt for "verify them and return a warning if they do not match". However, I can live with any other option you choose.
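
The "verify them and return a warning if they do not match" option could look roughly like the sketch below. The function name `verify_qualities`, the dictionary shapes, and the relative tolerance are assumptions made for illustration; nothing here is taken from the PHYCUS code.

```python
import math
from typing import Dict, List


def verify_qualities(submitted: Dict[str, float],
                     computed: Dict[str, float],
                     rel_tol: float = 1e-6) -> List[str]:
    """Return one warning per submitted value that disagrees with the server's."""
    warnings = []
    for name, server_value in computed.items():
        if name not in submitted:
            continue  # the client sent nothing for this metric
        if not math.isclose(submitted[name], server_value, rel_tol=rel_tol):
            warnings.append(
                f"{name}: submitted {submitted[name]} "
                f"differs from computed {server_value}"
            )
    return warnings
```

The upload would still succeed; the returned list would simply be attached to the response so that the submitter can see which values the server disagrees with.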

sauter commented 6 years ago

I would suggest going through the list and classifying the metrics according to the inputs needed for their calculation. For those metrics that can be calculated on the fly during upload, the service should compute them and neither expect nor accept user input.
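
Such a classification could be kept as a simple lookup table. The sketch below only includes the three metrics whose inputs were named earlier in this thread; `REQUIRED_INPUTS`, `server_computable`, and the `SAMPLE_SIZE` label are hypothetical, and the entries for the remaining metrics would still have to be worked out.

```python
from typing import Dict, FrozenSet, Set

# GT and HT are the thread's own shorthand; SAMPLE_SIZE is an assumed label.
REQUIRED_INPUTS: Dict[str, FrozenSet[str]] = {
    "DIV_50_REL": frozenset({"HT"}),         # thread: "depends on HT only"
    "RES_MISS_LOCI": frozenset({"GT"}),      # thread: "depends on GT"
    "DIV_PGD": frozenset({"SAMPLE_SIZE"}),   # thread: "sample size needed"
}


def server_computable(available: Set[str]) -> Set[str]:
    """Metrics whose required inputs are all present in this upload."""
    return {
        name for name, needs in REQUIRED_INPUTS.items()
        if needs <= available
    }
```

For an upload that carries only HT data, `server_computable({"HT"})` yields just `DIV_50_REL`, so that would be the only metric the service computes itself and stops accepting from the user.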

mmaiers-nmdp commented 6 years ago

We (@mmaiers-nmdp & @fscheel) reviewed the list of quality metrics, and most of them should be computed by the server and not accepted from the client. The spec does have a place for them in the hfc submission.

We see 4 options for how to deal with quality list values submitted by the client:

  1. The server can’t accept it because there is no place for it. (-) Too much work to have two different structures/versions of the code; danger that someone changes one but not the other.
  2. The server WON'T accept it: return an error, do not persist data. (-) Too strict.
  3. The server will SILENTLY ignore it: persist HF data, but not the quality list for qualities that we want the server to compute. (+) Easy; if someone complains, then change it.
  4. The server will ignore the quality list and return a WARNING. (-) Nobody will read it; over-engineered.
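
Options 2 to 4 differ only in what happens when a client sends a value for a server-computed quality, so they can be sketched as one switchable policy. Option 1 is a schema decision rather than runtime behaviour and is left out. The names `ClientQualityPolicy` and `apply_policy` are hypothetical, and this is an illustration, not the PHYCUS implementation.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class ClientQualityPolicy(Enum):
    REJECT = 2          # option 2: return an error, persist nothing
    SILENT_IGNORE = 3   # option 3: drop the values without comment
    WARN = 4            # option 4: drop the values and report a warning


def apply_policy(submitted: Dict[str, float],
                 server_computed: Set[str],
                 policy: ClientQualityPolicy
                 ) -> Tuple[Dict[str, float], List[str]]:
    """Return the client qualities to persist plus any warnings."""
    clashes = sorted(set(submitted) & server_computed)
    if clashes and policy is ClientQualityPolicy.REJECT:
        raise ValueError(
            f"server-computed qualities must not be submitted: {clashes}"
        )
    # Options 3 and 4 both discard the clashing client values.
    kept = {k: v for k, v in submitted.items() if k not in server_computed}
    warnings: List[str] = []
    if policy is ClientQualityPolicy.WARN:
        warnings = [f"ignored client value for {name}" for name in clashes]
    return kept, warnings
```

Starting with `SILENT_IGNORE` and switching later is a one-line change in this shape, which matches the "if someone complains, then change it" argument for option 3.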