precisely / gene-panel-curation

A repository for organizing Precise.ly's curation effort, covering content, curation documentation, and issue tracking.

Create seed genetics data #3

Closed aneilbaboo closed 6 years ago

aneilbaboo commented 6 years ago

We’re going to try to stand up a report with sample user data for the current sprint (by Wednesday 21st).

We’ll need files representing a tiny subset of data that the genetics service will produce, for a handful of mock users. The goal is to provide data to exercise the mechanisms for report generation.

@visheshd ^

taltman commented 6 years ago

@aneilbaboo Almost all of the data from Matty's mock-up report for MTHFR has been distilled into a set of YAML and Markdown files in the repo. The organization of the data can still be fine-tuned, but it should be enough to start having those discussions and to get code parsing the files. I will start documenting the format so that Vishesh & Matty can look it over. Will they be able to access the Wiki for this repo?

taltman commented 6 years ago

Ah, I misunderstood. Here is the mock data in GA4GH JSON format: https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data

aneilbaboo commented 6 years ago

These are the input files to the (Genetics) Analytics Service. We need the output of that service: JSON files suitable for loading into the Genetics Service - the yellow highlighted parts here:

[Image: screen shot 2018-02-14 at 2 10 16 pm]

See also the Database architecture document: https://docs.google.com/document/d/1E31Oted7_QN7bCbjnJN6k1X6eP9b-rFcI-uV8vjsxbg/edit

aneilbaboo commented 6 years ago

Specifically, we need to know: What are the SVN variant names for the various states? What are the gene names? How will various situations be represented in the genetics service?

We need JSON that will contain values for the fields in the Genetics model:

[
  {
    "user_data_type_id": "{barcode-id-from-akesogen}",
    "gene": "mthfr",
    "source": "akesogen:genotyping",
    "labAnalysisId": "...",  // equivalent to the variantsetId --- identifies a particular reading
    "variant": "....",       // <--- need your help here
    "createdAt": ...,
    "updatedAt": ...
  },
  {
    // ... another gene genotype for the same user
  },
  ... etc
]
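
To make the record shape above concrete, here is a minimal Python sketch that builds one mock record and checks it for the listed fields. It is not the Genetics Service loader. The helper names (`make_record`, `missing_fields`), the barcode and analysis IDs, the ISO 8601 timestamp format, and the variant string are all assumptions for illustration. The variant value in particular is a placeholder, since the naming scheme is still the open question here.

```python
import json
from datetime import datetime, timezone

# Field names copied from the Genetics model listing above.
REQUIRED_FIELDS = (
    "user_data_type_id",
    "gene",
    "source",
    "labAnalysisId",
    "variant",
    "createdAt",
    "updatedAt",
)


def make_record(barcode_id, gene, variant, lab_analysis_id,
                source="akesogen:genotyping", now=None):
    """Build one per-gene genotype record (hypothetical helper)."""
    # Assumption: timestamps are stored as ISO 8601 strings in UTC.
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "user_data_type_id": barcode_id,
        "gene": gene.lower(),
        "source": source,
        "labAnalysisId": lab_analysis_id,
        "variant": variant,
        "createdAt": timestamp,
        "updatedAt": timestamp,
    }


def missing_fields(record):
    """Return the names of required fields that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Arbitrary fixed timestamp so the mock output is reproducible.
fixed_time = datetime(2018, 2, 21, tzinfo=timezone.utc)
seed = [
    make_record("mock-barcode-001", "MTHFR", "placeholder-variant-1",
                "mock-analysis-001", now=fixed_time),
]
seed_json = json.dumps(seed, indent=2)
```

`seed_json` can then be written to a file per mock user. A record that lacks any of the seven fields shows up in `missing_fields`.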
taltman commented 6 years ago

And here is the first iteration of the report generation JSON input: https://github.com/precisely/gene-panel-curation/blob/master/mock-user-data/report_input/MTHFR_C677T-WT_A1298C-heterozygous.json

It will be completed in the next two days (including the metadata fields described above), following design discussions tomorrow.

aneilbaboo commented 6 years ago

Initial thoughts:

  1. Separate out gene annotation from user genetics
    • seed data needs to be input to the Genetics Service (not input to a report)
    • the genetics service will produce the input to a report
  2. Do we need a separate format and service for storing gene annotation information?
  3. Top level structure should represent a user, not a gene
    • Seed data should be a file that the GAS outputs
    • E.g., GAS takes 23andMe file as input, outputs a file of user's variants
    • E.g., GAS takes Akesogen raw genotype input file, outputs a file of user's variants
    • What other info do we need to store?
  4. For SVN, should we use coding indexes rather than genomic indexes?
    • insulates us from shifts in genomic index
    • provides a way to generate a meaningful name for the genotype if a nickname (like C677T) isn't available

@taltman ^

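
A rough Python sketch of points 3 and 4, assuming nothing about the real GAS output format: flat per-gene rows are regrouped so the top-level structure is a user, and a variant name is built from a coding index when no nickname exists. The function names, the `user_id` and `zygosity` keys, and the `GENE:c.<pos><ref>><alt>` string layout are hypothetical. The positions are read off the nicknames C677T and A1298C and have not been checked against a reference transcript.

```python
from collections import defaultdict


def coding_name(gene, position, ref, alt):
    """Build a name from a coding-sequence index (hypothetical layout)."""
    return f"{gene.upper()}:c.{position}{ref}>{alt}"


def group_by_user(rows):
    """Regroup flat per-gene rows into one document per user."""
    users = defaultdict(list)
    for row in rows:
        users[row["user_id"]].append({
            "gene": row["gene"],
            "variant": row["variant"],
            "zygosity": row["zygosity"],
        })
    # Dicts preserve insertion order, so users appear in first-seen order.
    return [{"user_id": uid, "variants": variants}
            for uid, variants in users.items()]


# Mock rows only; the values are illustrative, not curated data.
rows = [
    {"user_id": "mock-user-1", "gene": "mthfr",
     "variant": coding_name("mthfr", 677, "C", "T"), "zygosity": "wildtype"},
    {"user_id": "mock-user-1", "gene": "mthfr",
     "variant": coding_name("mthfr", 1298, "A", "C"), "zygosity": "heterozygous"},
    {"user_id": "mock-user-2", "gene": "mthfr",
     "variant": coding_name("mthfr", 677, "C", "T"), "zygosity": "homozygous"},
]
user_documents = group_by_user(rows)
```

Each entry of `user_documents` is one candidate per-user seed file. Whether zygosity belongs next to the variant name or inside it is one of the things still to be decided.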
taltman commented 6 years ago

Just updated the report JSON files as per format discussion with @aneilbaboo last week: https://github.com/precisely/gene-panel-curation/tree/master/mock-user-data/report_input

@aneilbaboo Regarding additional genes: GRIK3 doesn't have any phenotypes that I am aware of. Perhaps at this point the dev team can issue tickets against this repo for specific use cases, conditions, or unit tests that they need, and I can create the corresponding JSON files for them?

aneilbaboo commented 6 years ago

Issue moved to precisely/bioinformatics #9 via ZenHub