ncats / translator-workflows

Workflow 4: Orange Team Implementation of Modules 1-3 #33

Open jaredroach opened 5 years ago

jaredroach commented 5 years ago

Module 1: Identify patient cohort (this requires identifying the overall cohort as well as implementing the user-supplied specification of each sub-cohort; the sub-cohorts need not be disjoint. For example, one sub-cohort might be the list of all individuals with asthma and the other sub-cohort could be the list of all individuals with asthma and hypertension. That said, I would imagine most users would specify disjoint sub-cohorts.)

- Use Translator clinical services (COHD, ICEES, Clinical Profiles, Clinical Fingerprints) to define the patient cohort and retrieve ICD/LOINC/MEDCTN codes.
- Retain provenance of data, including relevant statistics (e.g., sample size, P values).
- Data ingest pipeline for generating clinical profiles from existing clinical datasets, including from standardized formats and data models.

Module 2: Phenotype mapping to individuals (note that this could in principle be pre-computed and need not be done downstream of cohort identification unless it is too expensive to do on all individuals).

- Use the LOINC2HPO tool, BioNames, or another tool/approach to map ICD/LOINC/MEDCTN codes to HPO identifiers.

Module 3: Cluster phenotypes (the output is presumably a single phenotype or a list of phenotypes, with one such output for each patient sub-cohort). Alternatively, rank all the phenotypes in a sub-cohort and output the top 1 (or N) from this ranked list.

- Apply an algorithm to cluster phenotypes. This module may be valuable if the output from Module 2 has a lot of very similar phenotypes and/or synonyms for the same phenotype. These can be collapsed into a single phenotype to reduce the verbosity being fed downstream to the rest of the workflow.
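To make the intended data flow concrete, here is a minimal Python sketch of the three modules chained together. Everything in it is an assumption for illustration: the patient records are toy data, `CODE_TO_HPO` is a hand-written lookup standing in for whatever LOINC2HPO, BioNames, or another mapper would return, and the function names (`select_subcohort`, `map_to_hpo`, `rank_phenotypes`) are hypothetical. It implements the simpler "rank and take the top N" alternative for Module 3, not a real clustering algorithm, and it does not call any Translator service.

```python
from collections import Counter

# Toy patient records (illustrative only): patient ID -> set of ICD-10 codes.
PATIENTS = {
    "p1": {"J45.909"},
    "p2": {"J45.20", "I10"},
    "p3": {"J45.909", "I10"},
    "p4": {"I10"},
}

# Hypothetical lookup standing in for the output of a real mapping tool.
# Two asthma codes map to the same HPO term, so they collapse in Module 2.
CODE_TO_HPO = {
    "J45.909": "HP:0002099",  # Asthma
    "J45.20": "HP:0002099",   # Asthma
    "I10": "HP:0000822",      # Hypertension
}


def select_subcohort(patients, required_prefixes):
    """Module 1 sketch: IDs of patients with at least one code per prefix."""
    return {
        pid
        for pid, codes in patients.items()
        if all(any(c.startswith(p) for c in codes) for p in required_prefixes)
    }


def map_to_hpo(codes, lookup):
    """Module 2 sketch: map codes to HPO IDs; unmapped codes are dropped."""
    return {lookup[c] for c in codes if c in lookup}


def rank_phenotypes(patients, cohort, lookup, top_n=1):
    """Module 3 sketch: count each HPO term once per patient, keep the top N.

    Ties are broken by HPO ID so the result is deterministic.
    """
    counts = Counter()
    for pid in cohort:
        counts.update(map_to_hpo(patients[pid], lookup))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Two overlapping sub-cohorts, as in the asthma / asthma + hypertension example.
asthma = select_subcohort(PATIENTS, ["J45"])
asthma_htn = select_subcohort(PATIENTS, ["J45", "I10"])

print(sorted(asthma), sorted(asthma_htn))
print(rank_phenotypes(PATIENTS, asthma, CODE_TO_HPO, top_n=2))
```

Because each patient contributes a set of HPO terms, synonymous source codes are collapsed before counting. A real Module 3 might instead group terms by ontology similarity, which this sketch does not attempt.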

Orange Team bids on these modules: 2 and 3, where 2 = "I can already provide code to answer this" and 3 = "I understand and I will develop code to answer it."

Please develop this code and provide the GitHub link to it, ideally along with any documentation or comments. A concrete example with inputs and outputs would be helpful.

jaredroach commented 5 years ago

Note that there are already two links to GitHub code for portions of these modules entered into the Bid Worksheet:

- Retain provenance of data, including relevant statistics (e.g., sample size, P values): https://github.com/translational-informatics/clinical-profile-registry/blob/master/README.md
- Data ingest pipeline for generating clinical profiles from existing clinical datasets, including from standardized formats and data models: https://github.com/translational-informatics/crepes