Workflow 4: Orange Team Implementation of Modules 1-3

Module 1: Identify patient cohort (this requires identifying the overall cohort as well as implementing the user-supplied specification of each sub-cohort; the sub-cohorts need not be disjoint. For example, one sub-cohort might be the list of all individuals with asthma and another could be the list of all individuals with asthma and hypertension. That said, I would imagine most users would specify disjoint sub-cohorts.)

Use Translator clinical services (COHD, ICEES, Clinical Profiles, Clinical Fingerprints) to define patient cohort and retrieve ICD/LOINC/MEDCTN codes

Retain provenance of data, including relevant statistics (e.g., sample size, P values)

Data ingest pipeline for generating clinical profiles from existing clinical datasets, including from standardized formats and data models.

Module 2: Phenotype mapping to individuals (note that this could in principle be pre-computed and need not be done downstream of cohort identification, unless it is too expensive to do on all individuals)

Use LOINC2HPO tool, BioNames, or another tool/approach to map ICD/LOINC/MEDCTN codes to HPO identifiers

Module 3: Cluster phenotypes (output is presumably a single phenotype or a list of phenotypes; one such output for each patient sub-cohort). Alternatively, rank all the phenotypes in a sub-cohort and output the top 1 (or N) from this ranked list.

Apply algorithm to cluster phenotypes; this module may be valuable if the output from Module 2 has a lot of very similar phenotypes and/or synonyms for the same phenotype. These can be collapsed into a single phenotype to reduce the verbosity being fed downstream to the rest of the workflow.
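To make the three modules concrete, here is a minimal sketch in Python of how the pieces could fit together. It is not the Orange Team implementation: the patient records, the code-to-HPO table, the collapse table, and all function names are hypothetical stand-ins for what COHD/ICEES and LOINC2HPO would actually return. It ranks phenotypes by how many patients in a sub-cohort carry them (the "rank and take the top N" alternative to clustering) and keeps the sample size alongside the result as a simple form of provenance.

```python
from collections import Counter

# Hypothetical input: patient id -> set of clinical codes.
# In practice these would come from a Translator clinical service.
PATIENT_CODES = {
    "p1": {"ICD10:J45", "ICD10:I10"},
    "p2": {"ICD10:J45"},
    "p3": {"ICD10:J45", "ICD10:I10", "ICD10:I15"},
    "p4": {"ICD10:E11"},
}

# Illustrative code-to-HPO lookup (assumption, not LOINC2HPO output).
CODE_TO_HPO = {
    "ICD10:J45": "HP:0002099",  # Asthma
    "ICD10:I10": "HP:0000822",  # Hypertension
    "ICD10:I15": "HP:PLACEHOLDER_SECONDARY_HTN",  # made-up id for the demo
}

# Illustrative collapse table: near-duplicate phenotype -> representative.
COLLAPSE = {"HP:PLACEHOLDER_SECONDARY_HTN": "HP:0000822"}


def select_cohort(patient_codes, required):
    """Module 1: patients whose codes include every required code."""
    return {pid for pid, codes in patient_codes.items() if required <= codes}


def patient_phenotypes(codes, code_to_hpo, collapse):
    """Module 2: map codes to HPO ids, dropping unmapped codes,
    then fold near-duplicates into their representative."""
    hpo = {code_to_hpo[c] for c in codes if c in code_to_hpo}
    return {collapse.get(h, h) for h in hpo}


def rank_phenotypes(cohort, patient_codes, code_to_hpo, collapse, top_n=None):
    """Module 3 (ranking variant): count patients per phenotype and
    return the top_n, together with the cohort sample size."""
    counts = Counter()
    for pid in cohort:
        counts.update(patient_phenotypes(patient_codes[pid], code_to_hpo, collapse))
    # Sort by descending patient count, then by id for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"sample_size": len(cohort), "phenotypes": ranked[:top_n]}


# Two overlapping sub-cohorts, as in the asthma / asthma + hypertension example.
asthma = select_cohort(PATIENT_CODES, {"ICD10:J45"})
asthma_htn = select_cohort(PATIENT_CODES, {"ICD10:J45", "ICD10:I10"})

asthma_result = rank_phenotypes(asthma, PATIENT_CODES, CODE_TO_HPO, COLLAPSE)
asthma_htn_result = rank_phenotypes(asthma_htn, PATIENT_CODES, CODE_TO_HPO, COLLAPSE)
```

Each phenotype is counted once per patient, so a patient with two codes that collapse to the same HPO term does not inflate its rank. A real clustering step (for example, one based on semantic similarity over the HPO hierarchy) could replace the hand-written collapse table.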

Orange Team bids on these modules: 2 & 3

2 = I can already provide code to answer this.
3 = I understand and I will develop code to answer it.

Please develop this code and provide the GitHub link to it, ideally along with any documentation or comments. A concrete example with inputs and outputs would be helpful.

ncats/translator-workflows #33

"Retain provenance of data, including relevant statistics (e.g., sample size, P values)": https://github.com/translational-informatics/clinical-profile-registry/blob/master/README.md