ncats / translator-workflows


Workflow 5: Green/Gamma Team Implementation of Modules 1-4 & 5-8 #36

Open karafecho opened 5 years ago

karafecho commented 5 years ago

Scroll below to find updates to plan

Overview

Green/Gamma Team is approaching Workflow 5 using ICEES as the source of clinical data. In consideration of the design of ICEES, the team has decided to collapse Modules one through four of the workflow. In addition, a Jupyter notebook will be used to call ICEES and integrate with Gamma for subsequent modules.

Two questions will be asked:

  1. Question 1 - Find phenotypic profiles associated with environmental exposures (e.g., medication exposures, chemical exposures, socio-economic exposures) among patients with asthma-like disease who are or are not responsive to treatment.

This question will be fully evaluated by a SME (D. Peden) and serve as the basis of a TIDBIT.

  2. Question 2 - Find phenotypic profiles associated with environmental exposures (e.g., medication exposures, chemical exposures, socio-economic exposures) among patients with asthma-like disease who live in rural areas versus urban areas.

This question will allow us to begin to more thoroughly explore the ACS data available through our Socioenvironmental Exposures API in the context of a workflow. In particular, the question will allow us to "stress test" our binning strategy.

Note that the output of modules one through four will be of the same entity type for both Question 1 and Question 2; thus, subsequent modules for workflow 5 will be identical for both questions.

karafecho commented 5 years ago

Question 1: ICEES functionality 4

Input parameters: features (TotalEDInpatientVisits <2 or >=2); version: 1.0.0; table: patient; year: 2010; cohort_id: COHORT:22

Output (from full output): PM2.5, ozone, medications, and socio-economic exposures (the last cannot serve as input in downstream modules)

Output includes counts of patients by bin, adjusted Chi Square Statistics, and P values

Exposures that significantly differ between the two groups of patients will be used as input for module five, but separate streams of operations will be maintained with annotation indicating which group was "higher" and which group was "lower".

karafecho commented 5 years ago

Question 2: ICEES functionality 4

Input parameters: features (EstResidentialDensity <4 or >=4 OR <5 or >=5; to be decided or rebinned); version: 1.0.0; table: patient; year: 2010; cohort_id: COHORT:22

*The US Census Bureau identifies two types of urban areas: Urbanized Areas (UAs) of 50,000 or more people and Urban Clusters (UCs) of at least 2,500 and less than 50,000 people. “Rural” encompasses all population, housing, and territory not included within one of the two urban areas.

The ICEES patient population is largely rural, so the Census Bureau definitions may not apply to this use case.*

Output (from full output): PM2.5, ozone, medications, and socio-economic features (the last cannot serve as input in downstream modules)

Output includes counts of patients by bin, adjusted Chi Square Statistics, and P values

Exposures that significantly differ between the two groups of patients will be used as input for module five, but separate streams of operations will be maintained with annotation indicating which group was "higher" and which group was "lower".

karafecho commented 5 years ago

Note that Green/Gamma is working with the BioLink folks to develop high-level concepts for ICEES feature variables, in order to properly incorporate ICEES data into the BioLink data model.

colinkcurtis commented 5 years ago

See https://github.com/ncats/translator-workflows/blob/master/Workflow5/Workflow5_notebook.ipynb

karafecho commented 5 years ago

US Census Bureau rural, urban definitions

The Census Bureau's urban-rural classification is fundamentally a delineation of geographical areas, identifying both individual urban areas and the rural areas of the nation. The Census Bureau's urban areas represent densely developed territory, and encompass residential, commercial, and other non-residential urban land uses. For the 2010 Census, an urban area will comprise a densely settled core of census tracts and/or census blocks that meet minimum population density requirements, along with adjacent territory containing non-residential urban land uses as well as territory with low population density included to link outlying densely settled territory with the densely settled core. To qualify as an urban area, the territory identified according to criteria must encompass at least 2,500 people, at least 1,500 of which reside outside institutional group quarters.

The Census Bureau identifies two types of urban areas:

Urbanized Areas (UAs) of 50,000 or more people; Urban Clusters (UCs) of at least 2,500 and less than 50,000 people. "Rural" encompasses all population, housing, and territory not included within an urban area.

The specific criteria used to define urban areas for the 2010 Census were published in the Federal Register of August 24, 2011.

karafecho commented 5 years ago

@xu-hao : Let's bin EstResidentialDensity as defined above by the US Census Bureau.
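
A minimal sketch of the Census-based binning, for illustration only. It assumes the input is the population of the area a patient resides in, which may not be how EstResidentialDensity is actually encoded in ICEES; the function names are hypothetical, the 1 (rural) / 2 (urban) coding follows the updated plan later in this thread, and the Census requirement about residents outside institutional group quarters is ignored.

```python
def census_area_type(population: int) -> str:
    """Classify an area by the 2010 Census Bureau population thresholds."""
    if population >= 50_000:
        return "Urbanized Area"
    if population >= 2_500:
        return "Urban Cluster"
    return "Rural"


def urban_rural_bin(population: int) -> int:
    """Collapse the Census classes into two bins: 1 = rural, 2 = urban (assumed coding)."""
    return 1 if census_area_type(population) == "Rural" else 2


if __name__ == "__main__":
    for pop in (1_200, 2_500, 49_999, 50_000):
        print(pop, census_area_type(pop), urban_rural_bin(pop))
```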

cmungall commented 5 years ago

cc-ing @diatomsRcool who is working on ECTO exposures

karafecho commented 5 years ago

Thanks, Chris!

@balhoff @stevencox : Let's coordinate with @diatomsRcool and perhaps loop in Sarav and Alex Valencia (his student).

diatomsRcool commented 5 years ago

Do we need a meeting? I'm really not up to speed on translator stuff.

cmungall commented 5 years ago

Don't worry, I think it's premature to have a meeting, at most an ECTO ticket request

karafecho commented 5 years ago

Agreed! My intent was simply to make sure that we coordinate (and not duplicate) efforts.

karafecho commented 5 years ago

Updated plan for implementation of Workflow 5:

  1. Use functionality four in ICEES to stratify/cluster by TotalEDInpatientVisits (<2 vs >=2) and return chemical exposures that demonstrate a significant difference between the strata. The exposures will be airborne pollutants and medications. The output list will be passed to ROBOKOP for execution of queries in the form: "chemical substance -> gene -> biological process/activity -> phenotype".

  2. Use functionality four in ICEES to stratify/cluster by EstResidentialDensity (1 [rural] vs 2 [urban]) and return chemical exposures that demonstrate a significant difference between the strata. The exposures will be airborne pollutants and medications. The output list will be passed to ROBOKOP for execution of queries in the form: "chemical substance -> gene -> biological process/activity -> phenotype".

  3. Use functionality four in ICEES to stratify/cluster by Sex2 (Male vs Female) and return chemical exposures that demonstrate a significant difference between the strata. The exposures will be airborne pollutants and medications. The output list will be passed to ROBOKOP for execution of queries in the form: "chemical substance -> gene -> biological process/activity -> phenotype".

  4. Use COHD to stratify/cluster by Sex. See COHD UI and a query template plus a specific instance of Workflow 5. Retrieve top 20 medications (based on frequency) for each sub-cohort. The output list will be passed to ROBOKOP for execution of queries in the form: "chemical substance -> gene -> biological process/activity -> phenotype".

  5. Use Clinical Profiles to identify/create sub-cohorts of males and females with asthma. Retrieve top 20 medications (based on frequency) for each sub-cohort. The output list will be passed to ROBOKOP for execution of queries in the form: "chemical substance -> gene -> biological process/activity -> phenotype".

Note re ICEES:

We will need to capture directionality as part of the output for the workflow. By "directionality", I mean that we need to capture which stratum is "enriched" for a given phenotype (i.e., has a higher percentage of patients with XXX). The Chi Square statistic that ICEES provides informs one of differences between groups or bins, but it does not provide any information on the directionality of the differences. Relative risks and odds ratios may suffice.

Notes re (1)-(5) above:

  1. @mbrush, @stevencox, and several other folks have been working to identify the best approach for extending the BioLink data model to support clinical data. For ICEES, we tentatively converged on a 'BioLink' adapter to execute Workflow 5, but we are considering alternative approaches.
  2. Ideally, (1)-(5) will invoke not just ROBOKOP, but other 'Reasoners'.

A. ICEES example query

Input:

Feature variables: AvgDailyPM2.5Exposure < 3, TotalEDInpatientVisits < 2; Version of data: 1.0.0; Table: patient; Year: 2010; Cohort ID: COHORT:22

Output:*

+----------------------------+------------------------------+-------------------------------+---------+
| feature                    | TotalEDInpatientVisits < 2   | TotalEDInpatientVisits >= 2   |         |
+============================+==============================+===============================+=========+
| AvgDailyPM2.5Exposure < 3  | 297 91.10%                   | 29 8.90%                      | 326     |
|                            | 5.85% 4.66%                  | 2.22% 0.45%                   | 5.11%   |
+----------------------------+------------------------------+-------------------------------+---------+
| AvgDailyPM2.5Exposure >= 3 | 4776 78.90%                  | 1277 21.10%                   | 6053    |
|                            | 94.15% 74.87%                | 97.78% 20.02%                 | 94.89%  |
+----------------------------+------------------------------+-------------------------------+---------+
|                            | 5073                         | 1306                          | 6379    |
|                            | 79.53%                       | 20.47%                        | 100.00% |
+----------------------------+------------------------------+-------------------------------+---------+

+-------------+---------------+
| p_value     | chi_squared   |
+=============+===============+
| 3.16593e-06 | 28.2841       |
+-------------+---------------+

*AvgDailyPM2.5Exposure <3 range: 1.58, 9.63 µg/m3; AvgDailyPM2.5Exposure >=3 range: 9.63, 17.33 µg/m3; TotalEDInpatientVisits = # emergency department or inpatient visits for a respiratory issue over a one-year ‘study’ period (the example here is for calendar year 2010).
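
As a rough illustration of the directionality note above, the following sketch computes an odds ratio and a relative risk from the four cell counts in the example table, alongside an uncorrected Pearson chi-square. The function name and the returned keys are hypothetical, and this is not how ICEES itself computes or adjusts its statistics; in particular, no attempt is made to reproduce the reported p value.

```python
def two_by_two_stats(a: int, b: int, c: int, d: int) -> dict:
    """Summarize a 2x2 table laid out as:

                      outcome < 2   outcome >= 2
    exposure < 3           a             b
    exposure >= 3          c             d
    """
    n = a + b + c + d
    # Uncorrected Pearson chi-square for a 2x2 table.
    chi_squared = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Odds of the first outcome column in row 1 relative to row 2.
    odds_ratio = (a * d) / (b * c)
    # Proportion with the first outcome in row 1 relative to row 2.
    relative_risk = (a / (a + b)) / (c / (c + d))
    # Which exposure stratum has the larger share of the first outcome column.
    enriched_row = "row 1" if odds_ratio > 1 else "row 2"
    return {
        "chi_squared": chi_squared,
        "odds_ratio": odds_ratio,
        "relative_risk": relative_risk,
        "enriched_row_for_first_outcome": enriched_row,
    }


# Cell counts taken from the example table (PM2.5 bin by ED/inpatient visit bin).
stats = two_by_two_stats(297, 29, 4776, 1277)

if __name__ == "__main__":
    for key, value in stats.items():
        print(key, value)
```

With these counts the uncorrected chi-square comes out at roughly 28.28, in line with the chi_squared value in the table, and the odds ratio is above 1, meaning the lower PM2.5 bin has the larger share of patients with fewer than two visits.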

B. COHD example queries

Input: Asthma (ID #317009) and Black or African American (ID #8516)

Output: { "concept_2_count": 208438, "concept_id_1": 317009, "concept_id_2": 8516, "concept_pair_count": 11716, "dataset_id": 2, "relative_frequency": 0.05620856081904451 }

Input: Asthma (ID #317009) and White (ID #8527)

Output: { "concept_2_count": 601167, "concept_id_1": 317009, "concept_id_2": 8527, "concept_pair_count": 29913, "dataset_id": 2, "relative_frequency": 0.049758220261591206 }
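
For comparing the two COHD results, a small sketch follows. It assumes, based only on the numbers shown above and not on COHD documentation, that relative_frequency equals concept_pair_count divided by concept_2_count; the helper name is hypothetical.

```python
# The two COHD outputs above, copied as Python dicts.
asthma_black = {
    "concept_2_count": 208438,
    "concept_id_1": 317009,
    "concept_id_2": 8516,
    "concept_pair_count": 11716,
    "dataset_id": 2,
    "relative_frequency": 0.05620856081904451,
}
asthma_white = {
    "concept_2_count": 601167,
    "concept_id_1": 317009,
    "concept_id_2": 8527,
    "concept_pair_count": 29913,
    "dataset_id": 2,
    "relative_frequency": 0.049758220261591206,
}


def recomputed_relative_frequency(result: dict) -> float:
    """Assumed definition: share of concept 2 patients who also have concept 1."""
    return result["concept_pair_count"] / result["concept_2_count"]


# Ratio of the two asthma frequencies (first sub-cohort relative to the second).
frequency_ratio = (
    recomputed_relative_frequency(asthma_black)
    / recomputed_relative_frequency(asthma_white)
)

if __name__ == "__main__":
    print(recomputed_relative_frequency(asthma_black))
    print(recomputed_relative_frequency(asthma_white))
    print(frequency_ratio)
```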

C. Clinical Profiles links

HAPI-FHIR

Custom Translator JHU Clinical Profiles Build

qianzhu2018 commented 5 years ago

Hi Kara, just curious: is there any reason the COHD implementation is run only by Sex, instead of running the same experiments as ICEES so that we can do a comparison or cross-validation afterwards? Is the plan proposed for the Hackathon? Thanks! Qian

karafecho commented 5 years ago

Hi Qian. The variables defined in (1) and (2) above are specific to ICEES and not available in COHD. (3) and (4) are intended to cross-validate output, as you noted. I'm hoping to do something similar for Green Team's Implementation of Workflow 4.

WRT the hackathon, I'm hoping that we can extend the plan above to include additional teams.

karafecho commented 5 years ago

@stevencox @colinkcurtis @xu-hao : I'm wondering where we stand with (1) above, in terms of modules 1-4 and modules 5-8. I realize you all shifted your focus to (2), but I think (1) might serve best as a use case for SME evaluation (Dave Peden) during the hackathon. Plus, I'm developing a second ICEES manuscript that follows the first one and will focus on the outcome variable 'TotalEDInpatientVisits', so execution of (1) would align nicely with those efforts.

colinkcurtis commented 5 years ago

@karafecho I will pivot towards (1) again. In what I have been doing it was incidental that I began focusing on (2). I'll update when I have an executable CWL/Ros WF5 for (1). Tentatively, before Monday.

karafecho commented 5 years ago

@webyrd @dkoslicki : Please take a look at the above Green/Gamma action plan for execution of Workflow 5, Modules 1-4, as well as the action plan for execution of Workflow 5, Modules 5-8 (#37). If you're interested, I'd be happy to discuss approaches for Alpha and X-Ray to contribute to this workflow.

karafecho commented 5 years ago

See TranQL implementation of Workflow 5, which is related to Workflow 4, here.

karafecho commented 5 years ago

WORKFLOW INPUT:

See ICEES_FeatureVariables and ICEES_Identifiers here for chemicals and medications. Note that these docs are updated as new variables are added to the ICEES integrated feature tables.

WORKFLOW (Gamma) QUESTION TEMPLATE:

{
  "name": "Gamma WF5 template",
  "natural_question": "Chemical to gene to biological process/activity to phenotypic feature association.",
  "notes": "",
  "machine_question": {
    "nodes": [
      { "id": "n0", "curie": "PUBCHEM:441335", "name": "Mometasone", "type": "chemical_substance" },
      { "id": "n1", "type": "gene" },
      { "id": "n2", "type": "biological_process_or_activity" },
      { "id": "n3", "type": "phenotypic_feature" }
    ],
    "edges": [
      { "id": "e0", "source_id": "n0", "target_id": "n1" },
      { "id": "e1", "source_id": "n1", "target_id": "n2" },
      { "id": "e2", "source_id": "n2", "target_id": "n3" }
    ]
  }
}
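
Since the same template is meant to be reused for each ICEES chemical or medication, a sketch of filling it in programmatically might look like the following. The function name is hypothetical, and only the structure shown in the template above is assumed; how the resulting question is submitted to Gamma/ROBOKOP is out of scope here.

```python
import json


def build_wf5_question(curie: str, name: str) -> dict:
    """Fill the WF5 template for one chemical: chemical -> gene -> process -> phenotype."""
    node_types = ["gene", "biological_process_or_activity", "phenotypic_feature"]
    # n0 is the bound chemical; n1..n3 are unbound nodes of the given types.
    nodes = [{"id": "n0", "curie": curie, "name": name, "type": "chemical_substance"}]
    nodes += [{"id": f"n{i}", "type": t} for i, t in enumerate(node_types, start=1)]
    # Chain the nodes in order: n0 -> n1 -> n2 -> n3.
    edges = [
        {"id": f"e{i}", "source_id": f"n{i}", "target_id": f"n{i + 1}"}
        for i in range(len(nodes) - 1)
    ]
    return {
        "name": "Gamma WF5 template",
        "natural_question": (
            "Chemical to gene to biological process/activity "
            "to phenotypic feature association."
        ),
        "notes": "",
        "machine_question": {"nodes": nodes, "edges": edges},
    }


question = build_wf5_question("PUBCHEM:441335", "Mometasone")

if __name__ == "__main__":
    print(json.dumps(question, indent=2))
```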

karafecho commented 5 years ago

ROBOKOP queries and RTX queries are being pre-computed for this workflow using all available ICEES chemicals and medications. Example ICEES queries are included below as an FYI:

curl -k -XPOST https://localhost:8080/1.0.0/patient/2010/cohort/COHORT:22/associations_to_all_features -H "Content-Type: application/json" -d '{"feature":{"TotalEDInpatientVisits":{"operator":"<", "value":2}},"maximum_p_value":0.1}' -H "Accept: application/json"

curl -k -XPOST https://localhost:8080/1.0.0/patient/2010/cohort/COHORT:22/associations_to_all_features -H "Content-Type: application/json" -d '{"feature":{"ur":{"operator":"=", "value":"U"}},"maximum_p_value":0.1}' -H "Accept: application/json"

curl -k -XPOST https://localhost:8080/1.0.0/patient/2010/cohort/COHORT:22/associations_to_all_features -H "Content-Type: application/json" -d '{"feature":{"Sex2":{"operator":"=", "value":"Male"}},"maximum_p_value":0.1}' -H "Accept: application/json"
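
For use from a Jupyter notebook, the curl calls above could be expressed in Python roughly as follows. This is a sketch using only the standard library; the function names are hypothetical, the URL path and request body simply mirror the curl examples, and disabling certificate verification mirrors curl's -k flag (reasonable for a localhost test instance only). The response format is not shown in this thread, so the sketch returns the parsed JSON without interpreting it.

```python
import json
import ssl
import urllib.request

BASE_URL = "https://localhost:8080"  # same host as in the curl examples


def build_request(feature, operator, value, maximum_p_value=0.1,
                  version="1.0.0", table="patient", year=2010,
                  cohort_id="COHORT:22"):
    """Build the POST request for associations_to_all_features."""
    url = (f"{BASE_URL}/{version}/{table}/{year}/cohort/{cohort_id}"
           "/associations_to_all_features")
    payload = {
        "feature": {feature: {"operator": operator, "value": value}},
        "maximum_p_value": maximum_p_value,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request without certificate verification (like curl -k)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


# The three stratifications from the curl examples.
requests_to_send = [
    build_request("TotalEDInpatientVisits", "<", 2),
    build_request("ur", "=", "U"),
    build_request("Sex2", "=", "Male"),
]
```

Calling send(requests_to_send[0]) would issue the first query, provided an ICEES instance is actually listening at that address.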

karafecho commented 5 years ago

Green/Gamma's initial plan is to refine end-to-end execution of WF5 using TranQL, with ICEES/COHD/Clinical Profiles executing modules 1-4 (input) and ROBOKOP/RTX/mediKanren executing modules 5-8.

karafecho commented 5 years ago

Mini-hackathon was held on Friday, April 12, 12-4 pm ET. Topic: Unified Translator-compliant Clinical Knowledge Source API. Attendees: Hao Xu, Richard Zhu, Casey Ta, Steve Cox, and Kara Fecho. Event was successful. Team developed a plan of action and is moving forward with execution of the plan. The unified Translator Clinical Knowledge Source API will foster efforts on Workflows 4 and 5, as well as any efforts related to COHD, Clinical Profiles, and ICEES.