bdolly commented 5 years ago

Expand upon the previous discovery work for the Data Tracker to further define business goals and needs for each of the personas identified in the design sprint.

Refine the problem statement for this sprint to gain alignment on the goals and scope of the work to be completed. This is essential because the file upload process is the first phase of a larger process, so the scope needs to be defined early.

bdolly commented 5 years ago

Sketches generated from the discussion of persona business needs and goals.

Business Goals (per persona)

QC Manager (Coordinator)

Main Goal: Data Accountability
Transparency
- Ensure data quality through a process of accounting for known entity values against expected values
- Identify the origin of data quality discrepancies
Efficiency
- Decrease the time it takes to locate all data resources for a given study
- Decrease the amount of time it takes to generate a data quality report
Communication
- Succinctly communicate data discrepancy origins to relevant parties
Organization
- Add semantic meaning to structure of all source data for a study
- Capture the provenance of the changes of study source data over time

X01 Investigator

Main Goal: Data Analysis
To ensure the accuracy and quality of all raw study source data by providing sufficient information on study design and data context needed to accurately harmonize given source data, in order to expedite the ingestion of my study's data into the DRC harmonization pipeline.

bdolly commented 5 years ago

First Draft of problem statement

bdolly commented 5 years ago

Second draft of the problem statement, refined to be more concise and scoped

Problem Statement

The current state of the DRC study data ingestion process has focused primarily on accommodating the non-semantic structure and disorganization of provided "raw" phenotypic and clinical study data files.

What the existing DRC data ingestion process fails to do is accommodate users with a standardized and constrained (version-controlled) process for submitting "raw" phenotypic and clinical study data files. This lack of standardization and meaningful organization renders the harmonization process incapable of efficiently accounting for discrepancies between the derived current state and the expected state of a study based on study files provided by X01s.

The study onboarding/registration process will address the study data accounting inefficiencies by providing a standardized study file submission mechanism connected to centralized file storage, providing a single point of truth for unprocessed study files. The mechanism will place the necessary constraints on the structure and inter-relationships of file submissions to ensure that all provided study files meet a minimum set of criteria before being considered for ingestion into the DRC harmonization pipeline. This set of constraints will ensure the quality and accuracy of pre-processed study files by facilitating the semantic organization and validation of user-submitted study data files.

Our initial focus will be on an interface that allows for the validation, annotation, and submission of files containing raw phenotypic and clinical data relating to a specific study batch. The validation step will only take into consideration the mapping of tabular data headers (column names) to the set of DRC required headers and NOT the sterility/correctness of underlying values in those columns. As part of the validation criteria users will be required to give further annotation to their motivations/reasons behind their chosen mapping schema. Upon validation approval, the user can then submit and track their file as it progresses through the stages of the DRC harmonization pipeline.
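To make the scope of the validation step concrete, here is a minimal sketch of a header-only check. It is an illustration under assumptions, not the planned implementation: `REQUIRED_HEADERS`, `read_headers`, and `validate_mapping` are hypothetical names, and the required header values are placeholders because the actual DRC list is not given in this thread. The check reads only the first row of a file and never inspects cell values.

```python
import csv
import io

# Hypothetical placeholder list; the real DRC required headers are not defined here.
REQUIRED_HEADERS = {"participant_id", "specimen_id", "diagnosis"}


def read_headers(csv_text):
    """Return the column names from the first row of a CSV document."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def validate_mapping(file_headers, mapping, required=REQUIRED_HEADERS):
    """Check a user-supplied {file header: DRC header} mapping.

    Only column names are compared; the values in those columns are ignored.
    """
    # Mapped source columns that do not exist in the uploaded file.
    unknown_sources = sorted(h for h in mapping if h not in file_headers)
    # Required DRC headers that no file column has been mapped to.
    missing_required = sorted(set(required) - set(mapping.values()))
    return {
        "valid": not (unknown_sources or missing_required),
        "unknown_sources": unknown_sources,
        "missing_required": missing_required,
    }


if __name__ == "__main__":
    sample = "Subject ID,Sample,Dx\nP1,S1,asthma\n"
    headers = read_headers(sample)
    result = validate_mapping(
        headers, {"Subject ID": "participant_id", "Sample": "specimen_id"}
    )
    print(result)
```

In this example the `Dx` column is left unmapped, so the result is not valid and `diagnosis` is reported under `missing_required`. Returning lists of problems rather than a bare pass/fail is a design choice meant to support the Communication goal above; the real interface may report discrepancies differently.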

bdolly commented 5 years ago

@baileyckelly please review and approve the problem statement ☝️ above to confirm alignment with the provided requirements doc https://docs.google.com/document/d/1NsHjwNJo_W8_YChGpEvIc6IYEJGZ-rMfLdJgjIZUZjc/edit?usp=sharing

allisonheath commented 5 years ago

Reworking it just a bit:

The current state of the DRC study data ingestion has relied on mostly ad hoc processes for turning study data files provided by investigators and sequencing centers into harmonized clinical/phenotypic data that is correctly linked to genomic data files. The existing process is thus a point of friction in getting from data to information, because it fails to give users a standardized framework that allows iterative processes for providing “raw” study data files. This friction renders the harmonization process incapable of efficiently accounting for discrepancies between the derived current state and the expected state of a study based on study files provided by investigators.

The study onboarding/registration process will address the study data accounting inefficiencies by providing a standardized study file submission mechanism connected to centralized file storage, providing a single point of truth for unprocessed study files. Using this mechanism, appropriate checks will be performed that evaluate a minimum set of criteria from the submitted files before further ingestion into the DRC. This set of constraints will ensure a baseline of semantic organization and validation of user-submitted study data files, thus reducing the initial friction of data ingestion and allowing more effective use of deeper curation efforts as needed.

Our initial focus will be on an interface that allows for the validation, annotation, and submission of files containing raw phenotypic and clinical data relating to a specific study batch. The validation step will only take into consideration the mapping of tabular data headers (column names) to the set of DRC defined fields and NOT the sterility/correctness of underlying values in those columns. Upon validation approval, users can then submit and track their study files as they progress through the stages of ingestion into the DRC.

The one sentence I left out for now because I wasn't quite sure what it was trying to get at is: "As part of the validation criteria users will be required to give further annotation to their motivations/reasons behind their chosen mapping schema." - maybe it should be included, but need a bit more explanation of what that really means?

kids-first / kf-ui-data-tracker

Study File Upload: Define Business Goals per persona & Refine problem statement #2
