legaultmarc / cohort-manager

Utility to manage and explore collections of phenotype data.

Improve/Add automatic detection of variables and tools to build cohorts without configuration files. #13

Closed legaultmarc closed 8 years ago

legaultmarc commented 8 years ago

The current solution, a YAML configuration file that gets parsed to build the database, doesn't scale and would restrict the use of the tool.

It would be best to define a set of commands for the REPL (and/or future GUI implementations) that would facilitate importing data from common formats (e.g. CSV).

These commands should then allow semi-automatic import, inferring data types as well as phenotype structure automatically.
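As a rough illustration of what inferring data types could mean here, the sketch below guesses a variable type for each CSV column using only the standard library. The function names (`infer_variable_type`, `infer_columns`), the `max_factor_levels` cutoff and the heuristics themselves are assumptions made for this example, not the actual cohort_manager.inference implementation.

```python
import csv
import io


def infer_variable_type(values, max_factor_levels=10):
    """Guess a variable type from a column's raw string values.

    Hypothetical heuristic: numeric columns holding only 0/1 are
    "discrete", other numeric columns are "continuous", and text
    columns with a few repeated levels are "factor". Anything else
    (e.g. free text such as names) is left as None for the user.
    """
    observed = [v.strip() for v in values if v.strip() != ""]
    if not observed:
        return None

    try:
        numbers = [float(v) for v in observed]
    except ValueError:
        # Not numeric: call it a factor only if levels repeat.
        levels = set(observed)
        if len(levels) <= max_factor_levels and len(levels) < len(observed):
            return "factor"
        return None

    if set(numbers) <= {0.0, 1.0}:
        return "discrete"
    return "continuous"


def infer_columns(csv_text, delimiter=","):
    """Return one {"name", "variable_type"} dict per CSV column.

    Assumes the first row is the header and rows have equal length.
    """
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return [
        {"name": name, "variable_type": infer_variable_type(column)}
        for name, column in zip(header, columns)
    ]


sample = """Name,Age,Height,Tall,FavoriteWeather
Alice,34,180.5,1,sunny
Bob,28,165.0,0,rainy
Carol,45,172.3,0,sunny
Dave,52,190.1,1,cloudy
"""

for column in infer_columns(sample):
    print(column)
```

The guesses are only a starting point: the user would still review and correct them before anything is written to the database, which is the "semi-automatic" part.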

legaultmarc commented 8 years ago

I started refactoring to use the cohort_manager.inference module for such things (will be pushed soonish). The interface for the REPL could look like this:

> import csv my_file.csv delim=',' header=0
# Found 5 columns, verify the following information, then press enter:
[
    {"name": "Name", "variable_type": None},
    {"name": "Age", "variable_type": "continuous"},
    {"name": "Height", "variable_type": "continuous"},
    {"name": "Tall", "variable_type": "discrete", "parent": "Height"},
    {"name": "FavoriteWeather", "variable_type": "factor"},
]

Users could also add the other meta fields (e.g. {"icd10": ...}). In this example, the parent relationship between "Tall" and "Height" is also inferred correctly. This will be a lot harder in practice.
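One possible heuristic for the parent relationship, sketched under strong assumptions: a 0/1 variable is proposed as a child of a continuous variable when a single threshold on the continuous values separates the two groups perfectly. The function name `find_threshold_parent` and the approach are hypothetical. On small samples a perfect split can easily happen by chance, so the result should only be offered as a suggestion for the user to confirm.

```python
def find_threshold_parent(flags, continuous_columns):
    """Suggest a parent for a 0/1 variable, or return None.

    flags: list of 0/1 values for the discrete variable.
    continuous_columns: dict mapping column name to a list of floats,
        aligned row by row with flags.

    Returns the first column whose values for the 0 group and the
    1 group do not overlap, i.e. some threshold splits them exactly.
    """
    for name, values in continuous_columns.items():
        zeros = [v for v, flag in zip(values, flags) if flag == 0]
        ones = [v for v, flag in zip(values, flags) if flag == 1]
        if not zeros or not ones:
            continue  # only one group observed, nothing to separate
        if max(zeros) < min(ones) or max(ones) < min(zeros):
            return name
    return None


tall = [1, 0, 0, 1]
candidates = {
    "Age": [34.0, 28.0, 45.0, 52.0],
    "Height": [180.5, 165.0, 172.3, 190.1],
}
print(find_threshold_parent(tall, candidates))
```

Missing values, noisy measurements and variables derived from more than one parent would all defeat this check, which is part of why this gets harder in practice.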