qunfengdong / DCN

2 stars 4 forks source link

Disease Correlation Network (DCN) Analyser

Disease Correlation Network (DCN) Analyser is a pipeline of disease correlation analysis with retrospective matched cohort study design using Cox Proportional Hazards (Cox-PH) regression in combination of interactive network display using graph theory. It allows clinicians to explore the relationships between any statistically disease pairs easily by studying the network with customized filtering, rearranging the network and calculating all the possible path between disease pairs.

Prerequisities

Environment setup

If you do not have the above environment, please set it up via the following conda commands:

$ conda create --name DCN python=3.7 r=3.5 r-essentials
$ conda activate DCN

You can change the environment name DCN to anything you want.

After you are done, use the following code to deactivate.

$ conda deactivate

Citation

Lin H, Rong R, Gao X, Revanna K, Zhao M, Bajic P, Jin D, Hu C, Dong Q. Disease correlation network: a computational package for identifying temporal correlations between disease states from Large-Scale longitudinal medical records. JAMIA Open. 2019 Aug 23;2(3):353-359. doi: 10.1093/jamiaopen/ooz031. PMID: 31984368; PMCID: PMC6952009.

Live interactive network display for the paper: http://cbi.lumc.edu/disease/.

Install

To check out the source code, go to https://github.com/qunfengdong/DCN. To obtain the scripts and example DCN files, do the following:

$ git clone https://github.com/qunfengdong/DCN.git
$ cd DCN

After the github repository is cloned, you will find a folder named DCN. All the scripts and example data files will be included in it.

Test dataset

$ tar zxvf test.tar.gz
  1. View the example input file: cdc_first_test.tsv
  2. View the example meta file: disease_code_test.tsv
  3. View the time table: \<DiseaseA>_\<DiseaseB>.csv
  4. View the survival curves of significant disease pairs based on Cox-PH regression: \<DiseaseA>_\<DiseaseB>.cox.survival.png and residual plot: \<DiseaseA>_\<DiseaseB>.cox.residual.png
  5. Click on the index.html to view the result from example data.
  6. View the survival curves of significant disease pairs based on Random Forest survival analysis: \<DiseaseA>_\<DiseaseB>.rf.survival.png

Running test

You will need to run the following codes inside the DCN folder to complete the test:

Then just click-open the index.html inside web_server, you will see the network display. You can compare your results under the example folder with files in test folder.

NOTE: Files: CoxPH.edge.csv, RF.edge.csv, all.edges.csv.js, and all PNG files should be in the same folder as the index.html.

Quick start

  1. Find all the exposed disease pairs;
  2. Find all matched non-exposed disease pairs;
  3. Perform Cox-PH regression on the two cohorts;
  4. Perform Random Forest survival analysis on the two cohorts;
  5. Visualize the results

Required input files

P00001  62      2012-01-16      66      F       White
P00002  509     2009-01-12      68      M       African American
P00002  23      2014-11-07      73      M       African American
P00003  23      2015-07-20      78      M       White
P00007  44      2014-04-28      51      M       White
P00008  509     2010-07-12      54      M       White
P00015  509     2008-04-21      60      M       White
P00016  63      2008-03-16      83      F       White
P00017  23      2014-11-13      48      F       White
P00018  509     2011-07-03      76      M       African American
P00018  14      2012-09-20      77      M       African American
P00018  23      2014-01-09      79      M       African American
P00019  23      2015-01-09      89      M       White
P00020  14      2012-05-25      65      F       African American
id_col name_long
14 Prostate Cancer
23 Glaucoma
29 Obesity
509 Med:Progestins
43 Acute Myeloid Leukemia
44 Vitiligo
59 Chalazion
62 Cirrhosis
63 Psoriasis
67 Sarcoidosis

Getting started

NOTE: You will need to enter your own file paths in the code below.

Step 1: a0_generator.r

Find all the exposed and matching non-exposed disease pairs.

$ Rscript a0_generator.r -i /path/to/test/cdc_first_test.tsv -m /path/to/test/disease_code_test.tsv --outfolder /path/to/outputfolder

More options available:

$ Rscript a0_generator.r -h
data.table doParallel    foreach   optparse 
      TRUE       TRUE       TRUE       TRUE 
Usage: Rscript a0_generator.r -i <inputFileName> -m <metaFileName> [options]

         >> This the step 1 of EMR package. << 
    This R script will find the exposure population for all diesease pairs appearing in the input file that satisfy disease A -> disease B, and and second, it will find the matching non-exposed population for all diesease pairs with some criteria, such as the matching patient should have the same gender and race as a good match control. Outputs are CSV files with paired exposed and non-exposed subjects information. One folder named T0_include with CSV files with time-to-event = 0 could be produced if you turn on the --saveT0 flag. Those files will be used as inputs for step 2 of the package.

Options:
    -i CHARACTER, --infile=CHARACTER
        Input file name [required]. 
        Should be tab delimited with 6 columns in the order of patient_ID, Disease, Disease_Date ([%Y-%m-%d] or [%Y/%m/%d]), Age, Gender, Race. No header is needed. You could provide more information about the patient (additional columns), but only the measurement at disease A will be recorded

    -m CHARACTER, --metafile=CHARACTER
        Meta file for each disease name [required]. 
        First column is the disease ID used in the input file column Disease, second column is the disease name. All the other columns should be additional attribute of the disease, tab separated. Header is needed

    --outfolder=CHARACTER
        Intermediate files output folder, default is current directory

    --maxn=INTEGER
        Maximum number of subjects to be qualified as valid exposed cohort, default is 10000. If the total number of patients exceed this number, a subset of 5000 subjects will be randomly selected.

    --minn=INTEGER
        Minimum number of subjects to be qualified as valid exposed cohort, default is 100

    -c INTEGER, --minCaseNumber=INTEGER
        Minimum number of A->B cases to keep disease pair, default is 20

    --duration=INTEGER
        study duration (years), default is 5 years

    -y INTEGER, --year=INTEGER
        Age difference between exposed and non-exposed, default is 5 years

    -d INTEGER, --days=INTEGER
        Visit date difference between exposed and non-exposed, unit: days, default is 7 days

    --saveT0
        A flag to save the cohort tables with time-to-event = 0. The output CSVs will be saved to T0_include folder, default is false

    -p INTEGER, --processors=INTEGER
        Number of cores/CPUs to use, default is 8

    -h, --help
        Show this help message and exit
ID type time event V3 V4 V5 V6
P41812 0 350 0 9/4/11 68 F White
P39443 0 770 0 11/7/10 52 M White
P41112 0 0 0 3/4/10 70 M White
P37186 0 0 0 12/24/16 67 F White
P28958 0 146 0 10/5/14 59 F White
P15618 0 1004 0 6/12/08 67 M White
P14117 0 0 0 9/12/08 78 F White
P24383 0 125 0 9/1/08 20 F White
P00008 1 0 0 9/4/15 60 M White
P00010 1 0 0 9/18/11 79 M White
P00018 1 0 0 1/9/14 79 M African American
P00028 1 805 0 6/5/10 55 M White
P00040 1 222 0 1/4/10 85 F White

Step 2: a1_analyzer.r

Cox-PH regression

$ Rscript a1_analyzer.r -m /path/to/disease_code_test.tsv --inputfolder /path/to/all_csv_files/ --outputfolder /path/to/all_csv_files/outputs
from to coef exp_coef se_coef z Pr N HRtest-p Pr.BH from_name_long to_name_long
14 23 1.337 3.809 0.137 9.786 <0.001 552 0.687 <0.001 Prostate Cancer Glaucoma
23 14 1.056 2.875 0.138 7.631 <0.001 1568 0.942 <0.001 Glaucoma Prostate Cancer
23 29 1.691 5.425 0.485 3.488 <0.001 1568 0.976 0.002 Glaucoma Obesity
23 43 0.615 1.850 0.183 3.358 0.001 1568 0.677 0.002 Glaucoma Acute Myeloid Leukemia
23 44 0.901 2.463 0.265 3.404 0.001 1568 0.506 0.002 Glaucoma Vitiligo
23 509 1.651 5.214 0.173 9.561 <0.001 1568 0.279 <0.001 Glaucoma Med:Progestins
23 63 0.769 2.158 0.239 3.212 0.001 1568 0.187 0.004 Glaucoma Psoriasis
29 23 1.069 2.914 0.201 5.319 <0.001 244 0.212 <0.001 Obesity Glaucoma
43 23 0.682 1.977 0.158 4.323 <0.001 350 0.069 <0.001 Acute Myeloid Leukemia Glaucoma
44 509 0.550 1.732 0.248 2.213 0.027 424 0.473 0.057 Vitiligo Med:Progestins
509 23 0.529 1.697 0.102 5.179 <0.001 1088 0.302 <0.001 Med:Progestins Glaucoma

example of CoxPH survival curve

In this test example, it shows the probability of getting Glaucoma across time between the population with and without Prostate Cancer. The strata =0 indicates the population without Prostate Cancer, while strata =1 indicates with Prostate Cancer. It suggests that people with Prostate Cancer develop Glaucoma faster than those who do not have Prostate Cancer.

example of CoxPH residual curve

For the same Prostate Cancer to Glaucoma disease pair, the residuals lie closely to the diagonal line, meaning the data has a relative good fit for Cox-PH model.

example of dropoff hist

It examines the drop-off subjects distribution for the two populations (having Prostate Cancer and not), and test whether they are similar. Given the P-value = 0.027, which is < 0.05, suggesting that those two distributions are statistically not similar to each other. Users should use caution when considering it to be a meaningful disease correlation.

Random Forest survival analysis

Rscript a1_analyzer.r --method RF -m /path/to/disease_code_test.tsv --inputfolder /path/to/all_csv_files/ --outputfolder /path/to/all_csv_files/outputs
from to Pr_wil Pr_t NonExposedMean exposedMean Pr_wil.BH Pr_t.BH from_name_long to_name_long
14 23 <0.001 <0.001 1501.946 982.018 <0.001 <0.001 Prostate Cancer Glaucoma
14 509 0.968 0.810 1482.072 1493.435 1.000 0.864 Prostate Cancer Med:Progestins
23 14 <0.001 <0.001 1665.489 1466.806 <0.001 <0.001 Glaucoma Prostate Cancer
23 29 <0.001 <0.001 1712.080 1681.140 <0.001 <0.001 Glaucoma Obesity
23 43 <0.001 <0.001 1661.658 1587.812 <0.001 <0.001 Glaucoma Acute Myeloid Leukemia
23 44 <0.001 0.001 1724.847 1680.445 <0.001 0.002 Glaucoma Vitiligo
23 509 <0.001 <0.001 1759.704 1557.293 <0.001 <0.001 Glaucoma Med:Progestins
23 59 0.001 0.581 1751.787 1752.578 0.002 0.661 Glaucoma Chalazion
23 62 0.990 0.893 1694.242 1702.241 1.000 0.920 Glaucoma Cirrhosis
23 63 <0.001 <0.001 1653.121 1607.370 <0.001 <0.001 Glaucoma Psoriasis
23 67 0.172 0.467 1778.002 1771.577 0.254 0.615 Glaucoma Sarcoidosis

example of RF survival curve

$ Rscript a1_analyzer.r 

 randomForestSRC 2.5.0 

 Type rfsrc.news() to see new features, changes, and bug fixes. 

       survival randomForestSRC      data.table      doParallel         foreach 
           TRUE            TRUE            TRUE            TRUE            TRUE 
         OIsurv        optparse         ggplot2       ggfortify       gridExtra 
           TRUE            TRUE            TRUE            TRUE            TRUE 
Usage: Rscript a1_analyzer.r -i <inputFileName> -m <metaFileName> [options]

         >> This is the step 2 of EMR package. << 
    This R script will perform either Cox-PH regression or random forest survival analysis on the valid files in result folder from step 1. 
        Outputs:
         > CoxPH: an adjacency matrix with all the significant hazard ratios, hazard ratio test p-values, and survival curve and residual graphs.
         > RF: Wilcoxon and T test p-values, mean time to event in exposed and non-exposed group, and survival curve graphs. 
        *Multiple test correction will be applied to the p-values in both methods.

Options:
    --inputfolder=CHARACTER
        Input folder name. 
        Should be a folder that contains all disease trajectories. Each file is named as NameA_NameB.csv. Default is current directory

    -m CHARACTER, --metafile=CHARACTER
        Meta file for each disease name [required]. 
        First column is the disease ID used in the input file column Disease, second column is the disease name. All the other columns should be additional attribute of the disease, tab separated. Header is needed

    --outputfolder=CHARACTER
        Survival analysis / random forest PNG output folder, default is the <inputfolder>

    --method=CHARACTER
        Survival Analysis method, choises are CoxPH, RF. The default is CoxPH

    -o CHARACTER, --outfile=CHARACTER
        Adjacency matrix output file name, default is <method>.edges.csv

    --duration=INTEGER
        study duration (years), default is 5 years

    -s DOUBLE, --sigcut=DOUBLE
        Significant P-value cutoff. Only p-values that are less than this cutoff will be considered as significant P-values, and kept in the adjacency matrix. Default is 0.05

    -e DOUBLE, --exp_coef=DOUBLE
        Exponentiated coefficients cutoff used in plotting survival curve. Default is 1.0

    -a CHARACTER, --adjust=CHARACTER
        Method of adjusting p-values, choices are holm, hochberg, hommel, bonferroni, BY, and fdr. The default is BH

    -n INTEGER, --ntree=INTEGER
        Number of trees to run random forest, default is 1000

    --pair
        A flag to control for matched subject in CoxPH. Default is false

    -p INTEGER, --processors=INTEGER
        Number of cores/CPUs to use, default is 8

    -h, --help
        Show this help message and exit

Step 3: a2_parse.py

$ python a2_parse.py /path/to/CoxPH.edge.csv

Screenshot of network disply

Step 4: Copy results to web server directory

cp /path/to/all_csv_files/outputs/* web_server

Version

Authors

Error report

Please report any errors or bugs to qunfengd@gmail.com.

It seems that "randomForestSRC 3.2.2" may cause problems for the following step:

Rscript a1_analyzer.r --method RF -m ./test/disease_code_test.tsv --inputfolder ./example --outputfolder ./example/output

Either try an older version of randomForestSRC, or just skip this step and analyze the results just based on the default Cox-PH regression.

License

GNU