Load the data - Githubissues

pjvandehaar commented 8 years ago

Data to import

/net/dumbo/home/larsf/PheWAS/PheWAS_code_translation_v1_2.txt (15k lines):

 "icd9"                            "icd9_string"  "phewas_code"            "phewas_string"  "exclude_range"  "rollup_bool"  "Ignore_bool"   "sex"  "pregnancy_related"  "category"      "category_string"
  "001"                                "Cholera"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"001.0"         "Cholera due to Vibrio cholerae"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"001.1"  "Cholera due to Vibrio cholerae el tor"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"001.9"                            "Cholera NOS"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
  "002"         "Typhoid and paratyphoid fevers"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"002.0"                          "Typhoid fever"        "008.5"      "Bacterial enteritis"     "001-009.99"              1              0  "Both"                    0           1  "infectious diseases"
    ...
"021.1"                      "Enteric tularemia"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"021.2"                    "Pulmonary tularemia"          480.1      "Bacterial pneumonia"     "480-488.99"              1              0  "Both"                    0           9          "respiratory"
"021.3"               "Oculoglandular tularemia"            369     "Infection of the eye"     "369-374.99"              1              0      ""                    0           7         "sense organs"
"021.8"              "Other specified tularemia"          "041"  "Bacterial infection NOS"     "010-041.99"              1              0      ""                    0           1  "infectious diseases"
"021.9"                          "Tularemia NOS"          "041"  "Bacterial infection NOS"     "010-041.99"              1              0      ""                    0           1  "infectious diseases"
...

/net/dumbo/home/larsf/PheWAS/PheWAS_code_v1_2.txt (1488+1 lines):

phewas_code2  phewas_code         phewas_string      category_string  Cases  Controls  NR
         008            8  Intestinal infection  infectious diseases    460     17893   1
       008.5          8.5   Bacterial enteritis  infectious diseases    244     17893   2
...
       008.6          8.6       Viral Enteritis  infectious diseases    114     17893   5
         010           10          Tuberculosis  infectious diseases     39     16924   7
         038           38            Septicemia  infectious diseases    410     16924  10
...

/net/dumbo/home/larsf/PheWAS/MATCHED/PheWAS_${NR}_MATCHED.epacts:

#CHROM     BEGIN       END                  MARKER_ID    NS      AC  CALLRATE       MAF   PVALUE        BETA    SEBETA      CHISQ  NS.CASE  NS.CTRL   AF.CASE  AF.CTRL
     1   1005806   1005806    1:1005806_C/T_1:1005806  5060  1561.2         1   0.15427  0.48329    -0.06802  0.097689    0.49144      460     4600   0.14605   0.1551
     1   1079198   1079198    1:1079198_T/C_1:1079198  5060    2627         1   0.25959  0.93072  -0.0079779  0.091803  0.0075582      460     4600   0.25836  0.25971
     1   1247494   1247494    1:1247494_T/C_1:1247494  5060  8226.7         1   0.18708  0.51949   -0.059428  0.091832     0.4149      460     4600   0.80563  0.81365
     1   1723031   1723031    1:1723031_G/A_1:1723031  5060  5027.6         1    0.4968  0.12474    -0.10694  0.069707     2.3568      460     4600   0.47283  0.49919
     1   2069172   2069172    1:2069172_C/T_1:2069172  5060  2826.2         1   0.27927  0.45051   -0.058743  0.078132    0.56938      460     4600    0.2685  0.28034
     1   2069681   2069681    1:2069681_C/T_1:2069681  5060  626.91         1  0.061948  0.84678    0.027622   0.14258   0.037335      460     4600  0.063029  0.06184
     1   2205581   2205581    1:2205581_C/A_1:2205581  5060  4262.4         1   0.42119   0.4292   -0.058128  0.073623    0.62498      460     4600   0.40942  0.42237
     1   2387101   2387101    1:2387101_C/T_1:2387101  5060  4937.6         1   0.48791  0.40808   -0.057252  0.069225     0.6844      460     4600    0.4749  0.48921
     1   2392648   2392648    1:2392648_G/C_1:2392648  5060    5017         1   0.49575  0.57311   -0.038945  0.069117    0.31751      460     4600   0.48688  0.49663
...
     1  82074852  82074852  1:82074852_G/A_1:82074852  5060       0         1         0       NA          NA        NA         NA      460     4600         0        0
...

What shall we do about those NAs? Not insert them?

Where should we get colors for each category? I can't find them online. We either make them up, or pull them out of Vanderbilt's code.

pjvandehaar commented 8 years ago

psycopg2 looks like the best Python postgres library. It has parameterized queries with support for hstore, json, dates, etc, but no magic. Docs at http://initd.org/psycopg/docs/usage.html.

pjvandehaar commented 8 years ago

Speed

It currently loads 18k variants x 1500 phenos in 10 minutes.

For a while, python has one cpu at 100% while postgres holds one at 50%. Then python quits and postgres sits at 100%. Figure out the relative lengths and the reason for this.

At that rate, 30M variants should take 10 days. Can we parallelize it?

statgen / pheweb

Load the data #1

Data to import

Speed