greenelab / connectivity-search-backend

Django backend for hetnet connectivity search
https://search-api.het.io
BSD 3-Clause "New" or "Revised" License

Populate PathCount and DGP tables #8

Closed (dhimmel closed this 5 years ago)

dhimmel commented 5 years ago

As well as other enhancements to the db import.

dhimmel commented 5 years ago

Using the code in https://github.com/greenelab/hetmech-backend/pull/8/commits/21b80d79699af7ff0622d677ac511ae73281942a, I created a database with all tables, but only including dgp / path count info for paths of length 1. Here's the stdout:

_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f724eed6860>) ran in 0:00:01
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f724eed6860>) ran in 0:02:04
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:13
_populate_metapath_table() ran in 0:00:01
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:02:01
_populate_path_count_table() ran in 0:57:29
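
Timing lines of the form `name(arg=value) ran in H:MM:SS` can be produced by a small decorator. The following is only a minimal sketch of that idea, not the actual code in populate_database; the names `timed` and `format_timing` are hypothetical.

```python
import datetime
import functools
import inspect
import time


def format_timing(name, arguments, seconds):
    """Render one log line such as 'f(length=1) ran in 0:00:00'."""
    rendered = ", ".join(f"{key}={value!r}" for key, value in arguments.items())
    elapsed = datetime.timedelta(seconds=round(seconds))
    return f"{name}({rendered}) ran in {elapsed}"


def timed(func):
    """Hypothetical decorator: print the call and its wall-clock runtime."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        seconds = time.perf_counter() - start
        # Bind positional and keyword arguments to their parameter names,
        # so that methods also report self=<...> as in the output above.
        bound = signature.bind(*args, **kwargs)
        print(format_timing(func.__name__, bound.arguments, seconds))
        return result

    return wrapper


if __name__ == "__main__":
    @timed
    def _download_path_counts(length):
        # Stand-in body; the real function downloads path count files.
        return length

    _download_path_counts(length=1)
```

Rounding to whole seconds is why very fast steps are reported as 0:00:00.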

dhimmel commented 5 years ago

Database summary as of https://github.com/greenelab/hetmech-backend/pull/8/commits/8a616951b4b9bf6619683f430893fdd95975366f. Created by running python manage.py database_info:

################################ Metanode Table ################################
11 rows

        identifier abbreviation  n_nodes
           Anatomy            A      402
Biological Process           BP    11381
Cellular Component           CC     1391
          Compound            C     1552
           Disease            D      137 

################################## Node Table ##################################
47,031 rows

id metanode_id      identifier identifier_type                       name                                            url                                               data
 1     Anatomy  UBERON:0000002             str             uterine cervix  http://purl.obolibrary.org/obo/UBERON_0000002  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 2     Anatomy  UBERON:0000004             str                       nose  http://purl.obolibrary.org/obo/UBERON_0000004  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 3     Anatomy  UBERON:0000006             str        islet of Langerhans  http://purl.obolibrary.org/obo/UBERON_0000006  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 4     Anatomy  UBERON:0000007             str            pituitary gland  http://purl.obolibrary.org/obo/UBERON_0000007  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 5     Anatomy  UBERON:0000010             str  peripheral nervous system  http://purl.obolibrary.org/obo/UBERON_0000010  {'url': 'http://purl.obolibrary.org/obo/UBERON... 

################################ Metapath Table ################################
2,205 rows

abbreviation                                  name           source_id target_id  length  path_count_density  path_count_mean  path_count_max  dwpc_raw_mean
        AlD             Anatomy–localizes–Disease             Anatomy   Disease       1            0.065403         0.065403               1       0.003746
        AdG            Anatomy–downregulates–Gene             Anatomy      Gene       1            0.012143         0.012143               1       0.000078
        AeG                Anatomy–expresses–Gene             Anatomy      Gene       1            0.062520         0.062520               1       0.000141
        AuG              Anatomy–upregulates–Gene             Anatomy      Gene       1            0.011621         0.011621               1       0.000083
       BPpG  Biological Process–participates–Gene  Biological Process      Gene       1            0.002347         0.002347               1       0.000031 

######################## DegreeGroupedPermutation Table ########################
652,163 rows

id metapath_id  source_degree  target_degree    n_dwpcs  n_nonzero_dwpcs  nonzero_mean  nonzero_sd
 1         AdG              0              0  428073600                0           NaN         NaN
 2         AdG              0              1  115216800                0           NaN         NaN
 3         AdG              0              2  114558000                0           NaN         NaN
 4         AdG              0              3  104676000                0           NaN         NaN
 5         AdG              0              4  103212000                0           NaN         NaN 

############################### PathCount Table ################################
2,067,190 rows

id metapath_id  source_id  target_id  dgp_id  path_count      dwpc   p_value
 1         DuG      14727      14919  381107           1  6.357984       NaN
 2         DuG      14727      15018  381106           1  6.704556  0.016184
 3         DuG      14727      15073  381107           1  6.357984       NaN
 4         DuG      14727      15146  381107           1  6.357984       NaN
 5         DuG      14727      15162  381106           1  6.704556  0.016184 

dhimmel commented 5 years ago

@dongbohu should be ready for review. Still haven't done a full import and not sure how long that'll take. Perhaps several days.

dhimmel commented 5 years ago

The import has been running for ~3 days now.

The current database size is 119 GB, as per du --human-readable --summarize database.

The database_info management command shows:

######################## DegreeGroupedPermutation Table ########################
37,905,389 rows

id metapath_id  source_degree  target_degree    n_dwpcs  n_nonzero_dwpcs  nonzero_mean  nonzero_sd
 1         AdG              0              0  428073600                0           NaN         NaN
 2         AdG              0              1  115216800                0           NaN         NaN
 3         AdG              0              2  114558000                0           NaN         NaN
 4         AdG              0              3  104676000                0           NaN         NaN
 5         AdG              0              4  103212000                0           NaN         NaN 

############################### PathCount Table ################################
325,337,584 rows

id metapath_id  source_id  target_id    dgp_id  path_count      dwpc   p_value
 1     CbGdDpS      13176      46700  21798438           1  2.325159  0.053367
 2     CbGdDpS      13176      46855  21798441           1  2.169464  0.074498
 3     CbGdDpS      13178      46600  21799313           2  3.541697  0.043412
 4     CbGdDpS      13178      46602  21799308           2  4.031587  0.015953
 5     CbGdDpS      13178      46603  21799310           1  3.722144  0.028351 

DegreeGroupedPermutation is complete while PathCount is still being populated. The current populate_database output is:

_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f9c4f183828>) ran in 0:00:01
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f9c4f183828>) ran in 0:02:04
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:13
_populate_metapath_table() ran in 0:00:01
_download_path_counts(length=1) ran in 0:00:01
_populate_degree_grouped_permutation_table(length=1) ran in 0:02:09
_download_path_counts(length=2) ran in 0:03:46
_populate_degree_grouped_permutation_table(length=2) ran in 0:19:23
_download_path_counts(length=3) ran in 7:02:04
_populate_degree_grouped_permutation_table(length=3) ran in 1:40:51

dhimmel commented 5 years ago

Here's a terminal recording showing the database data in action https://asciinema.org/a/5C2ydYST0MqaLWyRTvyGzynTg

dhimmel commented 5 years ago

It's been 17 days since my last post, and populate_database is still running. So it's been running 20 days, which is excessive. Furthermore, I am running out of hard drive space, since the database now takes up 703 GB (sudo du --summarize --human-readable database). Therefore, I'm going to kill the process and think of optimizations and database size reductions.

The largest table is the PathCount table, with over 2.1 billion rows:

############################### PathCount Table ################################
2,124,630,639 rows

id metapath_id  source_id  target_id    dgp_id  path_count      dwpc   p_value
 1     CbGdDpS      13176      46700  21798438           1  2.325159  0.053367
 2     CbGdDpS      13176      46855  21798441           1  2.169464  0.074498
 3     CbGdDpS      13178      46600  21799313           2  3.541697  0.043412
 4     CbGdDpS      13178      46602  21799308           2  4.031587  0.015953
 5     CbGdDpS      13178      46603  21799310           1  3.722144  0.028351

I modified the database_info command to display how many metapaths have been added to the PathCount table:

1,386 completed metapaths of 2,205 total metapaths
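
One way to get such a count is to count the distinct metapaths that have at least one PathCount row. The sketch below illustrates this with the standard-library sqlite3 module as a stand-in; the real backend is a Django app, and the table names `metapath` and `path_count`, the demo rows, and both function names are assumptions for illustration only. Note that a metapath with rows may still be partway through its import, so this can overcount by the one in progress.

```python
import sqlite3


def build_demo_database():
    """Create a tiny in-memory database with hypothetical table names."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE metapath (abbreviation TEXT PRIMARY KEY)")
    connection.execute(
        "CREATE TABLE path_count ("
        "id INTEGER PRIMARY KEY, metapath_id TEXT, "
        "source_id INTEGER, target_id INTEGER)"
    )
    connection.executemany(
        "INSERT INTO metapath VALUES (?)",
        [("AdG",), ("DuG",), ("CbGdDpS",)],
    )
    connection.executemany(
        "INSERT INTO path_count (metapath_id, source_id, target_id) "
        "VALUES (?, ?, ?)",
        [("DuG", 14727, 14919), ("DuG", 14727, 15018), ("CbGdDpS", 13176, 46700)],
    )
    return connection


def metapath_progress(connection):
    """Return a summary line of metapaths that have any path_count rows."""
    (done,) = connection.execute(
        "SELECT COUNT(DISTINCT metapath_id) FROM path_count"
    ).fetchone()
    (total,) = connection.execute("SELECT COUNT(*) FROM metapath").fetchone()
    return f"{done:,} completed metapaths of {total:,} total metapaths"


if __name__ == "__main__":
    print(metapath_progress(build_demo_database()))
```

On a table with billions of rows, a distinct count like this may itself be slow unless metapath_id is indexed.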

So we were a bit over halfway through the metapaths (although the Gene-by-Gene metapaths likely account for most of the space).
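
The figures above allow a rough back-of-the-envelope estimate, sketched below. It makes two loud assumptions that do not hold exactly: that the whole 703 GB belongs to the PathCount table (it also covers the other tables and any indexes, and du's "GB" is treated here as decimal gigabytes), and that every metapath contributes the same number of rows. As noted, the Gene-by-Gene metapaths are likely far larger, so the extrapolated total is probably an underestimate.

```python
# Figures quoted in the comment above.
completed_metapaths = 1_386
total_metapaths = 2_205
path_count_rows = 2_124_630_639
database_size_gb = 703  # assumption: decimal GB, all attributed to PathCount

# Share of metapaths already imported into PathCount.
fraction_done = completed_metapaths / total_metapaths

# Upper bound on storage per PathCount row (includes all other tables).
bytes_per_row = database_size_gb * 1e9 / path_count_rows

# Naive extrapolation that assumes equally sized metapaths.
naive_total_rows = path_count_rows / fraction_done
naive_total_size_gb = database_size_gb / fraction_done

print(f"{fraction_done:.1%} of metapaths completed")
print(f"about {bytes_per_row:.0f} bytes per PathCount row")
print(f"about {naive_total_rows / 1e9:.2f} billion rows if metapaths were equal")
print(f"about {naive_total_size_gb:,.0f} GB under the same assumption")
```

Even under these optimistic assumptions the finished database would be well over a terabyte, which supports stopping the import and reducing the size first.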