greenelab / connectivity-search-backend

Django backend for hetnet connectivity search
https://search-api.het.io

Populate databases with fewer metapaths for prototyping #11

Closed by dhimmel 5 years ago

dhimmel commented 5 years ago

With https://github.com/greenelab/hetmech-backend/pull/11/commits/484fd901a12c530ce814f161949a70519494456f, the import into the AWS prototype database ran with the following timings:

_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f554079dda0>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f554079dda0>) ran in 0:02:17
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:14
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
_download_path_counts(length=2) ran in 0:04:07
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:07
_download_path_counts(length=3) ran in 0:52:28
_populate_degree_grouped_permutation_table(length=3) ran in 0:01:00
_populate_path_count_table() ran in 0:23:28

https://github.com/greenelab/hetmech-backend/pull/11/commits/c17c970761b3d729c356ac6635eaf5f4b1f84bc8 should speed this up even more.

At this point, python manage.py database_info prints the row count and the first five rows of each table:

################################ Metanode Table ################################
11 rows

        identifier abbreviation  n_nodes
           Anatomy            A      402
Biological Process           BP    11381
Cellular Component           CC     1391
          Compound            C     1552
           Disease            D      137 

################################## Node Table ##################################
47,031 rows

id metanode_id      identifier identifier_type                       name                                            url                                               data
 1     Anatomy  UBERON:0000002             str             uterine cervix  http://purl.obolibrary.org/obo/UBERON_0000002  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 2     Anatomy  UBERON:0000004             str                       nose  http://purl.obolibrary.org/obo/UBERON_0000004  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 3     Anatomy  UBERON:0000006             str        islet of Langerhans  http://purl.obolibrary.org/obo/UBERON_0000006  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 4     Anatomy  UBERON:0000007             str            pituitary gland  http://purl.obolibrary.org/obo/UBERON_0000007  {'url': 'http://purl.obolibrary.org/obo/UBERON...
 5     Anatomy  UBERON:0000010             str  peripheral nervous system  http://purl.obolibrary.org/obo/UBERON_0000010  {'url': 'http://purl.obolibrary.org/obo/UBERON... 

################################ Metapath Table ################################
127 rows

abbreviation                                           name source_id target_id  length  path_count_density  path_count_mean  path_count_max  dwpc_raw_mean
         CpD                     Compound–palliates–Disease  Compound   Disease       1            0.001834         0.001834               1       0.000417
         CtD                        Compound–treats–Disease  Compound   Disease       1            0.003551         0.003551               1       0.000633
       CrCpD  Compound–resembles–Compound–palliates–Disease  Compound   Disease       2            0.014283         0.023083              12       0.000373
       CrCtD     Compound–resembles–Compound–treats–Disease  Compound   Disease       2            0.013766         0.031295              13       0.000473
       CpDrD   Compound–palliates–Disease–resembles–Disease  Compound   Disease       2            0.007826         0.009867               5       0.000384

######################## DegreeGroupedPermutation Table ########################
267,814 rows

id metapath_id  source_degree  target_degree   n_dwpcs  n_nonzero_dwpcs  nonzero_mean  nonzero_sd
 1         CpD              0              0  23159400                0           NaN         NaN
 2         CpD              0              1   2662000                0           NaN         NaN
 3         CpD              0              2   2129600                0           NaN         NaN
 4         CpD              0              3    798600                0           NaN         NaN
 5         CpD              0              4    798600                0           NaN         NaN 

############################### PathCount Table ################################
806,115 rows

id metapath_id  source_id  target_id  dgp_id  path_count      dwpc   p_value
 1    CuGr>GdD      13175      14739  252716           2  2.785063  0.073267
 2    CuGr>GdD      13178      14729  254315         128  4.499498  0.041682
 3    CuGr>GdD      13178      14739  254318         312  4.817715  0.006500
 4    CuGr>GdD      13178      14773  254316         222  4.627762  0.015198
 5    CuGr>GdD      13178      14782  254318         408  4.965222  0.000510 

127 completed metapaths of 127 total metapaths
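Each table in the output above is introduced by a title centered in an 80-character row of `#` characters, followed by a row count with a thousands separator. A minimal sketch of that layout, assuming nothing about how database_info is actually written (`banner` and `table_summary` are hypothetical helpers):

```python
def banner(title, width=80):
    """Center ' title ' in a line of '#' characters of the given width."""
    return f"{' ' + title + ' ':#^{width}}"


def table_summary(title, n_rows, head_lines, width=80):
    """Build the banner, the row count, and the preview lines for one table."""
    lines = [banner(title, width), f"{n_rows:,} rows", ""]
    lines.extend(head_lines)
    return "\n".join(lines)


print(table_summary("Node Table", 47031, ["id metanode_id      identifier"]))
```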

dhimmel commented 5 years ago

The new method, which specifies source_paths when calling load_archive, speeds up _download_path_counts considerably. The length-2 and length-3 downloads drop from 0:04:07 and 0:52:28 to 0:00:00 each:

_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f439ec572b0>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f439ec572b0>) ran in 0:01:24
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:08
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
_download_path_counts(length=2) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:03
_download_path_counts(length=3) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=3) ran in 0:00:31
_populate_path_count_table() ran in 0:16:27
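The comment credits the speedup to passing source_paths to load_archive. The general idea of extracting only the named members of an archive, rather than every member, can be sketched with the standard library. This assumes the archives are zip files; `extract_selected` is a hypothetical function and not the actual load_archive implementation, which this thread does not show.

```python
import pathlib
import zipfile


def extract_selected(archive_path, destination_dir, source_paths=None):
    """Extract only source_paths from a zip archive (all members if None)."""
    destination_dir = pathlib.Path(destination_dir)
    with zipfile.ZipFile(archive_path) as archive:
        if source_paths is None:
            members = archive.namelist()
        else:
            members = list(source_paths)
        for member in members:
            # Only the requested members are decompressed and written to disk.
            archive.extract(member, path=destination_dir)
    return [destination_dir / member for member in members]
```

When only a few files out of a large archive are needed, skipping the rest avoids decompressing and writing files that are never read.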

Output from the database_info command is the same as above.
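To compare the two runs end to end, the per-step timings can be summed. The sketch below parses the `ran in H:MM:SS` suffix of each log line; `total_runtime` is a hypothetical helper, and the sample strings contain only the nonzero steps reported in this thread.

```python
import datetime
import re

# Matches the trailing 'ran in H:MM:SS' of a timing line.
TIMING = re.compile(r"ran in (\d+):(\d{2}):(\d{2})\s*$")


def total_runtime(log_text):
    """Sum the durations of all timing lines in a block of log text."""
    total = datetime.timedelta()
    for line in log_text.splitlines():
        match = TIMING.search(line)
        if match:
            hours, minutes, seconds = map(int, match.groups())
            total += datetime.timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return total


# Nonzero steps only; steps reported as 0:00:00 are omitted.
FIRST_RUN = """\
_hetionet_graph ran in 0:02:17
_populate_node_table() ran in 0:00:14
_download_path_counts(length=2) ran in 0:04:07
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:07
_download_path_counts(length=3) ran in 0:52:28
_populate_degree_grouped_permutation_table(length=3) ran in 0:01:00
_populate_path_count_table() ran in 0:23:28
"""

SECOND_RUN = """\
_hetionet_graph ran in 0:01:24
_populate_node_table() ran in 0:00:08
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:03
_populate_degree_grouped_permutation_table(length=3) ran in 0:00:31
_populate_path_count_table() ran in 0:16:27
"""

print(total_runtime(FIRST_RUN))   # 1:23:41
print(total_runtime(SECOND_RUN))  # 0:18:33
```

By these totals, _populate_path_count_table accounts for most of the remaining time in the second run.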