greenelab / connectivity-search-backend

Django backend for hetnet connectivity search
https://search-api.het.io
BSD 3-Clause "New" or "Revised" License

Repopulate reduced metapaths with improved hetmatpy pipeline #18

Closed dhimmel closed 5 years ago

dhimmel commented 5 years ago

I am repopulating the database with the following commands in the updated environment, after flushing the database as described in https://github.com/greenelab/hetmech-backend/issues/16#issuecomment-469055603:

python manage.py makemigrations
python manage.py migrate
python manage.py populate_database --max-metapath-length=3 --reduced-metapaths --batch-size=12000

dhimmel commented 5 years ago

Using https://github.com/hetio/hetmatpy/commit/b4f82acbbbd66c90fc17da98e06588b3b10daecd, I received the following error:

_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f6a4f9f3e10>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f6a4f9f3e10>) ran in 0:01:11
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:06
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 316, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 353, in execute
    output = self.handle(*args, **options)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 295, in handle
    timed(self._populate_degree_grouped_permutation_table)(length)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/utils.py", line 15, in wrapper
    result = func(*args, **kwargs)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 192, in _populate_degree_grouped_permutation_table
    dgp_df = hetmatpy.pipeline.add_gamma_hurdle_to_dgp_df(dgp_df)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 52, in add_gamma_hurdle_to_dgp_df
    dgp_df['sd_nz'] = dgp_df[['sum_of_squares', 'sum', 'nnz']].apply(lambda row: calculate_sd(*row), raw=True, axis=1)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/frame.py", line 6487, in apply
    return op.get_result()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/apply.py", line 151, in get_result
    return self.apply_standard()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/apply.py", line 257, in apply_standard
    self.apply_series_generator()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/apply.py", line 286, in apply_series_generator
    results[i] = self.f(v)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 52, in <lambda>
    dgp_df['sd_nz'] = dgp_df[['sum_of_squares', 'sum', 'nnz']].apply(lambda row: calculate_sd(*row), raw=True, axis=1)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 22, in calculate_sd
    squared_deviations = sum_of_squares - unsquared_sum ** 2 / number_nonzero
ZeroDivisionError: ('float division by zero', 'occurred at index 0')

CC @ben-heil: looks like we're not accounting for the possibility of number_nonzero=0 in calculate_sd.

Update: fixed this error in https://github.com/hetio/hetmatpy/pull/8 / https://github.com/hetio/hetmatpy/commit/96f87f78afa6fa271c20245fb025dbe92dc84d37
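
To illustrate the failure mode, here is a minimal sketch of a guard for the zero-count case. It is not the code from the linked pull request: the function name `calculate_sd_guarded`, the choice to return NaN when fewer than two nonzero values exist, and the n - 1 (sample) denominator are all assumptions made for illustration. Only the `squared_deviations` expression is taken from the traceback above.

```python
import math


def calculate_sd_guarded(sum_of_squares, unsquared_sum, number_nonzero):
    """Standard deviation from running sums, returning NaN instead of raising.

    Hypothetical variant of calculate_sd; the NaN fallback and the
    (n - 1) denominator are assumptions, not hetmatpy's confirmed behavior.
    """
    # With zero nonzero values the division below would raise
    # ZeroDivisionError; with one value, the (n - 1) denominator would be zero.
    if number_nonzero < 2:
        return float("nan")
    # Same expression as in the traceback.
    squared_deviations = sum_of_squares - unsquared_sum ** 2 / number_nonzero
    # Floating-point round-off can leave a tiny negative number here.
    squared_deviations = max(squared_deviations, 0.0)
    return math.sqrt(squared_deviations / (number_nonzero - 1))


# Values 1, 2, 3: sum = 6, sum of squares = 14, three nonzero entries.
sd_of_one_two_three = calculate_sd_guarded(14.0, 6.0, 3)
# A degree group with no nonzero DWPCs, as in the failing row.
sd_of_empty_group = calculate_sd_guarded(0.0, 0.0, 0)
```

Returning NaN keeps the row in the table and defers the decision about how to treat it to the p-value step, which then needs its own NaN check.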

dhimmel commented 5 years ago

Using the code from https://github.com/greenelab/hetmech-backend/pull/18/commits/23333adddc4e431c1eaec176bda7a2608ca5e8b1, the database import progressed further but hit another snag:

_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f3c51956e80>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f3c51956e80>) ran in 0:01:10
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:05
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
_download_path_counts(length=2) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:04
_download_path_counts(length=3) ran in 0:00:02
_populate_degree_grouped_permutation_table(length=3) ran in 0:00:37
Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 316, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 353, in execute
    output = self.handle(*args, **options)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 296, in handle
    timed(self._populate_path_count_table)()
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/utils.py", line 15, in wrapper
    result = func(*args, **kwargs)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 252, in _populate_path_count_table
    for row in rows:
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 127, in combine_dwpc_dgp
    row['p_value'] = calculate_p_value(row)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 107, in calculate_p_value
    return calculate_empirical_p_value(row)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 94, in calculate_empirical_p_value
    return row['nnz'] / row['n_dwpcs']
KeyError: 'n_dwpcs'

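The KeyError is raised in calculate_empirical_p_value, at line 94 of hetmatpy/pipeline.py in the traceback above: return row['nnz'] / row['n_dwpcs'].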

ben-heil commented 5 years ago

'n' is called 'n_dwpcs' in the database, so I assumed it would have the same name in the row too. Sorry!
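
The naming mismatch can be reproduced in a few lines. This is a sketch only: `calculate_empirical_p_value` copies the expression from the traceback, while `with_db_column_names` is a hypothetical helper and the row values are made up. The actual fix may simply read `row['n']` instead.

```python
def calculate_empirical_p_value(row):
    # Same expression as line 94 of hetmatpy/pipeline.py in the traceback.
    return row['nnz'] / row['n_dwpcs']


def with_db_column_names(row):
    """Hypothetical helper: copy the row, renaming 'n' to 'n_dwpcs'."""
    renamed = dict(row)
    if 'n' in renamed and 'n_dwpcs' not in renamed:
        renamed['n_dwpcs'] = renamed.pop('n')
    return renamed


# Made-up row that uses the in-memory key 'n' rather than the database name.
row = {'nnz': 5, 'n': 100}

missing_key = None
try:
    calculate_empirical_p_value(row)
except KeyError as error:
    # The lookup fails because the row has 'n', not 'n_dwpcs'.
    missing_key = error.args[0]

p_value = calculate_empirical_p_value(with_db_column_names(row))
```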

ben-heil commented 5 years ago

We noticed something unusual: the n_dwpcs KeyError doesn't occur until the database starts populating length 3 metapaths, but we'd expect the calculate_empirical_p_value function to be called almost every time on the length 1 paths.

dhimmel commented 5 years ago

> We noticed something unusual, the n_dwpcs KeyError doesn't occur until the database starts populating length 3 metapaths

Upon closer inspection, this was my misunderstanding. The output shows that the populate command first completes _populate_degree_grouped_permutation_table for all metapath lengths up to 3, and only then fails in _populate_path_count_table on the length 1 metapaths. Great!
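
The ordering can be reproduced with a small sketch. The `timed` wrapper below is a guess at the behavior implied by the log lines (function name, bound arguments, elapsed time); it is not the code in dj_hetmech_app/utils.py. `populate_dgp` and `populate_path_counts` are hypothetical stand-ins for the two populate steps.

```python
import datetime
import functools
import inspect
import time


def timed(func):
    """Print 'name(arg=value) ran in H:MM:SS' after each call (assumed format)."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        seconds = round(time.perf_counter() - start)
        # Bind positional arguments to parameter names so that a call like
        # func(1) is reported as func(length=1), as in the log output.
        bound = signature.bind(*args, **kwargs)
        arguments = ", ".join(
            f"{name}={value!r}" for name, value in bound.arguments.items()
        )
        elapsed = datetime.timedelta(seconds=seconds)
        print(f"{func.__name__}({arguments}) ran in {elapsed}")
        return result

    return wrapper


calls = []


def populate_dgp(length):
    # Stand-in for the per-length degree-grouped permutation step.
    calls.append(("dgp", length))


def populate_path_counts():
    # Stand-in for the single path count step that runs last.
    calls.append(("path_count", None))


max_metapath_length = 3
for length in range(1, max_metapath_length + 1):
    timed(populate_dgp)(length)
# Runs once, after every length has been processed, so a failure here
# appears in the log only after the length=3 line.
timed(populate_path_counts)()
```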

dhimmel commented 5 years ago

Database repopulation succeeded with the following output:

>>> python manage.py populate_database --max-metapath-length=3  --reduced-metapaths --batch-size=12000
_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7fca93b05400>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7fca93b05400>) ran in 0:01:13
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:06
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
_download_path_counts(length=2) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:04
_download_path_counts(length=3) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=3) ran in 0:00:39
_populate_path_count_table() ran in 0:19:49
Database info

```
################################ Metanode Table ################################
11 rows
identifier          abbreviation  n_nodes
Anatomy             A             402
Biological Process  BP            11381
Cellular Component  CC            1391
Compound            C             1552
Disease             D             137

################################## Node Table ##################################
47,031 rows
id  metanode_id  identifier      identifier_type  name                       url                                            data
1   Anatomy      UBERON:0000002  str              uterine cervix             http://purl.obolibrary.org/obo/UBERON_0000002  {'url': 'http://purl.obolibrary.org/obo/UBERON...
2   Anatomy      UBERON:0000004  str              nose                       http://purl.obolibrary.org/obo/UBERON_0000004  {'url': 'http://purl.obolibrary.org/obo/UBERON...
3   Anatomy      UBERON:0000006  str              islet of Langerhans        http://purl.obolibrary.org/obo/UBERON_0000006  {'url': 'http://purl.obolibrary.org/obo/UBERON...
4   Anatomy      UBERON:0000007  str              pituitary gland            http://purl.obolibrary.org/obo/UBERON_0000007  {'url': 'http://purl.obolibrary.org/obo/UBERON...
5   Anatomy      UBERON:0000010  str              peripheral nervous system  http://purl.obolibrary.org/obo/UBERON_0000010  {'url': 'http://purl.obolibrary.org/obo/UBERON...

################################ Metapath Table ################################
127 rows
abbreviation  name                                           source_id  target_id  length  path_count_density  path_count_mean  path_count_max  dwpc_raw_mean
CpD           Compound–palliates–Disease                     Compound   Disease    1       0.001834            0.001834         1               0.000417
CtD           Compound–treats–Disease                        Compound   Disease    1       0.003551            0.003551         1               0.000633
CrCpD         Compound–resembles–Compound–palliates–Disease  Compound   Disease    2       0.014283            0.023083         12              0.000373
CrCtD         Compound–resembles–Compound–treats–Disease     Compound   Disease    2       0.013766            0.031295         13              0.000473
CpDrD         Compound–palliates–Disease–resembles–Disease   Compound   Disease    2       0.007826            0.009867         5               0.000384

######################## DegreeGroupedPermutation Table ########################
267,814 rows
id  metapath_id  source_degree  target_degree  n_dwpcs   n_nonzero_dwpcs  nonzero_mean  nonzero_sd
1   CpD          0              0              23159400  0                NaN           NaN
2   CpD          0              1              2662000   0                NaN           NaN
3   CpD          0              2              2129600   0                NaN           NaN
4   CpD          0              3              798600    0                NaN           NaN
5   CpD          0              4              798600    0                NaN           NaN

############################### PathCount Table ################################
805,763 rows
id  metapath_id  source_id  target_id  dgp_id  path_count  dwpc      p_value
1   CuGr>GdD     13175      14739      252716  2           2.785063  0.073267
2   CuGr>GdD     13178      14729      254315  128         4.499498  0.041682
3   CuGr>GdD     13178      14739      254318  312         4.817715  0.006500
4   CuGr>GdD     13178      14773      254316  222         4.627762  0.015198
5   CuGr>GdD     13178      14782      254318  408         4.965222  0.000510

127 completed metapaths of 127 total metapaths
```

dhimmel commented 5 years ago

@ben-heil I'll merge this as the import succeeded. Let's get a pathology report to see if there are any red flags still in our dgp and path count tables.
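
One way such a report could start is by counting suspicious p-values. The sketch below is an assumption about what a "pathology report" might contain, not an existing tool in this repository: `pathology_report` is a hypothetical function, the first two rows reuse values from the PathCount sample output earlier in this thread, and the third row is made up to show a NaN being caught.

```python
import math


def pathology_report(rows, p_value_key="p_value"):
    """Count NaN and out-of-range p-values in an iterable of row dicts."""
    report = {"total": 0, "nan": 0, "out_of_range": 0}
    for row in rows:
        report["total"] += 1
        p_value = row[p_value_key]
        if math.isnan(p_value):
            report["nan"] += 1
        elif not 0.0 <= p_value <= 1.0:
            # A probability outside [0, 1] indicates a calculation bug.
            report["out_of_range"] += 1
    return report


rows = [
    # Taken from the PathCount sample output in this thread.
    {"id": 1, "metapath_id": "CuGr>GdD", "p_value": 0.073267},
    {"id": 2, "metapath_id": "CuGr>GdD", "p_value": 0.041682},
    # Made-up row, included only to exercise the NaN branch.
    {"id": -1, "metapath_id": "made-up", "p_value": float("nan")},
]
report = pathology_report(rows)
```

The same counts could be computed against the live tables with a database query; plain dictionaries are used here only to keep the sketch self-contained.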

ben-heil commented 5 years ago

We currently have two p_values that are getting set to NaN. I suspect this is because there isn't a condition that checks for NaN standard deviations in hetmatpy/pipeline.py. I'll check on it and make a PR as needed.
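
A guard of the kind suspected to be missing might look like the sketch below. This is an illustration under stated assumptions, not the change made in hetmatpy: `calculate_p_value_guarded` and `stub_parametric_p_value` are hypothetical names, the fallback to the empirical estimate is one possible policy, and the row keys ('nnz', 'n', 'sd_nz') follow the names that appear in this thread. The row values are made up.

```python
import math


def calculate_empirical_p_value(row):
    # Ratio of nonzero permuted DWPCs to all permuted DWPCs; the key 'n'
    # follows the comment above that the row uses 'n' rather than 'n_dwpcs'.
    return row["nnz"] / row["n"]


def calculate_p_value_guarded(row, parametric_p_value):
    """Use the model-based estimate only when the nonzero SD is usable.

    parametric_p_value is a caller-supplied function (hypothetical here)
    standing in for the model-based calculation.
    """
    sd_nz = row["sd_nz"]
    # A NaN or zero standard deviation cannot parameterize a distribution,
    # so fall back to the empirical estimate instead of propagating NaN.
    if math.isnan(sd_nz) or sd_nz == 0:
        return calculate_empirical_p_value(row)
    return parametric_p_value(row)


def stub_parametric_p_value(row):
    # Placeholder result; a real implementation would fit a distribution.
    return 0.01


nan_sd_row = {"nnz": 2, "n": 10, "sd_nz": float("nan")}
finite_sd_row = {"nnz": 8, "n": 10, "sd_nz": 1.5}

fallback = calculate_p_value_guarded(nan_sd_row, stub_parametric_p_value)
modelled = calculate_p_value_guarded(finite_sd_row, stub_parametric_p_value)
```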

[Image: screenshot from 2019-03-04 12-16-20]

ben-heil commented 5 years ago

I wrote a PR that addresses this issue: https://github.com/hetio/hetmatpy/pull/12