Closed dhimmel closed 5 years ago
Using https://github.com/hetio/hetmatpy/commit/b4f82acbbbd66c90fc17da98e06588b3b10daecd, I received the following error:
_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f6a4f9f3e10>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f6a4f9f3e10>) ran in 0:01:11
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:06
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 316, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 353, in execute
    output = self.handle(*args, **options)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 295, in handle
    timed(self._populate_degree_grouped_permutation_table)(length)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/utils.py", line 15, in wrapper
    result = func(*args, **kwargs)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 192, in _populate_degree_grouped_permutation_table
    dgp_df = hetmatpy.pipeline.add_gamma_hurdle_to_dgp_df(dgp_df)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 52, in add_gamma_hurdle_to_dgp_df
    dgp_df['sd_nz'] = dgp_df[['sum_of_squares', 'sum', 'nnz']].apply(lambda row: calculate_sd(*row), raw=True, axis=1)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/frame.py", line 6487, in apply
    return op.get_result()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/apply.py", line 151, in get_result
    return self.apply_standard()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/apply.py", line 257, in apply_standard
    self.apply_series_generator()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/pandas/core/apply.py", line 286, in apply_series_generator
    results[i] = self.f(v)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 52, in <lambda>
    dgp_df['sd_nz'] = dgp_df[['sum_of_squares', 'sum', 'nnz']].apply(lambda row: calculate_sd(*row), raw=True, axis=1)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 22, in calculate_sd
    squared_deviations = sum_of_squares - unsquared_sum ** 2 / number_nonzero
ZeroDivisionError: ('float division by zero', 'occurred at index 0')
CC @ben-heil: looks like we're not accounting for the possibility of number_nonzero=0 in calculate_sd.
Update: fixed this error in https://github.com/hetio/hetmatpy/pull/8 / https://github.com/hetio/hetmatpy/commit/96f87f78afa6fa271c20245fb025dbe92dc84d37
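For reference, a guard along the following lines avoids the division by zero. This is only a sketch, not necessarily what the linked PR does: the argument names and the squared_deviations line come from the traceback, while the NaN return value, the rounding clamp, and the n - 1 denominator are assumptions.

```python
import math


def calculate_sd(sum_of_squares, unsquared_sum, number_nonzero):
    """Standard deviation of the nonzero values, computed from summary statistics."""
    # Assumption: with fewer than two nonzero values the standard deviation
    # is undefined, so return NaN instead of dividing by zero.
    if number_nonzero < 2:
        return float("nan")
    squared_deviations = sum_of_squares - unsquared_sum ** 2 / number_nonzero
    # Floating-point rounding can push this slightly below zero; clamp it.
    squared_deviations = max(squared_deviations, 0.0)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return math.sqrt(squared_deviations / (number_nonzero - 1))
```

With this guard, a row where nnz is 0 yields NaN for sd_nz rather than raising, so any downstream code has to handle NaN explicitly.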
Using the code from https://github.com/greenelab/hetmech-backend/pull/18/commits/23333adddc4e431c1eaec176bda7a2608ca5e8b1, the database import progressed further but hit another snag:
_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f3c51956e80>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7f3c51956e80>) ran in 0:01:10
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:05
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
_download_path_counts(length=2) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:04
_download_path_counts(length=3) ran in 0:00:02
_populate_degree_grouped_permutation_table(length=3) ran in 0:00:37
Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 316, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/django/core/management/base.py", line 353, in execute
    output = self.handle(*args, **options)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 296, in handle
    timed(self._populate_path_count_table)()
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/utils.py", line 15, in wrapper
    result = func(*args, **kwargs)
  File "/home/dhimmel/Documents/greene/hetmech-backend/dj_hetmech_app/management/commands/populate_database.py", line 252, in _populate_path_count_table
    for row in rows:
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 127, in combine_dwpc_dgp
    row['p_value'] = calculate_p_value(row)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 107, in calculate_p_value
    return calculate_empirical_p_value(row)
  File "/home/dhimmel/anaconda3/envs/hetmech-backend/lib/python3.7/site-packages/hetmatpy/pipeline.py", line 94, in calculate_empirical_p_value
    return row['nnz'] / row['n_dwpcs']
KeyError: 'n_dwpcs'
The KeyError occurs at line 94 of hetmatpy/pipeline.py, in calculate_empirical_p_value.
'n' is called 'n_dwpcs' in the database, so I assumed it would be called that in the row too, sorry.
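To illustrate the mismatch, here is a sketch of the lookup with the key renamed. It assumes the row is a dict-like object whose count field is named 'n', as the comment above suggests; the example row values are made up.

```python
def calculate_empirical_p_value(row):
    """Same ratio as in the traceback, reading 'n' instead of 'n_dwpcs'."""
    return row["nnz"] / row["n"]


# Hypothetical row for illustration: the key is 'n', not 'n_dwpcs'.
example_row = {"nnz": 25, "n": 100}

try:
    example_row["nnz"] / example_row["n_dwpcs"]  # the original lookup
except KeyError as error:
    print(f"original lookup fails: KeyError {error}")

print(calculate_empirical_p_value(example_row))
```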
We noticed something unusual: the n_dwpcs KeyError doesn't occur until the database starts populating length 3 metapaths, but we'd expect the calculate_empirical_p_value function to be called almost every time on the length 1 paths.
Upon closer inspection, this was my misunderstanding. Looking at the output more closely, the populate command completes _populate_degree_grouped_permutation_table for paths up to length 3 and only then fails in _populate_path_count_table with length 1. Great!
Database repopulation succeeded with the following output:
>>> python manage.py populate_database --max-metapath-length=3 --reduced-metapaths --batch-size=12000
_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7fca93b05400>) ran in 0:00:00
_hetionet_graph(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7fca93b05400>) ran in 0:01:13
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:06
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
_download_path_counts(length=2) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=2) ran in 0:00:04
_download_path_counts(length=3) ran in 0:00:00
_populate_degree_grouped_permutation_table(length=3) ran in 0:00:39
_populate_path_count_table() ran in 0:19:49
@ben-heil I'll merge this as the import succeeded. Let's get a pathology report to see if there are any red flags still in our dgp and path count tables.
We currently have two p_values that are getting set to NaN. I suspect this is because there isn't a condition to look for NaN standard deviations in hetmatpy/pipeline.py. I'll check on it and make a PR as needed.
Wrote a PR that addresses this issue at https://github.com/hetio/hetmatpy/pull/12
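As a rough sketch of the kind of check described above (not necessarily what the PR implements): test for a missing or NaN standard deviation before using it, and fall back to the empirical ratio in that case. The 'sd_nz' and 'nnz' names come from the tracebacks; the 'n' key and the parametric_p_value callable are hypothetical stand-ins.

```python
import math


def calculate_p_value(row, parametric_p_value):
    """Return a p-value for row, avoiding NaN from an undefined standard deviation.

    parametric_p_value is a hypothetical callable standing in for whatever
    model-based calculation the pipeline performs when sd_nz is usable.
    """
    sd = row.get("sd_nz")
    # Assumption: a missing or NaN standard deviation means the model-based
    # p-value cannot be computed, so fall back to the empirical ratio.
    if sd is None or math.isnan(sd):
        return row["nnz"] / row["n"]
    return parametric_p_value(row)
```

The point of the design is that NaN never reaches the p_value column silently: every row takes either the empirical or the model-based branch.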
Repopulating the database with these commands in the updated environment, after flushing the database as described in https://github.com/greenelab/hetmech-backend/issues/16#issuecomment-469055603: