Closed DHekstra closed 1 month ago
Did all of the failed runs include `--refine-uncertainties`?
Can you try increasing `--mc-samples` to 20 if you have enough memory, or decreasing `--learning-rate` to 1e-4 if you don't?
yes, I think all successful and failed runs had `--refine-uncertainties`.
I'll try both suggestions.
okay -- can you compile a list of parameters for which runs succeeded and failed? maybe the failures have something in common.
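One way to look for a common factor is to tabulate which settings each run used and keep only those present in every failure and in no success. The sketch below is illustrative only: the run records and the setting labels (`image_layers`, `student_t`, `refine_uncertainties`) are hypothetical placeholders, not careless flags or results from the runs discussed here.

```python
def settings_only_in_failures(runs):
    """Return settings used by every failed run and by no successful run."""
    failed = [set(r["settings"]) for r in runs if not r["ok"]]
    passed = [set(r["settings"]) for r in runs if r["ok"]]
    if not failed:
        return set()
    common_to_failures = set.intersection(*failed)
    seen_in_successes = set().union(*passed)
    return common_to_failures - seen_in_successes


# Hypothetical records: replace with the real parameter lists from each run.
runs = [
    {"settings": ["refine_uncertainties", "image_layers", "student_t"], "ok": False},
    {"settings": ["refine_uncertainties", "image_layers"], "ok": False},
    {"settings": ["refine_uncertainties", "student_t"], "ok": True},
    {"settings": ["refine_uncertainties"], "ok": True},
]

print(sorted(settings_only_in_failures(runs)))
```

A setting that appears in both groups (here `refine_uncertainties`) is filtered out, which is why a control run without the suspect setting matters.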
Tentatively the same problem as https://github.com/rs-station/careless/issues/61 which was resolved by https://github.com/rs-station/careless/pull/62. I'll report back.
not sure if this is causal, but the ev11 likelihood should be adjusted to use a shift in its bijectors for transformed variables: https://github.com/rs-station/careless/blob/5a1dbf5174c43fe4796b8d1e4f299f7fa3a268eb/careless/models/likelihoods/mono.py#L40
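To illustrate why a shift can matter, here is a minimal pure-Python sketch, not the careless implementation. A positive-valued parameter produced by a bare exponential can underflow to exactly zero, while adding a small constant keeps it bounded away from zero. The floor `EPS` and both function names are assumptions made for this example; in TensorFlow Probability the analogous construction would chain `tfp.bijectors.Shift` with the positive transform.

```python
import math

EPS = 1e-6  # hypothetical floor, not a value taken from careless


def exp_scale(raw):
    """Map an unconstrained value to a positive scale with a bare exponential."""
    return math.exp(raw)


def shifted_exp_scale(raw, shift=EPS):
    """Same map followed by a constant shift, so the result never drops below `shift`."""
    return shift + math.exp(raw)


raw = -800.0  # a very negative unconstrained value, as could occur if training drifts
print(exp_scale(raw))          # the exponential underflows to 0.0
print(shifted_exp_scale(raw))  # the shifted version stays at the floor
```

A scale of exactly zero is not a valid parameter for most distributions, so the unshifted transform can turn a drifting variable into a hard failure.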
note to self: `--epsilon` may help.
@DHekstra, is it true that the common factor in failed training runs was not Student T but rather image layers?
yes, that is true. this batch of runs did not include a no-image layer "control". the no-image-layer case did complete without problems previously.
@DorisMai found a bug (https://github.com/rs-station/careless/pull/122) in the surrogate posteriors which could have been leading to numerical instability. After I do the next release, it'd be nice to see if your issues go away.
Okay, @DHekstra, please give version 0.3.5 a try when you have a chance.
I think this is fully addressed by #167 and #168. I'm closing this until we hear of numerical issues cropping up again.
See attached files. Performing two-step inference for data processed in CrystFEL by AP, Careless run by KIW. NLL term diverges. This seems to be the key part of the traceback:
`Traceback (most recent call last): File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
run_careless(parser)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
history = model.train_model(
^^^^^^^^^^^^^^^^^^
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
_history = train_step((self, data))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal' defined at (most recent call last): File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
sys.exit(main())
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
run_careless(parser)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
history = model.train_model(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
_history = train_step((self, data))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 159, in train_step
history = model.train_step((data,))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 1050, in train_step
y_pred = self(x, training=True)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 558, in __call__
return super().__call__(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/base_layer.py", line 1145, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 121, in call
z_f = self.surrogate_posterior.sample(self.mc_sample_size)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/surrogate_posteriors.py", line 50, in sample
s = self.distribution.sample(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1205, in sample
return self._call_sample_n(sample_shape, seed, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1182, in _call_sample_n
samples = self._sample_n(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/truncated_normal.py", line 251, in _sample_n
return tf.random.stateless_parameterized_truncated_normal(
Node: 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal'
Detected at node 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal' defined at (most recent call last):
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
sys.exit(main())
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
run_careless(parser)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
history = model.train_model(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
_history = train_step((self, data))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 159, in train_step
history = model.train_step((data,))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 1050, in train_step
y_pred = self(x, training=True)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 558, in __call__
return super().__call__(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/base_layer.py", line 1145, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 121, in call
z_f = self.surrogate_posterior.sample(self.mc_sample_size)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/surrogate_posteriors.py", line 50, in sample
s = self.distribution.sample(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1205, in sample
return self._call_sample_n(sample_shape, seed, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1182, in _call_sample_n
samples = self._sample_n(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/truncated_normal.py", line 251, in _sample_n
return tf.random.stateless_parameterized_truncated_normal(
Node: 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal'
2 root error(s) found.
(0) INVALID_ARGUMENT: Invalid parameters
[[{{node variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal}}]]
[[variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal/_14]]
(1) INVALID_ARGUMENT: Invalid parameters
[[{{node variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_step_6249]`
careless_22576794.out.txt careless_22576794.err.txt inputs_params.log.txt slurm_script.txt
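The traceback ends with the truncated normal sampler rejecting its parameters. One plausible reading, which is an assumption since the log does not show the values, is that the location or scale became NaN, or the scale reached zero, once the NLL diverged. The helper below is a hypothetical pure-Python check, not part of careless or TensorFlow, sketching the kind of conditions worth inspecting on the surrogate posterior parameters before sampling.

```python
import math


def truncated_normal_problems(loc, scale, low, high):
    """List reasons why these truncated-normal parameters would be unusable."""
    problems = []
    for name, value in (("loc", loc), ("scale", scale), ("low", low), ("high", high)):
        if math.isnan(value):
            problems.append(f"{name} is NaN")
    if math.isinf(loc):
        problems.append("loc is infinite")
    if math.isinf(scale):
        problems.append("scale is infinite")
    # NaN comparisons are always False, so skip checks already reported above.
    if not math.isnan(scale) and not scale > 0:
        problems.append("scale is not positive")
    if not (math.isnan(low) or math.isnan(high)) and not low < high:
        problems.append("low is not below high")
    return problems


print(truncated_normal_problems(1.0, 0.5, 0.0, math.inf))       # healthy parameters
print(truncated_normal_problems(math.nan, 0.0, 0.0, math.inf))  # diverged parameters
```

An infinite upper bound is allowed here on purpose, since a distribution truncated only from below is a legitimate case.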