rs-station / careless

Merge X-ray diffraction data with Wilson's priors, variational inference, and metadata
MIT License

possible Student T instability? #110

Closed DHekstra closed 1 month ago

DHekstra commented 1 year ago

See attached files. Performing two-step inference for data processed in CrystFEL by AP, Careless run by KIW. NLL term diverges. This seems to be the key part of the traceback:

```
Traceback (most recent call last):
  File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
    run_careless(parser)
  File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
    history = model.train_model(
              ^^^^^^^^^^^^^^^^^^
  File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
    _history = train_step((self, data))
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal' defined at (most recent call last):
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
      sys.exit(main())
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
      run_careless(parser)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
      history = model.train_model(
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
      _history = train_step((self, data))
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 159, in train_step
      history = model.train_step((data,))
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 1050, in train_step
      y_pred = self(x, training=True)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 558, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 121, in call
      z_f = self.surrogate_posterior.sample(self.mc_sample_size)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/surrogate_posteriors.py", line 50, in sample
      s = self.distribution.sample(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1205, in sample
      return self._call_sample_n(sample_shape, seed, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1182, in _call_sample_n
      samples = self._sample_n(
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/truncated_normal.py", line 251, in _sample_n
      return tf.random.stateless_parameterized_truncated_normal(
Node: 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal'
Detected at node 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal' defined at (most recent call last):
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
      sys.exit(main())
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
      run_careless(parser)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
      history = model.train_model(
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
      _history = train_step((self, data))
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 159, in train_step
      history = model.train_step((data,))
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 1050, in train_step
      y_pred = self(x, training=True)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 558, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 121, in call
      z_f = self.surrogate_posterior.sample(self.mc_sample_size)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/surrogate_posteriors.py", line 50, in sample
      s = self.distribution.sample(*args, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1205, in sample
      return self._call_sample_n(sample_shape, seed, **kwargs)
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1182, in _call_sample_n
      samples = self._sample_n(
    File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/truncated_normal.py", line 251, in _sample_n
      return tf.random.stateless_parameterized_truncated_normal(
Node: 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal'
2 root error(s) found.
  (0) INVALID_ARGUMENT: Invalid parameters
     [[{{node variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal}}]]
     [[variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal/_14]]
  (1) INVALID_ARGUMENT: Invalid parameters
     [[{{node variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_step_6249]
```

careless_22576794.out.txt careless_22576794.err.txt inputs_params.log.txt slurm_script.txt
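
For anyone triaging a similar failure: the traceback ends with `Invalid parameters` raised by the truncated normal sampler, which would be consistent with the surrogate posterior's location or scale having become NaN, or the scale having collapsed to zero, once the NLL diverged. The sketch below is a plain-Python stand-in for that kind of validity check. The function names are hypothetical and are not part of careless or TensorFlow, the default bounds of `[0, inf)` are an assumption, and the exact conditions the TensorFlow op enforces may differ.

```python
import math


def truncated_normal_params_ok(loc, scale, low, high):
    """Return True if (loc, scale, low, high) describe a usable truncated normal."""
    # Any comparison against NaN is False, so a NaN scale or bound fails here too.
    return (
        math.isfinite(loc)
        and scale > 0.0
        and low < high
        and (math.isfinite(low) or math.isfinite(high))
    )


def find_invalid(locs, scales, low=0.0, high=math.inf):
    """Indices of entries whose parameters would be rejected by the check above."""
    return [
        i
        for i, (loc, scale) in enumerate(zip(locs, scales))
        if not truncated_normal_params_ok(loc, scale, low, high)
    ]


# Toy values: entry 1 has a NaN location, entry 2 a zero scale, entry 3 a NaN scale.
locs = [1.2, float("nan"), 0.7, 3.0]
scales = [0.3, 0.5, 0.0, float("nan")]
print(find_invalid(locs, scales))  # [1, 2, 3]
```

Running a check like this on the posterior parameters from the last training step before the crash would show whether the sampler received NaNs or a degenerate scale.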

kmdalton commented 1 year ago

Did all of the failed runs include --refine-uncertainties?

You can try increasing `--mc-samples` to 20 if you have enough memory, or decreasing `--learning-rate` to 1e-4 if you don't.
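
A rough illustration of why more Monte Carlo samples might help, assuming the instability comes from noisy sample-based estimates of the loss: averaging over more draws shrinks the variance of the estimate roughly as 1/n. This is a toy with a standard normal and the integrand `z**2`, not the careless model, and the function names are made up for the example.

```python
import random
import statistics


def mc_estimate(n_samples, rng):
    """Monte Carlo estimate of E[z**2] for z ~ N(0, 1) from n_samples draws."""
    draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n_samples)]
    return statistics.fmean(draws)


def estimator_variance(n_samples, n_repeats=2000, seed=0):
    """Empirical variance of the estimate across many independent repeats."""
    rng = random.Random(seed)
    estimates = [mc_estimate(n_samples, rng) for _ in range(n_repeats)]
    return statistics.variance(estimates)


# The true variance of a single z**2 draw is 2, so expect roughly 2 / n.
spread = {n: estimator_variance(n) for n in (1, 5, 20)}
for n, v in spread.items():
    print(f"{n:>2} samples: variance of the estimate ~ {v:.3f}")
```

Lowering the learning rate attacks the same problem from the other side: the steps taken on a noisy estimate are smaller, at the cost of slower convergence.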

DHekstra commented 1 year ago

yes, I think all successful and failed runs had --refine-uncertainties. I'll try both suggestions.

kmdalton commented 1 year ago

okay -- can you compile a list of parameters for which runs succeeded and failed? maybe the failures have something in common.

DHekstra commented 1 year ago

Tentatively the same problem as https://github.com/rs-station/careless/issues/61 which was resolved by https://github.com/rs-station/careless/pull/62. I'll report back.

kmdalton commented 1 year ago

not sure if this is causal, but the ev11 likelihood should be adjusted to use a shift in its bijectors for transformed variables: https://github.com/rs-station/careless/blob/5a1dbf5174c43fe4796b8d1e4f299f7fa3a268eb/careless/models/likelihoods/mono.py#L40
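
To make the idea of a shift concrete, here is a minimal sketch assuming the transformed variable is kept positive with a softplus-style transform. The bijectors actually used in `mono.py` may be different, and `positive_param` and its `shift` argument are hypothetical names for this illustration. Without a shift, a very negative unconstrained value underflows to exactly zero in floating point; adding a small constant keeps the parameter bounded away from zero.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def positive_param(x, shift=0.0):
    """Map an unconstrained value to a positive one, optionally shifted off zero."""
    return softplus(x) + shift


raw = -800.0  # an unconstrained value that has drifted far negative
print(positive_param(raw))              # 0.0: underflows to exactly zero
print(positive_param(raw, shift=1e-6))  # 1e-06: stays strictly positive
```

A parameter that reaches exactly zero can then produce infinities or NaNs in a log-likelihood, which is the kind of failure a shift is meant to prevent.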

DHekstra commented 1 year ago

note to self:

kmdalton commented 1 year ago

@DHekstra , is it true that the common factor in failed training runs was not Student T but rather image layers?

DHekstra commented 1 year ago

yes, that is true. this batch of runs did not include a no-image layer "control". the no-image-layer case did complete without problems previously.

kmdalton commented 1 year ago

@DorisMai found a bug (https://github.com/rs-station/careless/pull/122) in the surrogate posteriors which could have been leading to numerical instability. After I do the next release, it'd be nice to see if your issues go away.

kmdalton commented 1 year ago

Okay, @DHekstra , please give version 0.3.5 a try when you have a chance.

kmdalton commented 1 month ago

I think this is fully addressed by #167 and #168. I'm closing this until we hear of numerical issues cropping up again.