talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Inference breaks in TensorFlow 2.10 #1721

Closed talmo closed 7 months ago

talmo commented 7 months ago

Bug description

As part of an experimental move to the latest version of TensorFlow available on Windows (v2.10), we are now facing an issue during inference.

The logs below show a JIT compilation error raised inside find_global_peaks_rough, specifically on this line:

https://github.com/talmolab/sleap/blob/eb147646a79d057b508d7cbfa8f4c5e158601104/sleap/nn/peak_finding.py#L224

The exception (below) points to the modulo operation (the 'mod' node) as the problem. There are mathematical workarounds that avoid the modulo operation, but it would be good to find the underlying cause.
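One such workaround, sketched under assumptions: for non-negative x and positive n, x mod n equals x - (x // n) * n, so the channel index can be computed from floor division alone. The helper name below is hypothetical and not part of SLEAP. It uses plain operators, so it runs on Python integers as shown, and TensorFlow tensors overload the same operators. Whether this actually avoids the JIT failure on TF 2.10 has not been tested here.

```python
def mod_via_floordiv(x, n):
    """Return x mod n for non-negative x and positive n, without using %.

    Hypothetical helper (not part of SLEAP). Relies on the identity
    x mod n == x - (x // n) * n.
    """
    return x - (x // n) * n


# Assumption for illustration: total_peaks is samples * channels
# (here 2 samples x 3 channels); the issue does not state this.
channels = 3
total_peaks = 6
channel_subs = [mod_via_floordiv(i, channels) for i in range(total_peaks)]
print(channel_subs)  # [0, 1, 2, 0, 1, 2]
```

The result cycles through the channel indices once per sample, which is the same pattern the modulo expression on line 224 produces.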

In the meantime, I've deleted the conda package from our Anaconda channel. Because we version-fenced SLEAP permissively to allow newer versions of TensorFlow, the package was being downloaded even for the stable release (v1.3.3). This issue may have affected a small number of users (~100-200) who installed SLEAP after I pushed that conda package.

If we need to rebuild that conda package for testing, we can just rerun the jobs in this workflow to rebuild and reupload TensorFlow v2.10 to our conda channel. However, we should probably change the tag from main to dev so that users don't download the new release until it's fixed.

The short-term fix for anyone who runs into this is to run pip install tensorflow==2.7, after which everything should work.
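To check which situation an environment is in, a small script along these lines could classify the installed TensorFlow version. This is only a sketch: the function name and labels are made up for illustration, and the only facts it encodes are the two from this issue (2.7 works, 2.10 fails). Every other version is reported as untested. It also assumes the distribution is installed under the name tensorflow.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def classify_tf_version(installed: str) -> str:
    """Classify a TensorFlow version string using only what this issue reports."""
    match = re.match(r"(\d+)\.(\d+)", installed)
    if match is None:
        return "untested"
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor == (2, 7):
        return "known good"  # the short-term fix from this issue
    if major_minor == (2, 10):
        return "known bad"  # the version where inference breaks
    return "untested"


try:
    status = classify_tf_version(version("tensorflow"))
except PackageNotFoundError:
    status = "tensorflow not installed"
print(status)
```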

Actual behaviour

Evaluation and inference both fail with a centered instance model, specifically during the call to find_global_peaks_rough:

    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
         [[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
  (1) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]

Your personal set up

Environment packages

```
# paste output of `pip freeze` or `conda list` here
```

Logs

```
Epoch 5/5
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:24.427056: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 160 } dim { size: 160 } dim { size: 1 } } }
200/200 - 9s - loss: 4.0422e-04 - head: 7.4481e-04 - torso: 1.1064e-04 - tail_base: 3.5720e-04 - val_loss: 3.9666e-04 - val_head: 6.5480e-04 - val_torso: 1.2842e-04 - val_tail_base: 4.0678e-04 - lr: 1.0000e-04 - 9s/epoch - 47ms/step
INFO:sleap.nn.training:Finished training loop. [0.9 min]
INFO:sleap.nn.training:Deleting visualization directory: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz
INFO:sleap.nn.training:Saving evaluation metrics to model folder...
Predicting... ---------------------------------------- 0% ETA: -:--:-- ?
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:28.048155: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -3 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: -23 } dim { size: -24 } dim { size: 1 } } }
2024-03-22 13:26:28.058356: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -38 } dim { size: -39 } dim { size: -40 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -11 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48113909760 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: -42 } dim { size: -43 } dim { size: 1 } } }
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2024-03-22 13:26:28.906549: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: JIT compilation failed.
Predicting... ---------------------------------------- 0% ETA: -:--:-- ?
Traceback (most recent call last):
  File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
    sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
  File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
    trainer.train()
  File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
    self.evaluate()
  File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
    split_name="train",
  File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
    labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
  File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
    self._make_labeled_frames_from_generator(generator, data)
  File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
    for ex in generator:
  File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
    ex = process_batch(ex)
  File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
    preds = self.inference_model.predict_on_batch(ex, numpy=True)
  File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
    outs = super().predict_on_batch(data, **kwargs)
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
    outputs = self.predict_function(iterator)
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\tensorflow\python\eager\execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'mod' defined at (most recent call last):
    File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
      sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
    File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
      trainer.train()
    File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
      self.evaluate()
    File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
      split_name="train",
    File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
      labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
      self._make_labeled_frames_from_generator(generator, data)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
      for ex in generator:
    File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
      ex = process_batch(ex)
    File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
      preds = self.inference_model.predict_on_batch(ex, numpy=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
      outs = super().predict_on_batch(data, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
      outputs = self.predict_function(iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2041, in predict_function
      return step_function(self, iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2027, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2015, in run_step
      outputs = model.predict_step(data)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
      return self(x, training=False)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
Detected at node 'mod' defined at (most recent call last):
    File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
      sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
    File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
      trainer.train()
    File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
      self.evaluate()
    File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
      split_name="train",
    File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
      labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
      self._make_labeled_frames_from_generator(generator, data)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
      for ex in generator:
    File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
      ex = process_batch(ex)
    File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
      preds = self.inference_model.predict_on_batch(ex, numpy=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
      outs = super().predict_on_batch(data, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
      outputs = self.predict_function(iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2041, in predict_function
      return step_function(self, iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2027, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2015, in run_step
      outputs = model.predict_step(data)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
      return self(x, training=False)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN: JIT compilation failed.
         [[{{node mod}}]]
         [[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
  (1) UNKNOWN: JIT compilation failed.
         [[{{node mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_predict_function_37158]
INFO:sleap.nn.callbacks:Closing the reporter controller/context.
INFO:sleap.nn.callbacks:Closing the training controller socket/context.
Run Path: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1
Saving config: C:\Users\Talmo/.sleap/1.3.3/preferences.yaml
```

How to reproduce

Run inference with a top-down model (specifically the centered instance portion).
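To help narrow this down, the failing expression from peak_finding.py line 224 could also be run on its own, outside of any model. The sketch below is a probe under assumptions, not a confirmed reproduction: the function name and the input sizes (6 and 3) are made up, channels is passed as a plain integer, and the issue does not establish that the op fails in isolation. It reports the outcome rather than raising, so it can also be run where TensorFlow is not installed.

```python
def probe_int64_mod(total_peaks: int = 6, channels: int = 3) -> str:
    """Run the expression from the traceback in isolation and report the outcome."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"

    try:
        # Same expression as in find_global_peaks_rough, per the traceback.
        channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
        channel_subs.numpy()  # copy the result to host memory
    except tf.errors.OpError as err:
        return f"failed: {type(err).__name__}"
    return "ok"


result = probe_int64_mod()
print(result)
```

If this prints "ok" on an affected machine, the failure likely depends on something else in the inference graph and the full top-down model is still needed to reproduce it.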

talmo commented 7 months ago

Solution: Stay on TF 2.7 😢