Bug description

As part of an experimental move to the latest version of TensorFlow available on Windows (v2.10), we are now facing an issue during inference.

The logs below reveal that we're getting a weird error when using `find_global_peaks_rough`, specifically on this line: https://github.com/talmolab/sleap/blob/eb147646a79d057b508d7cbfa8f4c5e158601104/sleap/nn/peak_finding.py#L224

The exception (below) hints that the modulo operation is the problem. There are obvious mathematical workarounds that avoid the modulo operation, but it'd be good to dig into it.
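For reference, one modulo-free equivalent is sketched below. This is illustrative only (the variable names are stand-ins, not a patch to `find_global_peaks_rough`): since the flattened peaks come one per (sample, channel) pair, the channel index simply repeats 0..channels-1 for each sample, so it can be tiled rather than computed with `%`.

```python
import tensorflow as tf

# Stand-in shapes; in find_global_peaks_rough these come from the confidence maps.
samples, channels = 4, 3
total_peaks = samples * channels

# Current approach (the op implicated in the JIT failure):
channel_subs_mod = tf.range(total_peaks, dtype=tf.int64) % channels

# Possible modulo-free equivalent: tile the per-sample channel indices.
channel_subs_tiled = tf.tile(tf.range(channels, dtype=tf.int64), [samples])

assert bool(tf.reduce_all(channel_subs_mod == channel_subs_tiled))
```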
In the meantime, I've deleted the conda package from our Anaconda channel, since it was getting downloaded even for the stable release (v1.3.3) because we had version-fenced SLEAP permissively enough to allow newer versions of TensorFlow. This issue may have affected the small number of users (~100-200) who installed SLEAP after I pushed that conda package, though.
If we need to rebuild that conda package for testing, we can just rerun the jobs in this workflow to rebuild and reupload TensorFlow v2.10 to our conda channel. We should probably change the tag from `main` to `dev` to prevent users from downloading the new release until it's fixed, though.
The short-term fix, if others run into this, is to just `pip install tensorflow==2.7` and everything should work.
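If it helps to confirm the downgrade took effect, a quick sanity check in the same environment:

```python
import tensorflow as tf

print(tf.__version__)                          # expect 2.7.x after the downgrade
print(tf.config.list_physical_devices("GPU"))  # GPU should still be visible
```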
Actual behaviour
Inference breaks during evaluation or inference with a centered instance model, specifically during the call to `find_global_peaks_rough`:
File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
peaks_output = self.instance_peaks(crop_output)
File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
if self.offsets_ind is None:
File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
rough_peaks, peak_vals = find_global_peaks_rough(
File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node mod}}]]
[[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
(1) UNKNOWN: JIT compilation failed.
[[{{node mod}}]]
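As a possible way to isolate this outside of SLEAP, here's a guess at a minimal reproduction. It assumes the failure comes from XLA trying to JIT-compile the `mod` node on the GPU (as the libdevice errors in the logs suggest); that assumption hasn't been verified:

```python
import tensorflow as tf

# Speculative repro: explicitly JIT-compile the same kind of int64 modulo that
# find_global_peaks_rough performs. On a broken XLA/libdevice setup this should
# fail with "JIT compilation failed"; on a healthy one it prints 0, 1, 2, 0, 1, 2, ...
@tf.function(jit_compile=True)
def channel_mod(x, k):
    return x % k

print(channel_mod(tf.range(12, dtype=tf.int64), tf.constant(3, dtype=tf.int64)))
```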
Your personal set up
Commit: eb14764
Environment packages
```
# paste output of `pip freeze` or `conda list` here
```

Logs
```
Epoch 5/5
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:24.427056: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 160 } dim { size: 160 } dim { size: 1 } } }
200/200 - 9s - loss: 4.0422e-04 - head: 7.4481e-04 - torso: 1.1064e-04 - tail_base: 3.5720e-04 - val_loss: 3.9666e-04 - val_head: 6.5480e-04 - val_torso: 1.2842e-04 - val_tail_base: 4.0678e-04 - lr: 1.0000e-04 - 9s/epoch - 47ms/step
INFO:sleap.nn.training:Finished training loop. [0.9 min]
INFO:sleap.nn.training:Deleting visualization directory: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz
INFO:sleap.nn.training:Saving evaluation metrics to model folder...
Predicting... ---------------------------------------- 0% ETA: -:--:-- ?
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:28.048155: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -3 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: -23 } dim { size: -24 } dim { size: 1 } } }
2024-03-22 13:26:28.058356: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -38 } dim { size: -39 } dim { size: -40 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -11 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48113909760 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: -42 } dim { size: -43 } dim { size: 1 } } }
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2024-03-22 13:26:28.906549: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: JIT compilation failed.
Predicting... ---------------------------------------- 0% ETA: -:--:-- ?
Traceback (most recent call last):
  File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in
```

Screenshots
How to reproduce
Run inference with a top-down model (specifically the centered instance portion).
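A rough programmatic equivalent (a sketch, assuming the top-level `sleap.load_model`/`sleap.load_video` helpers; the paths are placeholders for a trained centroid + centered-instance model pair):

```python
import sleap

# Placeholder model folders from a top-down training run (centroid + centered instance).
predictor = sleap.load_model([
    "models/240322_132344.centroid.n=1",
    "models/240322_132513.centered_instance.n=1",
])

video = sleap.load_video("recording.mp4")  # placeholder clip

# On TF 2.10 (Windows) this crashes inside find_global_peaks_rough;
# with tensorflow==2.7 it completes normally.
labels = predictor.predict(video)
```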