Closed olivier-cuttlefish closed 1 year ago
Still looking into this, but as a quick update: I ran a test in the GUI by creating a .pkg.slp and then running training and inference on it, mainly to double-check the command line call, which seems correct:
Command line call:
sleap-track /Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/courtship_labels.pkg.slp --only-suggested-frames -m /Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/models/231016_063130.centroid.n=149 -m /Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/models/231016_070843.centered_instance.n=149 -o /Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/predictions/courtship_labels.pkg.slp.231016_072541.predictions.slp --verbosity json --no-empty-frames
Started inference at: 2023-10-16 07:25:46.154717
Args:
{
│ 'data_path': '/Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/courtship_labels.pkg.slp',
│ 'models': [
│ │ '/Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/models/231016_063130.centroid.n=149',
│ │ '/Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/models/231016_070843.centered_instance.n=149'
│ ],
│ 'frames': '',
│ 'only_labeled_frames': False,
│ 'only_suggested_frames': True,
2023-10-16 07:25:46.723577: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-16 07:25:46.723760: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
│ 'output': '/Users/liezlmaree/Projects/sleap-datasets/drosophila-melanogaster-courtship/predictions/courtship_labels.pkg.slp.231016_072541.predictions.slp',
│ 'no_empty_frames': True,
│ 'verbosity': 'json',
│ 'video.dataset': None,
│ 'video.input_format': 'channels_last',
│ 'video.index': '',
│ 'cpu': False,
│ 'first_gpu': False,
│ 'last_gpu': False,
│ 'gpu': 'auto',
│ 'max_edge_length_ratio': 0.25,
│ 'dist_penalty_weight': 1.0,
│ 'batch_size': 4,
│ 'open_in_gui': False,
│ 'peak_threshold': 0.2,
│ 'max_instances': None,
│ 'tracking.tracker': None,
│ 'tracking.max_tracking': None,
│ 'tracking.max_tracks': None,
│ 'tracking.target_instance_count': None,
2023-10-16 07:25:47.767238: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
│ 'tracking.pre_cull_to_target': None,
│ 'tracking.pre_cull_iou_threshold': None,
│ 'tracking.post_connect_single_breaks': None,
│ 'tracking.clean_instance_count': None,
│ 'tracking.clean_iou_threshold': None,
│ 'tracking.similarity': None,
│ 'tracking.match': None,
│ 'tracking.robust': None,
│ 'tracking.track_window': None,
│ 'tracking.min_new_track_points': None,
│ 'tracking.min_match_points': None,
│ 'tracking.img_scale': None,
│ 'tracking.of_window_size': None,
│ 'tracking.of_max_levels': None,
│ 'tracking.save_shifted_instances': None,
│ 'tracking.kf_node_indices': None,
│ 'tracking.kf_init_frame_count': None
}
INFO:sleap.nn.inference:Failed to query GPU memory from nvidia-smi. Defaulting to first GPU.
Metal device set to: Apple M2 Pro
2023-10-16 07:25:49.822852: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2023-10-16 07:25:49.907230: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -45 } dim { size: -46 } dim { size: -47 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -15 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -15 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" model: "0" num_cores: 10 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -15 } dim { size: -48 } dim { size: -49 } dim { size: 1 } } }
2023-10-16 07:25:49.907533: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 4 } dim { size: 1024 } dim { size: 1024 } dim { size: 3 } } } inputs { dtype: DT_FLOAT shape { dim { size: -15 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -15 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" model: "0" num_cores: 10 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -15 } dim { size: -56 } dim { size: -57 } dim { size: 3 } } }
2023-10-16 07:25:49.911132: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -91 } dim { size: -92 } dim { size: -93 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -20 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -20 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" model: "0" num_cores: 10 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -20 } dim { size: -95 } dim { size: -96 } dim { size: 1 } } }
Versions:
SLEAP: 1.3.3
TensorFlow: 2.9.2
Numpy: 1.22.3
Python: 3.9.15
OS: macOS-13.5-arm64-arm-64bit
System:
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initalized: False
Memory growth: True
Finished inference at: 2023-10-16 07:25:50.748722
Total runtime: 4.5940141677856445 secs
Predicted frames: 20/20
Process return code: 0
skipped 98 redundant instances
Thank you very much for your quick response. I tried to run it locally and indeed managed to make it work:
sleap-track -m "models/231013_164017.multi_instance" --only-labeled-frames -o "labels002_predictions.slp" "labels.v002.merged.pkg.slp"
Started inference at: 2023-10-17 10:09:25.229688
Args:
{
│ 'data_path': 'labels.v002.merged.pkg.slp',
│ 'models': ['models/231013_164017.multi_instance'],
│ 'frames': '',
│ 'only_labeled_frames': True,
│ 'only_suggested_frames': False,
│ 'output': 'labels002_predictions.slp',
│ 'no_empty_frames': False,
│ 'verbosity': 'rich',
│ 'video.dataset': None,
│ 'video.input_format': 'channels_last',
│ 'video.index': '',
│ 'cpu': False,
│ 'first_gpu': False,
│ 'last_gpu': False,
│ 'gpu': 'auto',
│ 'max_edge_length_ratio': 0.25,
│ 'dist_penalty_weight': 1.0,
│ 'batch_size': 4,
│ 'open_in_gui': False,
│ 'peak_threshold': 0.2,
│ 'max_instances': None,
│ 'tracking.tracker': None,
│ 'tracking.max_tracking': None,
│ 'tracking.max_tracks': None,
│ 'tracking.target_instance_count': None,
│ 'tracking.pre_cull_to_target': None,
│ 'tracking.pre_cull_iou_threshold': None,
│ 'tracking.post_connect_single_breaks': None,
│ 'tracking.clean_instance_count': None,
│ 'tracking.clean_iou_threshold': None,
│ 'tracking.similarity': None,
│ 'tracking.match': None,
│ 'tracking.robust': None,
│ 'tracking.track_window': None,
│ 'tracking.min_new_track_points': None,
│ 'tracking.min_match_points': None,
│ 'tracking.img_scale': None,
│ 'tracking.of_window_size': None,
│ 'tracking.of_max_levels': None,
│ 'tracking.save_shifted_instances': None,
│ 'tracking.kf_node_indices': None,
│ 'tracking.kf_init_frame_count': None
}
2023-10-17 10:09:25.261963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:25.265508: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:25.265614: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:sleap.nn.inference:Auto-selected GPU 0 with 10980 MiB of free memory.
Versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-5.15.0-84-generic-x86_64-with-debian-bullseye-sid
System:
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initalized: False
Memory growth: True
2023-10-17 10:09:25.937969: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-17 10:09:25.939208: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:25.939381: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:25.939471: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:26.215092: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:26.215278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:26.215386: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-17 10:09:26.215463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9172 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:01:00.0, compute capability: 8.6
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: -:--:-- ?2023-10-17 10:09:38.259870: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
2023-10-17 10:09:39.178450: W tensorflow/core/common_runtime/bfc_allocator.cc:343] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2023-10-17 10:09:41.077182: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.95GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-10-17 10:09:41.077213: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% ETA: 0:00:00 ?
Finished inference at: 2023-10-17 10:10:57.183653
Total runtime: 91.95398092269897 secs
Predicted frames: 125/125
Provenance:
{
│ 'model_paths': ['models/231013_164017.multi_instance/training_config.json'],
│ 'predictor': 'BottomUpPredictor',
│ 'sleap_version': '1.3.3',
│ 'platform': 'Linux-5.15.0-84-generic-x86_64-with-debian-bullseye-sid',
│ 'command': '/home/xxx/mambaforge/envs/sleap133/bin/sleap-track -m models/231013_164017.multi_instance --only-labeled-frames -o labels002_predictions.slp labels.v002.merged.pkg.slp',
│ 'data_path': 'labels.v002.merged.pkg.slp',
│ 'output_path': 'labels002_predictions.slp',
│ 'total_elapsed': 91.95398092269897,
│ 'start_timestamp': '2023-10-17 10:09:25.229688',
│ 'finish_timestamp': '2023-10-17 10:10:57.183653'
}
Saved output: labels002_predictions.slp
However, the prediction file seems to be faulty and fails to open in the GUI: whether I open it directly or merge it into the project, it throws an h5py-based error.
Traceback (most recent call last):
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 289, in openProject
self.execute(OpenProject, filename=filename, first_open=first_open)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 242, in execute
command().execute(context=self, params=kwargs)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 138, in execute
self.do_with_signal(context, params)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 162, in do_with_signal
cls.do_action(context, params)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 727, in do_action
context.loadProjectFile(filename)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 274, in loadProjectFile
self.execute(LoadProjectFile, filename=filename)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 242, in execute
command().execute(context=self, params=kwargs)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 138, in execute
self.do_with_signal(context, params)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 162, in do_with_signal
cls.do_action(context, params)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/commands.py", line 675, in do_action
context.app.on_data_update([UpdateTopic.project, UpdateTopic.all])
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/app.py", line 1166, in on_data_update
self.videos_dock.table.model().items = self.labels.videos
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/dataviews.py", line 103, in items
item_data = self.item_to_data(obj, item)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/dataviews.py", line 392, in item_to_data
return {key: getattr(item, key) for key in self.properties}
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/gui/dataviews.py", line 392, in <dictcomp>
return {key: getattr(item, key) for key in self.properties}
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/io/video.py", line 1046, in __getattr__
return getattr(self.backend, item)
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/io/video.py", line 251, in frames
return self.__dataset_h5.shape[0]
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/io/video.py", line 154, in __dataset_h5
self._load()
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/sleap/io/video.py", line 131, in _load
self.__dataset_h5 = self.__file_h5[self.dataset]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/home/xxx/mambaforge/envs/sleap133/lib/python3.7/site-packages/h5py/_hl/group.py", line 288, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (component not found)'
I have a very strong suspicion that it comes from messy handling of my datasets, as there were several steps of using exported packages, then merging other videos into them, exporting again, etc. This might have introduced some issues with path management.
Ideally, I would love to start clean with my videos and a new project, but still keep the hundreds of annotations I made before, both for training and for getting metrics from them after inference.
Hi @olivier-cuttlefish,
Are you able to use the trained model to predict on just a normal .slp file instead of the .pkg.slp? This will only work locally (since you probably don't have the videos uploaded to the drive), but it'll at least allow you to use the model to make new labels pretty quickly.
Right now you have a .pkg.slp listed as the data_path argument, but I'd like to try it with a normal .slp as the data_path.
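As a rough sketch of that suggestion, the snippet below assembles the same sleap-track call used earlier in this thread, but pointed at a regular .slp file. The helper names `is_package` and `build_track_command` and the file name `labels.v002.slp` are hypothetical and only for illustration; the flags are the ones already shown in the commands above.

```python
import shlex
import subprocess


def is_package(labels_path):
    """Return True if the path looks like a training package (.pkg.slp)."""
    # Assumption for this sketch: packages are recognized by suffix only.
    return str(labels_path).endswith(".pkg.slp")


def build_track_command(model_dir, labels_path, output_path):
    """Build the sleap-track argument list for predicting on labeled frames."""
    return [
        "sleap-track",
        "-m", str(model_dir),
        "--only-labeled-frames",
        "-o", str(output_path),
        str(labels_path),
    ]


def run_tracking(model_dir, labels_path, output_path):
    """Print the command, then run it (requires sleap-track on the PATH)."""
    if is_package(labels_path):
        print("Note: data_path is a .pkg.slp; a regular .slp may behave differently.")
    cmd = build_track_command(model_dir, labels_path, output_path)
    print(" ".join(shlex.quote(part) for part in cmd))
    return subprocess.run(cmd, check=True)


# Example call (hypothetical labels file name):
# run_tracking("models/231013_164017.multi_instance",
#              "labels.v002.slp", "labels002_predictions.slp")
```

Passing the arguments as a list avoids shell quoting problems with model directory names that contain characters such as `=`.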
It definitely seems like something has happened to the video paths upon merging. Packages export only the images needed for training (and for inference on suggestions), so they no longer reference the original videos and instead reference a table in the h5 file (the .pkg.slp).
The .pkg.slp files are intended to only be used for exporting for remote training. They shouldn't really have new labels added to them. This discussion deals with a similar situation and might be helpful as well.
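Since a .pkg.slp is an h5 file, one way to investigate the KeyError from the traceback above (h5py reporting "component not found" when a dataset is looked up by name) is to open the file with h5py and check which names it actually contains. This is only a minimal sketch: `missing_datasets` and `check_package` are hypothetical helpers, and the dataset names to check are whatever your project's videos reference, which this thread does not show.

```python
def missing_datasets(h5_file, dataset_names):
    """Return the names in dataset_names that h5_file does not contain.

    h5_file only needs to support the `in` operator, so an open
    h5py.File works, and so does a plain dict in a quick test.
    """
    return [name for name in dataset_names if name not in h5_file]


def check_package(path, dataset_names):
    """Open an HDF5 file read-only and report which datasets are missing."""
    import h5py  # imported here so missing_datasets works without h5py

    with h5py.File(path, "r") as f:
        # Top-level groups and datasets stored in the file.
        print("Top-level entries:", list(f.keys()))
        return missing_datasets(f, dataset_names)


# Example call (the dataset name is a placeholder, not taken from this thread):
# print(check_package("labels002_predictions.slp", ["<dataset name>"]))
```

If a name referenced by the project turns up in the missing list, the file is pointing at image data it does not contain, which would be consistent with the error seen when opening the predictions in the GUI.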
Hi @roomrys, sorry for the late reply; I was very busy these past few days. I have indeed been able to predict on videos using the trained model. I have done other rounds of labeling and training, and now my project file seems to be more stable in terms of pathfinding. Not sure how I did that, though... Also, thank you for the discussion thread you shared; the script there looks useful if I ever run into path issues with my projects. Thank you very much for your help! :)
Hello, First of all, thank you so much for your amazing tool and the support you are providing. I am using v1.3.3. I ran my training on the clusters. Now, I would like to run inference on the labeled frames contained in the training package. The command I am running is
sleap-track -m "models/231013_164017.multi_instance" --only-labeled-frames -o "labels002_predictions.slp" "labels.v002.merged.pkg.slp"
I also tried using --only-suggested-frames.
In both cases, it seems that SLEAP is trying to search for the original videos and pull the frames from there, while it should find them in the training package (as it was able to do for training).
Here is the traceback (/home/o/o-xxx/ is on the cluster, while /home/xxx/Documents/ are local paths):