talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Not able to initiate centered instance during training #1118

Closed BaylorBrangers closed 1 year ago

BaylorBrangers commented 1 year ago

Hi, during the initial training of the network, training fails on the first epoch while training the centered instance model. I am not getting any obvious errors in the terminal and was wondering if anyone else has run into this problem. Any help would be greatly appreciated.

I have pasted the output from the terminal below. Please let me know if there is any more information that is needed. Thanks!

INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
video search paths:  ['/media/baylor/Data_Party2/fostrapDREADDS_LVL/141/DCZ_22_07_21/acc_test_videoTop2021-07-22T14_49_59.avi']
[Video(backend=MediaVideo(filename='/media/baylor/Data_Party2/fostrapDREADDS_LVL/141/DCZ_22_07_21/acc_test_videoTop2021-07-22T14_49_59.avi', grayscale=True, bgr=True, dataset='', input_format=''))]
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2023-01-13 15:36:32.042013: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-01-13 15:36:32.067149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-13 15:36:32.067357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.92GiB deviceMemoryBandwidth: 238.66GiB/s
2023-01-13 15:36:32.067416: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067461: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067505: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067549: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067623: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067665: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067708: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067714: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1592] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-01-13 15:36:32.067900: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-01-13 15:36:32.091069: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2023-01-13 15:36:32.091616: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55caf3965b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-13 15:36:32.091632: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-01-13 15:36:32.141660: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-13 15:36:32.141833: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55caf3931a70 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-01-13 15:36:32.141847: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1070, Compute Capability 6.1
2023-01-13 15:36:32.141938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-01-13 15:36:32.141944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      
WARNING:tensorflow:From /home/baylor/anaconda3/envs/sleap_env/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py:4051: setdiff1d (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2018-11-30.
Instructions for updating:
This op will be removed after the deprecation date. Please switch to tf.sets.difference().
INFO:sleap.nn.training:Loaded test example. [1.266s]
INFO:sleap.nn.training:  Input shape: (160, 160, 1)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=24, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=2, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 16
INFO:sleap.nn.training:  Parameters: 4,310,475
INFO:sleap.nn.training:  Heads: 
INFO:sleap.nn.training:  heads[0] = CenteredInstanceConfmapsHead(part_names=['head', 'body', 'tail'], anchor_part='body', sigma=5.0, output_stride=4, loss_weight=1.0)
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 18
INFO:sleap.nn.training:Validation set: n = 2
INFO:sleap.nn.training:Setting up optimization...
INFO:root:  OHKM enabled: HardKeypointMiningConfig(online_mining=True, hard_to_easy_ratio=2.0, min_hard_keypoints=2, max_hard_keypoints=None, loss_scale=5.0)
INFO:root:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:root:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training:  ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training:  ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: /home/baylor/Desktop/New Folder/models/230113_153630.centered_instance.20
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [3.1s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [1.8s]
INFO:sleap.nn.training:Starting training loop...
Train for 200 steps, validate for 10 steps
Epoch 1/200
Run Path: /home/baylor/Desktop/New Folder/models/230113_153630.centered_instance.20
^CTraceback (most recent call last):
  File "/home/baylor/anaconda3/envs/sleap_env/lib/python3.6/site-packages/sleap/gui/app.py", line 893, in _update_gui_state
    control_key_down = QApplication.queryKeyboardModifiers() == Qt.ControlModifier

talmo commented 1 year ago

Hi @BaylorBrangers,

Just giving you a heads up that we're having a lab-wide event this week and will be a bit slower in responding to support requests.

In the meantime, just a couple of quick suggestions:

We'll check back in next week when we're back to regular operations :)

Cheers,

Talmo

roomrys commented 1 year ago

Closing this issue due to inactivity. Comment below if you run into this same problem, and I will reopen it. Thanks!
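
For anyone landing here with the same symptom: the log in the original report shows TensorFlow failing to open several CUDA libraries (`libcudart.so.10.1`, `libcudnn.so.7`, and others) and then printing "Skipping registering GPU devices...", so that run was set up without the GPU. Whether this explains the stall at epoch 1 was never confirmed in this thread. As a minimal sketch, assuming a Linux system and using only the Python standard library, the following checks whether the dynamic loader can open the libraries named in those warnings. The helper name `missing_libraries` is made up for illustration and is not part of SLEAP or TensorFlow.

```python
import ctypes

# Library names copied from the "Could not load dynamic library"
# warnings in the log above. Adjust them if your log names other versions.
CUDA_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def missing_libraries(names):
    """Return the names that the dynamic loader cannot open (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            # CDLL raises OSError when the shared object cannot be found or loaded.
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    not_found = missing_libraries(CUDA_LIBS)
    if not_found:
        for name in not_found:
            print("cannot load:", name)
    else:
        print("all listed CUDA libraries could be opened")
```

Run it inside the same conda environment used for training (`sleap_env` in the log). If it reports missing libraries, the environment likely lacks a CUDA/cuDNN installation that matches what the TensorFlow build expects.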