Not able to initiate centered instance during training

BaylorBrangers commented 1 year ago

Hi, during the initial training of the network, training fails in the first epoch of the centered instance model. I am not getting any obvious errors in the terminal and was wondering whether anyone else has run into this. Any help would be greatly appreciated.

I have pasted the output from the terminal below. Please let me know if you need any more information. Thanks!

INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
video search paths:  ['/media/baylor/Data_Party2/fostrapDREADDS_LVL/141/DCZ_22_07_21/acc_test_videoTop2021-07-22T14_49_59.avi']
[Video(backend=MediaVideo(filename='/media/baylor/Data_Party2/fostrapDREADDS_LVL/141/DCZ_22_07_21/acc_test_videoTop2021-07-22T14_49_59.avi', grayscale=True, bgr=True, dataset='', input_format=''))]
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2023-01-13 15:36:32.042013: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-01-13 15:36:32.067149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-13 15:36:32.067357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.92GiB deviceMemoryBandwidth: 238.66GiB/s
2023-01-13 15:36:32.067416: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067461: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067505: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067549: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067623: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067665: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067708: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2023-01-13 15:36:32.067714: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1592] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-01-13 15:36:32.067900: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-01-13 15:36:32.091069: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2023-01-13 15:36:32.091616: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55caf3965b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-13 15:36:32.091632: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-01-13 15:36:32.141660: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-13 15:36:32.141833: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55caf3931a70 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-01-13 15:36:32.141847: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1070, Compute Capability 6.1
2023-01-13 15:36:32.141938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-01-13 15:36:32.141944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      
WARNING:tensorflow:From /home/baylor/anaconda3/envs/sleap_env/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py:4051: setdiff1d (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2018-11-30.
Instructions for updating:
This op will be removed after the deprecation date. Please switch to tf.sets.difference().
INFO:sleap.nn.training:Loaded test example. [1.266s]
INFO:sleap.nn.training:  Input shape: (160, 160, 1)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=24, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=2, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 16
INFO:sleap.nn.training:  Parameters: 4,310,475
INFO:sleap.nn.training:  Heads: 
INFO:sleap.nn.training:  heads[0] = CenteredInstanceConfmapsHead(part_names=['head', 'body', 'tail'], anchor_part='body', sigma=5.0, output_stride=4, loss_weight=1.0)
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 18
INFO:sleap.nn.training:Validation set: n = 2
INFO:sleap.nn.training:Setting up optimization...
INFO:root:  OHKM enabled: HardKeypointMiningConfig(online_mining=True, hard_to_easy_ratio=2.0, min_hard_keypoints=2, max_hard_keypoints=None, loss_scale=5.0)
INFO:root:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:root:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training:  ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training:  ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: /home/baylor/Desktop/New Folder/models/230113_153630.centered_instance.20
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [3.1s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [1.8s]
INFO:sleap.nn.training:Starting training loop...
Train for 200 steps, validate for 10 steps
Epoch 1/200
Run Path: /home/baylor/Desktop/New Folder/models/230113_153630.centered_instance.20
^CTraceback (most recent call last):
  File "/home/baylor/anaconda3/envs/sleap_env/lib/python3.6/site-packages/sleap/gui/app.py", line 893, in _update_gui_state
    control_key_down = QApplication.queryKeyboardModifiers() == Qt.ControlModifier
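
For anyone reading a similar log: the lines that matter are the `Could not load dynamic library` warnings and the `Skipping registering GPU devices...` line that follows them. A minimal sketch for pulling those out of a saved log, using only the Python standard library, is below. The function names and the `SAMPLE_LOG` excerpt are illustrative, not part of SLEAP or TensorFlow; the two sample lines are shortened copies of warnings from the log above.

```python
import re

# Matches the TensorFlow dso_loader warning text seen in the log, e.g.
# "Could not load dynamic library 'libcudart.so.10.1'; dlerror: ..."
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_gpu_libraries(log_text):
    """Return library names TensorFlow failed to load, in log order, de-duplicated."""
    names = []
    for name in MISSING_LIB.findall(log_text):
        if name not in names:
            names.append(name)
    return names


def gpu_was_skipped(log_text):
    """True if the log says TensorFlow gave up on registering GPU devices."""
    return "Skipping registering GPU devices" in log_text


# Shortened excerpt of the log above (hypothetical variable, for illustration only).
SAMPLE_LOG = """\
W dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
W dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
Skipping registering GPU devices...
"""

if __name__ == "__main__":
    print("Missing libraries:", missing_gpu_libraries(SAMPLE_LOG))
    print("GPU skipped:", gpu_was_skipped(SAMPLE_LOG))
```

To use it on a real run, read the saved terminal output into a string and pass it to the two functions instead of `SAMPLE_LOG`.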

talmo commented 1 year ago

Hi @BaylorBrangers,

Just giving you a heads up that we're having a lab-wide event this week and will be a bit slower in responding to support requests.

In the meantime, just a couple of quick suggestions:

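Based on your logs, it looks like you don't have GPU support: TensorFlow could not load the CUDA libraries (libcudart.so.10.1, libcudnn.so.7 and others) and skipped registering GPU devices, so training falls back to the CPU. Did you try installing via the conda installation method?
If you did install via conda and this is still happening, you might need to update your GPU drivers.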
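
One rough way to confirm the missing GPU libraries independently of SLEAP is to try opening them from Python inside the same conda environment. The sketch below uses only the standard-library `ctypes` module; the library names are copied from the warnings in the log, and the helper names `can_load` and `check_libraries` are made up for illustration. It assumes that `ctypes` searches for shared libraries in roughly the same places TensorFlow does, which may not hold exactly on every setup.

```python
import ctypes

# Library names copied from the "Could not load dynamic library" warnings in the log.
CUDA_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def can_load(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the library is not found or cannot be opened.
        return False


def check_libraries(names):
    """Map each library name to whether it could be opened."""
    return {name: can_load(name) for name in names}


report = check_libraries(CUDA_LIBS)
for lib_name, ok in report.items():
    print(f"{lib_name}: {'found' if ok else 'MISSING'}")
```

If every entry prints `MISSING`, the environment most likely has no CUDA runtime at all, which points toward the installation method. If only some entries are missing, a version mismatch is more likely.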

We'll check back in next week when we're back to regular operations :)

Cheers,

Talmo

roomrys commented 1 year ago

Closing this issue due to inactivity. Comment below if you run into this same problem, and I will reopen it. Thanks!

talmolab / sleap

Not able to initiate centered instance during training #1118