talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Error while training centroid - It stops after the second or third epoch #587

Closed Felippe-espinelli closed 3 years ago

Felippe-espinelli commented 3 years ago

Hi developers,

I was running the SLEAP GUI with a top-down model to track 11 Drosophila. I was able to complete the training-inference-proofreading cycle 3 times, so I now have 80 labeled frames. In this new cycle, the centroid model training stops partway through, gives an error, and then starts training the centered-instance model. I initially suspected it was caused by some frames I had forgotten to label, but after labelling everything the error persists. I also ran into a bug where TensorFlow was not recognizing my graphics card, but I managed to fix that by changing the secure boot configuration. The problem still continues, however.
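For reference, a minimal check (plain TensorFlow, nothing SLEAP-specific) to confirm that the GPU is visible inside the SLEAP environment after a driver or secure-boot change:

```python
# Run inside the SLEAP conda environment to confirm TensorFlow can see the GPU.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible:", gpus)  # expect one entry for the RTX 2070 when drivers are working
```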

I would really appreciate any help with this issue. Thank you in advance.


Resetting monitor window. Polling: /home/felippe/Documents/Sleap_project/models/210929_164539.centroid.n=80/viz/validation.*.png Start training centroid... ['sleap-train', '/tmp/tmpfjakit2k/210929_164539_training_job.json', '/home/felippe/Documents/Sleap_project/labels_28_09.v000.slp', '--zmq', '--save_viz', '--video-paths', '/home/felippe/Downloads/video 01_09_2021.avi'] 2021-09-29 16:45:40.794567: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 INFO:sleap.nn.training:Versions: SLEAP: 1.1.5 TensorFlow: 2.3.1 Numpy: 1.18.5 Python: 3.6.13 OS: Linux-5.11.0-36-generic-x86_64-with-debian-bullseye-sid INFO:sleap.nn.training:Training labels file: /home/felippe/Documents/Sleap_project/labels_28_09.v000.slp INFO:sleap.nn.training:Training profile: /tmp/tmpfjakit2k/210929_164539_training_job.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "/tmp/tmpfjakit2k/210929_164539_training_job.json", "labels_path": "/home/felippe/Documents/Sleap_project/labels_28_09.v000.slp", "video_paths": "/home/felippe/Downloads/video 01_09_2021.avi", "val_labels": null, "test_labels": null, "tensorboard": false, "save_viz": true, "zmq": true, "run_name": "", "prefix": "", "suffix": "", "cpu": false, "first_gpu": false, "last_gpu": false, "gpu": 0 } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": null, "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": null, "validation_inds": null, "test_inds": null, "search_path_hints": [], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 0.5, "pad_to_stride": null, "resize_and_pad_to_target": true, "target_height": null, "target_width": null }, "instance_cropping": { "center_on_part": "thorax", "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 16, "output_stride": 2, "filters": 16, "filters_rate": 2.0, "middle_block": true, "up_interpolate": true, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": { "anchor_part": "thorax", "sigma": 5.0, "output_stride": 2, "offset_refinement": false }, "centered_instance": null, "multi_instance": null } }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": true }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": null, "min_batches_per_epoch": 200, "val_batches_per_epoch": null, "min_val_batches_per_epoch": 10, "epochs": 10, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, 
"plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-06, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "210929_164539.centroid.n=80", "run_name_prefix": "", "run_name_suffix": "", "runs_folder": "/home/felippe/Documents/Sleap_project/models", "tags": [ "" ], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": true, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": true, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.1.5", "filename": "/tmp/tmpfjakit2k/210929_164539_training_job.json" } INFO:sleap.nn.training: 2021-09-29 16:45:41.633082: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1 2021-09-29 16:45:41.659625: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:41.660252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 with Max-Q Design computeCapability: 7.5 coreClock: 1.125GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 327.88GiB/s 2021-09-29 16:45:41.660324: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-09-29 16:45:41.662393: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-09-29 16:45:41.664129: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-09-29 16:45:41.664471: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-09-29 16:45:41.666402: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-09-29 16:45:41.667389: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-09-29 16:45:41.671003: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-09-29 16:45:41.671134: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:41.671582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:41.671920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 INFO:sleap.nn.training:Using GPU 0 for 
acceleration. INFO:sleap.nn.training:Disabled GPU memory pre-allocation. INFO:sleap.nn.training:System: GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... INFO:sleap.nn.training:Loading training labels from: /home/felippe/Documents/Sleap_project/labels_28_09.v000.slp INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1 INFO:sleap.nn.training: Splits: Training = 72 / Validation = 8. INFO:sleap.nn.training:Setting up for training... INFO:sleap.nn.training:Setting up pipeline builders... INFO:sleap.nn.training:Setting up model... INFO:sleap.nn.training:Building test pipeline... 2021-09-29 16:45:42.170050: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-09-29 16:45:42.193730: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2599990000 Hz 2021-09-29 16:45:42.194766: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d03ac0f100 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-09-29 16:45:42.194782: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-09-29 16:45:42.252891: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:42.253212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d03ac7ac40 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices: 2021-09-29 16:45:42.253225: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 2070 with Max-Q Design, Compute Capability 7.5 2021-09-29 16:45:42.254231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:42.254503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 with Max-Q Design computeCapability: 7.5 coreClock: 1.125GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 327.88GiB/s 2021-09-29 16:45:42.254580: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-09-29 16:45:42.254624: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-09-29 16:45:42.254657: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-09-29 16:45:42.254694: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-09-29 16:45:42.254726: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-09-29 16:45:42.254763: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-09-29 16:45:42.254795: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-09-29 16:45:42.254876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:42.255163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:42.255415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 2021-09-29 16:45:42.256195: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-09-29 16:45:42.714049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-09-29 16:45:42.714090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0 2021-09-29 16:45:42.714095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N 2021-09-29 16:45:42.714908: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:42.715250: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-09-29 16:45:42.715524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6848 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5) INFO:sleap.nn.training:Loaded test example. 
[2.307s] INFO:sleap.nn.training: Input shape: (1088, 2048, 3) INFO:sleap.nn.training:Created Keras model. INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False) INFO:sleap.nn.training: Max stride: 16 INFO:sleap.nn.training: Parameters: 1,953,393 INFO:sleap.nn.training: Heads: INFO:sleap.nn.training: [0] = CentroidConfmapsHead(anchor_part='thorax', sigma=5.0, output_stride=2, loss_weight=1.0) INFO:sleap.nn.training: Outputs: INFO:sleap.nn.training: [0] = Tensor("CentroidConfmapsHead_0/BiasAdd:0", shape=(None, 544, 1024, 1), dtype=float32) INFO:sleap.nn.training:Setting up data pipelines... INFO:sleap.nn.training:Training set: n = 72 INFO:sleap.nn.training:Validation set: n = 8 INFO:sleap.nn.training:Setting up optimization... INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08) INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10) INFO:sleap.nn.training:Setting up outputs... INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: ) INFO:sleap.nn.training: ZMQ controller subcribed to: tcp://127.0.0.1:9000 INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set INFO:sleap.nn.training: ZMQ progress reporter publish on: tcp://127.0.0.1:9001 INFO:sleap.nn.training:Created run path: /home/felippe/Documents/Sleap_project/models/210929_164539.centroid.n=80 INFO:sleap.nn.training:Setting up visualization... INFO:sleap.nn.training:Finished trainer set up. [5.7s] INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation... INFO:sleap.nn.training:Finished creating training datasets. [19.5s] INFO:sleap.nn.training:Starting training loop... Epoch 1/10 2021-09-29 16:46:09.139988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-09-29 16:46:10.662764: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256 2021-09-29 16:46:10.833879: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: Relying on driver to perform ptx compilation. Modify $PATH to customize ptxas location. This message will be only logged once. 2021-09-29 16:46:12.627279: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-09-29 16:46:16.619575: W tensorflow/core/common_runtime/bfc_allocator.cc:312] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature. 2021-09-29 16:46:34.070863: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.60GiB with freed_by_count=0. 
The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:46:39.675101: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.21GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:46:41.865086: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.41GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:46:42.566438: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 832.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:46:42.566458: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once. 2021-09-29 16:46:45.930982: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.41GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. WARNING:tensorflow:Callbacks method on_train_batch_end is slow compared to the batch time (batch time: 0.2279s vs on_train_batch_end time: 0.4963s). Check your callbacks. WARNING:tensorflow:Callbacks method on_test_batch_end is slow compared to the batch time (batch time: 0.0724s vs on_test_batch_end time: 0.1630s). Check your callbacks. 2021-09-29 16:50:24.897530: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 152.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:50:24.897611: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 28.75MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:50:25.027907: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 245.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:50:25.028058: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.02MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2021-09-29 16:50:25.028073: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.40MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 
200/200 - 382s - loss: 6.2910e-04 - val_loss: 8.9571e-05
Epoch 2/10
Polling: /home/felippe/Documents/Sleap_project/models/210929_164539.centroid.n=80/viz/validation.*.png
Run Path: /home/felippe/Documents/Sleap_project/models/210929_164539.centroid.n=80
Resetting monitor window.
Polling: /home/felippe/Documents/Sleap_project/models/210929_165857.centered_instance.n=80/viz/validation.*.png
Start training centered_instance...
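A rough back-of-envelope (my own estimate, not an exact profile) for why the allocator warnings above appear at this input size and batch size:

```python
# Approximate size of one full-resolution feature map in the centroid UNet
# (numbers taken from the log: input (1088, 2048, 3), batch_size 4, 16 base filters, float32).
batch, height, width, filters = 4, 1088, 2048, 16
bytes_per_float = 4

one_feature_map = batch * height * width * filters * bytes_per_float
print(f"One feature map: {one_feature_map / 2**30:.2f} GiB")  # ~0.53 GiB
# With two convs per block, skip connections, the decoder, and gradients kept for backprop,
# peak usage reaches several GiB, consistent with the 2-4 GiB allocation attempts logged above
# on a GPU reporting ~6.8 GiB of usable memory.
```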

talmo commented 3 years ago

Hi @Felippe-espinelli,

Thanks for all the info!

So when you say that it "stops" after the second or third epoch, does that mean it freezes (becomes unresponsive), crashes (closes the app), or seems fine but training doesn't progress (the timer at the top keeps going but the loss doesn't update)?

Cheers,

Talmo

Felippe-espinelli commented 3 years ago

Hi Talmo,

Thanks for the quick reply. After running for some time, the first model's training stops and a window appears with the following message: "An error occurred while training centroid. Your command line terminal may have more information about the error." When I press OK or close this error window, it jumps straight to training the centered-instance model. The training of that second model completes without any issue if I let it run.

Cheers, Felippe

Felippe-espinelli commented 3 years ago

Hi Talmo,

I think I found the issue while testing different parameters. I am using a 4K-resolution video for multi-animal tracking. Training the model was fine at first, but after 3 training sessions the project has accumulated many labeled frames to use during training, which is a lot of data. Since no informative error appeared, I created a new project with the same video and labeled frames to check whether there was a problem with the project itself. A new error then appeared in my console (sorry, I lost it): apparently, after some epochs the GPU (GeForce RTX 2070) runs out of memory, the training of the first model stops, and it goes straight to training the second model. I went back to the old project and reduced the batch size from 4 to 1. That apparently solved the issue and I can run the whole training again. Thank you for the help.
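For anyone hitting the same limit, a minimal sketch of the same fix applied to an exported training profile; the JSON keys match the training job shown in the log above, but the file path is hypothetical, and in the GUI this corresponds to simply lowering the batch size in the training settings:

```python
# Lower the batch size (and optionally the input scale) in a saved training profile.
import json

profile_path = "centroid_training_profile.json"  # hypothetical path to an exported profile
with open(profile_path) as f:
    cfg = json.load(f)

cfg["optimization"]["batch_size"] = 1  # was 4; 1 fits on the 8 GB RTX 2070 here
# Another lever from the same config, if needed:
# cfg["data"]["preprocessing"]["input_scaling"] = 0.25  # was 0.5

with open(profile_path, "w") as f:
    json.dump(cfg, f, indent=2)
```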

Cheers, Felippe