talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

training error: "ValueError: need at least one array to stack" #555

Closed vcorbit closed 3 years ago

vcorbit commented 3 years ago

I am trying to train a network on 201 labeled frames on the TigerGPU server. I previously trained this network with 100 labeled frames, but I added some more training data to improve the tracking. I am using the exact same methods and scripts I used in the past for this network and others, but now I'm getting this error:

2021-05-28 11:35:47.576700: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cudnn/cuda-10.2/7.6.5/lib64:/usr/local/cuda-10.2/lib64:/usr/local/cudnn/cuda-9.2/7.3.1/lib64:/usr/local/cuda-9.2/lib64:/usr/lib64/nvidia 2021-05-28 11:35:47.581465: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cudnn/cuda-10.2/7.6.5/lib64:/usr/local/cuda-10.2/lib64:/usr/local/cudnn/cuda-9.2/7.3.1/lib64:/usr/local/cuda-9.2/lib64:/usr/lib64/nvidia 2021-05-28 11:35:47.583481: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 
INFO:sleap.nn.training:Training labels file: /tigress/vcorbit/SLEAP/BarnesEtho-it3.h5 INFO:sleap.nn.training:Training profile: /tigress/vcorbit/SLEAP/single_instance.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "/tigress/vcorbit/SLEAP/single_instance.json", "labels_path": "/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5", "video_paths": "", "val_labels": null, "test_labels": null, "tensorboard": false, "save_viz": false, "zmq": false, "run_name": "", "prefix": "", "suffix": "" } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": null, "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "search_path_hints": [], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 1.0, "pad_to_stride": null }, "instance_cropping": { "center_on_part": null, "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 32, "output_stride": 4, "filters": 16, "filters_rate": 2.0, "middle_block": true, "up_interpolate": true, "stacks": 1 }, "hourglass": null, "resnet": null }, "heads": { "single_instance": { "part_names": null, "sigma": 5.0, "output_stride": 4 }, "centroid": null, "centered_instance": null, "multi_instance": null } }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": false, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": true, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, 
"brightness_max_val": 10.0 }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": null, "min_batches_per_epoch": 200, "val_batches_per_epoch": null, "min_val_batches_per_epoch": 10, "epochs": 100, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": true, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-06, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "201005_161019", "run_name_prefix": "", "run_name_suffix": "", "runs_folder": "", "tags": [ "" ], "save_visualizations": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": false, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": false, "publish_address": "tcp://127.0.0.1:9001" } } } INFO:sleap.nn.training: INFO:sleap.nn.training:System: 2021-05-28 11:36:00.443236: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2021-05-28 11:36:00.466359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:03:00.0 name: NVIDIA Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-05-28 11:36:00.472854: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2021-05-28 11:36:00.537496: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2021-05-28 11:36:00.564469: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2021-05-28 11:36:00.572660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2021-05-28 11:36:00.621135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2021-05-28 11:36:00.631156: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2021-05-28 11:36:00.724008: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2021-05-28 11:36:00.726861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0 GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... 
video search paths: [''] [Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video0/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video1/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video2/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video3/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video4/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video5/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video6/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video7/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video8/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video9/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video10/video', input_format='channels_last', convert_range=False)), Video(backend=HDF5Video(filename='/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5', dataset='video11/video', input_format='channels_last', convert_range=False))] INFO:sleap.nn.training:Setting up for training... INFO:sleap.nn.training:Setting up pipeline builders... 
INFO:sleap.nn.training:Setting up model... INFO:sleap.nn.training:Building test pipeline... 2021-05-28 11:36:01.176030: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2021-05-28 11:36:01.183094: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399990000 Hz 2021-05-28 11:36:01.186255: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557c0e2bfd10 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-05-28 11:36:01.187960: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-05-28 11:36:01.589651: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557c0e346580 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2021-05-28 11:36:01.594955: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tesla P100-PCIE-16GB, Compute Capability 6.0 2021-05-28 11:36:01.606790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:03:00.0 name: NVIDIA Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-05-28 11:36:01.616585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2021-05-28 11:36:01.620712: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2021-05-28 11:36:01.624765: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2021-05-28 11:36:01.628526: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2021-05-28 11:36:01.629738: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2021-05-28 11:36:01.633449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2021-05-28 11:36:01.635787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2021-05-28 11:36:01.639279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0 2021-05-28 11:36:01.640199: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2021-05-28 11:36:01.643163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-05-28 11:36:01.644048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 2021-05-28 11:36:01.644806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N 2021-05-28 11:36:01.648283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15224 MB memory) -> physical GPU (device: 0, name: NVIDIA Tesla P100-PCIE-16GB, pci bus id: 0000:03:00.0, compute capability: 6.0) INFO:sleap.nn.training:Loaded test example. [2.727s] INFO:sleap.nn.training: Input shape: (480, 704, 3) INFO:sleap.nn.training:Created Keras model. INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=5, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False) INFO:sleap.nn.training: Max stride: 32 INFO:sleap.nn.training: Parameters: 7,816,403 INFO:sleap.nn.training: Heads: INFO:sleap.nn.training: heads[0] = SingleInstanceConfmapsHead(part_names=['snout', 'bodycenter', 'tailbase'], sigma=5.0, output_stride=4, loss_weight=1.0) INFO:sleap.nn.training:Setting up data pipelines... 
INFO:sleap.nn.training:Training set: n = 180 INFO:sleap.nn.training:Validation set: n = 21 INFO:sleap.nn.training:Setting up optimization... INFO:sleap.nn.training: OHKM enabled: HardKeypointMiningConfig(online_mining=True, hard_to_easy_ratio=2.0, min_hard_keypoints=2, max_hard_keypoints=None, loss_scale=5.0) INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08) INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10) INFO:sleap.nn.training:Setting up outputs... INFO:sleap.nn.training:Created run path: 201005_161019 INFO:sleap.nn.training:Setting up visualization... Unable to use Qt backend for matplotlib. This probably means Qt is running headless. INFO:sleap.nn.training:Finished trainer set up. [31.0s] INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation... 2021-05-28 11:36:34.860919: W tensorflow/core/framework/op_kernel.cc:1643] Invalid argument: ValueError: need at least one array to stack Traceback (most recent call last):

  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 123, in __call__
    ret = self._func(*args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/data/providers.py", line 141, in py_fetch_lf
    [inst.points_array.astype("float32") for inst in lf.instances], axis=0
  File "<__array_function__ internals>", line 6, in stack
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

2021-05-28 11:36:34.868575: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at iterator_ops.cc:941 : Invalid argument: ValueError: need at least one array to stack Traceback (most recent call last):

  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 123, in __call__
    ret = self._func(*args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/data/providers.py", line 141, in py_fetch_lf
    [inst.points_array.astype("float32") for inst in lf.instances], axis=0
  File "<__array_function__ internals>", line 6, in stack
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

 [[{{node EagerPyFunc}}]]

Traceback (most recent call last): File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 1897, in execution_mode yield File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 659, in _next_internal output_shapes=self._flat_output_shapes) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2479, in iterator_get_next_sync _ops.raise_from_not_ok_status(e, name) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status six.raise_from(core._status_to_exception(e.code, message), None) File "", line 3, in raise_from tensorflow.python.framework.errors_impl.InvalidArgumentError: ValueError: need at least one array to stack Traceback (most recent call last):

  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 123, in __call__
    ret = self._func(*args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/data/providers.py", line 141, in py_fetch_lf
    [inst.points_array.astype("float32") for inst in lf.instances], axis=0
  File "<__array_function__ internals>", line 6, in stack
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

 [[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/vcorbit/.conda/envs/sleap_dev_env/bin/sleap-train", line 8, in sys.exit(main()) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/training.py", line 1375, in main trainer.train() File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/training.py", line 799, in train training_ds = self.training_pipeline.make_dataset() File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/data/pipelines.py", line 275, in make_dataset ds = transformer.transform_dataset(ds) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/data/dataset_ops.py", line 319, in transform_dataset self.examples = list(iter(ds_input)) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 630, in next return self.next() File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 674, in next return self._next_internal() File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 665, in _next_internal return structure.from_compatible_tensor_list(self._element_spec, ret) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/contextlib.py", line 99, in exit self.gen.throw(type, value, traceback) File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 1900, in execution_mode executor_new.wait() File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/eager/executor.py", line 67, in wait pywrap_tensorflow.TFE_ExecutorWaitForAllPendingNodes(self._handle) tensorflow.python.framework.errors_impl.InvalidArgumentError: ValueError: need at least one array to stack Traceback (most recent call last):

  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 123, in __call__
    ret = self._func(*args)
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/sleap/nn/data/providers.py", line 141, in py_fetch_lf
    [inst.points_array.astype("float32") for inst in lf.instances], axis=0
  File "<__array_function__ internals>", line 6, in stack
  File "/home/vcorbit/.conda/envs/sleap_dev_env/lib/python3.6/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

 [[{{node EagerPyFunc}}]]

Can you help me understand what the issue is? The first iteration of training worked fine, and I used this same batch script to run training for another network, which works as well. I can't figure out what is wrong.

arie-matsliah commented 3 years ago

Hi @vcorbit, can you confirm that these files exist and that their contents are valid? "training_job_path": "/tigress/vcorbit/SLEAP/single_instance.json" and "labels_path": "/tigress/vcorbit/SLEAP/BarnesEtho-it3.h5".

I've never seen this error before ("need at least one array to stack") but from Googling it I think it means that data could not be loaded.
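For reference, the message itself comes from NumPy rather than from SLEAP: `np.stack` raises it whenever it is handed an empty sequence. The traceback shows the call in `py_fetch_lf` stacking one array per entry in `lf.instances`, so the list comprehension must have produced no arrays for at least one frame. A minimal sketch reproducing just the NumPy behavior (the variable names here are hypothetical, not SLEAP's):

```python
import numpy as np

# Hypothetical stand-in for the per-frame list of instance point arrays.
# An empty list models a labeled frame that contributes no instances.
points_per_instance = []

try:
    np.stack(points_per_instance, axis=0)
except ValueError as err:
    # Same ValueError text as in the training log above.
    message = str(err)
    print(message)

# With at least one (n_nodes, 2) array, stacking succeeds.
one_instance = [np.zeros((3, 2), dtype="float32")]
stacked = np.stack(one_instance, axis=0)
print(stacked.shape)
```

This only explains where the message originates; it does not say why a frame ended up with no instances.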

vcorbit commented 3 years ago

Hi, yes, I can confirm those files exist and seem to be fine. I use the same "single_instance.json" for all of my training, and it works fine with another training dataset. I also opened the training dataset after I created it, and it opened just fine in the SLEAP GUI; all seemed well.

vcorbit commented 3 years ago

Update: I went through every frame in the training data and checked to make sure everything looked right. Didn't see any issues and reran the training, got the same error. I also tried running it with all the predicted frames deleted (only keeping the frames I manually labeled), and tried running it with all the predicted frames. Both times got that error. I agree it seems like some data is missing (?) but I really can't figure out why, since everything looks good to me.
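Since checking frames by eye did not turn anything up, one option is to scan for empty frames programmatically. The sketch below is an assumption-laden illustration: `FrameStub` and `find_empty_frames` are hypothetical names, and the only thing taken from the traceback is that each frame object exposes an `instances` list.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class FrameStub:
    """Hypothetical stand-in for a labeled frame; only `instances` matters here."""

    frame_idx: int
    instances: List[object] = field(default_factory=list)


def find_empty_frames(labeled_frames: Iterable) -> List[int]:
    """Return the positions of frames whose `instances` list is empty."""
    return [i for i, lf in enumerate(labeled_frames) if len(lf.instances) == 0]


# Three stub frames; the second one has no instances.
frames = [FrameStub(0, ["a"]), FrameStub(7, []), FrameStub(12, ["b", "c"])]
empty = find_empty_frames(frames)
print(empty)
```

With SLEAP installed, the same function could presumably be applied to the labeled frames of a loaded labels file (for example via `sleap.load_file(...)` and its `labeled_frames`, assuming that API is available in the installed version). Any positions it reports would be the frames to inspect first.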

yotamSagiv commented 3 years ago

Have there been any developments on this? I had the same error, just trying to run training directly from the GUI for the first time on some hand-labelled images.

arie-matsliah commented 3 years ago

Hi, I will look into this and update. If you can share the exact steps to reproduce it, that will help. Thanks.

AbedNashef commented 3 years ago

Hi, I have a similar problem to @yotamSagiv. I followed the directions in the tutorial on sleap.ai, labeled ~20 frames (I also tried 10-50 frames, just in case), and when I run the training, it is cut short with an error dialog box. In the terminal I see a similar error to @vcorbit's: ValueError: need at least one array to stack. One difference from the tutorial is that I'm labeling frames from a single animal, with a sigma for nodes of 2.5. After each run, a new folder with 5 files is created in the models folder: initial_config.json, labels_gt.train.slp, labels_gt.val.slp, training_config.json, and an empty training_log.csv. Thank you!

yotamSagiv commented 3 years ago

For me it is similar. I labelled 250 frames by hand across 5 videos imported into the project. Running training using the single_instance method directly from the GUI yields the error. Exporting the training job package and training on Colab also fails for the same reason.

Picking different videos and labelling 100 frames afterwards didn't cause the error. I don't know what the difference is.

talmo commented 3 years ago

Hi guys,

Sorry for the delay -- I've been moving across coasts this week and it's been a bit hectic.

It sounds like the common thread here is that everyone's running into issues with training single animal models. Our internal tests indicate that training this model type is working, but I'm guessing there's something we're not checking for.

Just tried training a single instance model myself from scratch and can't seem to reproduce this error.

@yotamSagiv @AbedNashef @vcorbit: would any of you be able to export your training job package (Predict -> Run Training... -> Export training job package...) and send it to sleap@princeton.edu?

Thanks!!

Talmo

AbedNashef commented 3 years ago

Hi Talmo, I sent you my training package. Thanks for the response and the help!

talmo commented 3 years ago

Quick update: Identified the bug and will be pushing out a new version with a fix soon.

Will post an update here when it's up.

Thanks guys!

talmo commented 3 years ago

This is now fixed in SLEAP v1.1.5: https://github.com/murthylab/sleap/releases/tag/v1.1.5

Closing but feel free to reply if this is still an issue.
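For anyone landing here later, a quick way to tell whether an environment predates the fix is to compare its version string against 1.1.5. The helper below is a rough sketch with hypothetical names (`parse_version`, `has_fix`); it only reads the leading digits of each dotted component, so a pre-release such as "1.1.5a1" would be treated the same as "1.1.5".

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # Stop at the first component with no leading digits.
        parts.append(int(match.group()))
    return tuple(parts)


def has_fix(installed: str, fixed_in: str = "1.1.5") -> bool:
    """Return True if `installed` is at or above the release named in this thread."""
    return parse_version(installed) >= parse_version(fixed_in)


for candidate in ["1.1.4", "1.1.5", "1.2.0a6"]:
    print(candidate, has_fix(candidate))
```

The installed version string could be passed in from wherever it is available, for example `sleap.__version__` if the installed package exposes it.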