uhlmanngroup / splinedist

SplineDist: Automated Cell Segmentation with Spline Curves

Splinedist training very slow #4

Closed · paul-hernandez-herrera closed this issue 2 years ago

paul-hernandez-herrera commented 2 years ago

I am training SplineDist locally (Ubuntu 20.04, TensorFlow 2.7.0), following an implementation similar to training.ipynb (using the code from ZeroCostDL4Mic).

I have 8 images of 1024x1024 pixels for training and 2 for validation, with around 2000 cells per image. I am using augmentation = 5 and the following config:

Config2D(axes='YXC', backbone='unet', contoursize_max=62, grid=(2, 2), n_channel_in=1, n_channel_out=33, n_dim=2, n_params=32, net_conv_after_unet=128, net_input_shape=(None, None, 1), net_mask_shape=(None, None, 1), train_background_reg=0.0001, train_batch_size=2, train_checkpoint='weights_best.h5', train_checkpoint_epoch='weights_now.h5', train_checkpoint_last='weights_last.h5', train_completion_crop=32, train_dist_loss='mae', train_epochs=400, train_foreground_only=0.9, train_learning_rate=0.0003, train_loss_weights=(1, 0.2), train_n_val_patches=None, train_patch_size=(1024, 1024), train_reduce_lr={'factor': 0.5, 'patience': 10, 'min_delta': 0}, train_shape_completion=False, train_steps_per_epoch=100, train_tensorboard=True, unet_activation='relu', unet_batch_norm=False, unet_dropout=0.0, unet_kernel_size=(3, 3), unet_last_activation='relu', unet_n_conv_per_depth=2, unet_n_depth=3, unet_n_filter_base=32, unet_pool=(2, 2), unet_prefix='', use_gpu=False)

I am training the model with: history = model.train(X_trn, Y_trn, validation_data=(X_val, Y_val), augmenter=augmenter, epochs=number_of_epochs, steps_per_epoch=number_of_steps), where number_of_epochs = 200 and number_of_steps = 25.
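
For reference, the setup described above corresponds roughly to the sketch below (assembled from the snippets in this issue; X_trn, Y_trn, X_val, Y_val, and augmenter are assumed to be prepared as in training.ipynb, and the model name/basedir are placeholders):

```python
from splinedist.models import Config2D, SplineDist2D

# Config as posted above (abridged to the key fields):
# 1024x1024 patches, n_params = 32, i.e. 16 control points.
conf = Config2D(
    n_params=32,
    grid=(2, 2),
    n_channel_in=1,
    contoursize_max=62,
    train_patch_size=(1024, 1024),
    train_batch_size=2,
)

# Placeholder name/basedir; adjust to your own setup.
model = SplineDist2D(conf, name='splinedist_example', basedir='models')

number_of_epochs = 200
number_of_steps = 25

# X_trn, Y_trn, X_val, Y_val and augmenter prepared beforehand as in training.ipynb.
history = model.train(X_trn, Y_trn,
                      validation_data=(X_val, Y_val),
                      augmenter=augmenter,
                      epochs=number_of_epochs,
                      steps_per_epoch=number_of_steps)
```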

Training is taking a very long time per epoch. Is this training time normal, or is there an issue?

Epoch 1/200
25/25 [==============================] - 885s 36s/step - loss: 1.2870 - prob_loss: 0.5004 - dist_loss: 3.9330 - prob_kld: 0.4349 - dist_relevant_mae: 3.8846 - dist_relevant_mse: 24.1046 - val_loss: 1.2564 - val_prob_loss: 0.3811 - val_dist_loss: 4.3764 - val_prob_kld: 0.2554 - val_dist_relevant_mae: 4.3309 - val_dist_relevant_mse: 29.6462 - lr: 3.0000e-04
Epoch 2/200
25/25 [==============================] - 860s 35s/step - loss: 0.9492 - prob_loss: 0.2596 - dist_loss: 3.4482 - prob_kld: 0.1945 - dist_relevant_mae: 3.3998 - dist_relevant_mse: 19.5444 - val_loss: 1.1119 - val_prob_loss: 0.3899 - val_dist_loss: 3.6100 - val_prob_kld: 0.2642 - val_dist_relevant_mae: 3.5645 - val_dist_relevant_mse: 21.3455 - lr: 3.0000e-04
Epoch 3/200
25/25 [==============================] - 898s 35s/step - loss: 0.8387 - prob_loss: 0.2387 - dist_loss: 3.0003 - prob_kld: 0.1708 - dist_relevant_mae: 2.9519 - dist_relevant_mse: 15.1893 - val_loss: 0.9823 - val_prob_loss: 0.3285 - val_dist_loss: 3.2689 - val_prob_kld: 0.2028 - val_dist_relevant_mae: 3.2234 - val_dist_relevant_mse: 17.6507 - lr: 3.0000e-04

sohmandal commented 2 years ago

Hi, thank you for your interest in SplineDist. To diagnose the issue you have been facing, could you please consider the following questions and comments:

  1. Have you tried using the implementation of SplineDist from this repo? If yes, do you observe similar training times with your data?
  2. Have you considered using a smaller patch size than 1024x1024 for training? A smaller patch size should reduce the training time, and the results may be of equivalent quality depending on your data.
  3. From the config you have shared, it seems that you are using 16 control points. It might be worth testing SplineDist with a lower number of control points, which should also reduce the training time. Note, however, that the overall quality of the results might suffer if the structures present in your data are too complex to model with high precision using that few control points. (A sketch of such a reduced configuration follows this list.)
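
For illustration only, a minimal sketch of points 2 and 3 above (not an official recipe; the exact values, model name, and basedir are placeholders to adapt to your data):

```python
from splinedist.models import Config2D, SplineDist2D

# Fewer control points: n_params = 2 * M, so M = 8 control points here
# instead of the 16 (n_params = 32) used in the original config.
M = 8

conf = Config2D(
    n_params=2 * M,
    grid=(2, 2),
    n_channel_in=1,
    contoursize_max=62,           # keep the value estimated from your training masks
    train_patch_size=(256, 256),  # smaller patches than 1024x1024 -> faster steps
    train_batch_size=2,
)

# Note: when changing M, the training notebook also regenerates the
# corresponding phi/grid .npy files (see splinedist.utils) before
# creating the model.
model = SplineDist2D(conf, name='splinedist_reduced', basedir='models')
```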
sohmandal commented 2 years ago

Closing because of inactivity.