neuronets / nobrainer

A framework for developing neural network models for 3D image processing.

How to train GANs #198

Open wanderine opened 2 years ago

wanderine commented 2 years ago

I'm a bit confused: for the "previous" repository, progressivegan3d, there was a clear way to train a GAN with a single Python call. Now I get the impression that I need to copy a rather large chunk of code from a Jupyter notebook and write my own "trainer"?

satra commented 2 years ago

@wanderine - indeed alas yes, at the moment that is the case. we haven't moved the higher level abstractions yet. we were hoping to get a few more models in first before seeing what the top level abstractions would be. the easiest at this point is to save the notebook as a python file and remove the installation and data getting code, and adjust the parameters. so in essence the notebook is the training script for now.
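for example, something like this to do the conversion (a rough sketch; the notebook filename is just a placeholder for whichever GAN notebook you are following):

# convert the tutorial notebook to a plain python script; afterwards delete the
# pip-install and data-download sections by hand and adjust the parameters
import subprocess

subprocess.run(
    ["jupyter", "nbconvert", "--to", "script", "train_generation_progressivegan.ipynb"],
    check=True,
)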

let's keep this issue open so we can start building these abstractions, perhaps in a scikit-learn/nilearn-like way.

wanderine commented 2 years ago

Not sure if I should start a new issue, but I started a training run that crashed after some time; any idea what the problem is?

https://github.com/wanderine/ASSIST/blob/main/3DGANsegmentation/train_multigpu.py

Resolution : 8
Transition phase
9/12500 [..............................] - ETA: 6:16 - d_loss: -0.1399 - g_loss: 0.0138
2022-01-10 21:23:16.103202: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
12500/12500 [==============================] - 1969s 154ms/step - d_loss: -3.3371 - g_loss: 8.6324
Resolution phase
12500/12500 [==============================] - 1920s 154ms/step - d_loss: -3.3643 - g_loss: 8.8337
16
Resolution : 16
Transition phase
25000/25000 [==============================] - 5079s 202ms/step - d_loss: -18.1538 - g_loss: 58.2782
Resolution phase
25000/25000 [==============================] - 5186s 207ms/step - d_loss: -18.4769 - g_loss: 59.4532
32
Resolution : 32
Transition phase
50000/50000 [==============================] - 12971s 259ms/step - d_loss: -105.3842 - g_loss: 390.0997
Resolution phase
50000/50000 [==============================] - 13257s 265ms/step - d_loss: -108.4810 - g_loss: 401.6665
2022-01-11 08:36:35.887431: F tensorflow/stream_executor/cuda/cuda_dnn.cc:570] Check failed: cudnnSetTensorNdDescriptor(handle.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0) batch_descriptor: {count: 0 feature_map_count: 256 spatial: 4 4 4 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

satra commented 2 years ago

@alicebizeul and @wazeerzulfikar - any ideas?

wanderine commented 2 years ago

Since the first dataset did not include volumes at 4 x 4 x 4 resolution (mistake), I re-created the dataset to include 4 x 4 x 4 but now the training does not start

8
Resolution : 4
Transition phase
Traceback (most recent call last):
File "/scratch/local/nobrainer/train_multigpu.py", line 93, in <module>
    progressive_gan_trainer.fit(
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/training.py", line 136, in fit
    super().fit(*args, steps_per_epoch=steps_per_epoch, **kwargs)
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/tensorflow/python/framework/func_graph.py", line 1129, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/training.py", line 878, in train_function  *
    return step_function(self, iterator)
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/training.py", line 867, in step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/training.py", line 860, in run_step  **
    outputs = model.train_step(data)
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/training.py", line 86, in train_step
    reals_pred, labels_pred_real = self.discriminator([reals, alpha])
File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None

ValueError: Exception encountered when calling layer "discriminator" (type Discriminator).

in user code:

    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/models/progressivegan.py", line 374, in call  *
        return self.discriminator_head(x)
    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/models/progressivegan.py", line 320, in discriminator_head  *
        x = self.HeadDense1(x)
    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler  **
        raise e.with_traceback(filtered_tb) from None
    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/input_spec.py", line 247, in assert_input_compatibility
        raise ValueError(

    ValueError: Input 0 of layer "dense_1" is incompatible with the layer: expected axis -1 of input shape to have value 2048, but received input with shape (8, 256)

Call arguments received:
  • inputs=['tf.Tensor(shape=(8, 4, 4, 4, 1), dtype=float32)', 'tf.Tensor(shape=(1,), dtype=float32)']
wazeerzulfikar commented 2 years ago

@wanderine you were correct earlier: the training starts from 8x8x8 resolution by default (the generator and discriminator start at 4x4x4, but we add a resolution step up to 8 as the first step in training). As of now, that is the recommended setup.

With respect to the single train.py versus the notebook guide, the idea was to move to a tunable Keras-like API that is flexible and at the same time incorporates Keras-level features such as multi-GPU strategies and mixed-precision training. But as @satra mentioned, we can build higher-level, scikit-learn-style abstractions on top of this API.

As for the original error, I am looking into what may be wrong and will update here if I get ideas. Does it work on a single GPU? Does it work with the default cross_device_ops (NcclAllReduce)?
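For reference, the two configurations I mean would look roughly like this (plain TensorFlow; the device names are just examples):

import tensorflow as tf

# single GPU: restrict the strategy to one device (or drop the strategy entirely)
single_gpu_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"])

# multi-GPU with the default cross-device ops (NcclAllReduce)
default_strategy = tf.distribute.MirroredStrategy()

# multi-GPU with an explicit alternative to NCCL, e.g. hierarchical copy
hierarchical_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)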

wazeerzulfikar commented 2 years ago

Actually, I think it may be because of the relationship between batch size and the number of GPUs. Either a) nobrainer.dataset.get_dataset should be called within strategy.scope(), or b) the batch size needs to be greater than, and a multiple of, the number of GPUs (the comment there may be incorrect).
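Something along these lines, for example (a sketch; the get_dataset arguments and paths are placeholders, adjust them to your script):

import tensorflow as tf
import nobrainer

n_gpus = len(tf.config.list_physical_devices("GPU"))
batch_size = 8  # placeholder: should be a positive multiple of n_gpus
assert n_gpus == 0 or batch_size % n_gpus == 0

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # option (a): build the dataset inside the strategy scope
    dataset = nobrainer.dataset.get_dataset(
        file_pattern="data/data-train_shard-*.tfrec",  # placeholder path
        n_classes=1,
        batch_size=batch_size,
        volume_shape=(256, 256, 256),  # placeholder shape
    )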

wanderine commented 2 years ago

It breaks when starting at 64 cubed, where the batch size is set to 4, and you cannot divide 4 across 8 cards. But the batch size is 1 for 128 and 256, so those resolutions can then only use 1 card?

I will try with 4 cards
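A quick sanity check of the schedule (the per-resolution batch sizes below are hypothetical, following the pattern above):

# hypothetical per-resolution batch sizes; only resolutions whose batch size is
# divisible by the number of cards can use all of them
batch_sizes = {8: 32, 16: 16, 32: 8, 64: 4, 128: 1, 256: 1}
n_gpus = 4

for resolution, batch_size in batch_sizes.items():
    divisible = batch_size % n_gpus == 0
    print(f"{resolution}^3: batch size {batch_size}, usable on {n_gpus} cards: {divisible}")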


wanderine commented 2 years ago

The training seems to run now when the total batch size is divisible by the number of GPUs.

But the generated folder is empty: the run saves the weights but not any volumes. What do I need to change in the code to get 10-20 volumes per resolution?

https://github.com/wanderine/ASSIST/blob/main/3DGANsegmentation/train_multigpu.py

satra commented 2 years ago

something like this; update model_dir and latents to match your training setup.

from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
import nibabel as nib
import tensorflow as tf
from nilearn import plotting

model_dir = Path("trained-models/neuronets/braingen/0.1.0")
latents = tf.random.normal((1, 1024))
model_paths = model_dir.glob("generator_res*")
fig, ax = plt.subplots(6, 1, figsize=(18, 30))
index = 0

# Since each generator continues training, the same latent will give rise to
# different fake brains for each generator.
for model_path in sorted(model_paths, key=lambda x: int(x.name.split("_")[-1])):
    generator = tf.saved_model.load(str(model_path))
    generate = generator.signatures["serving_default"]
    img = generate(latents)["generated"]
    img = np.squeeze(img)
    img = nib.Nifti1Image(img.astype(np.uint8), np.eye(4))
    plotting.plot_anat(anat_img=img, figure=fig, axes=ax[index],
                       draw_cross=False,
                       title=model_path.name.split("_")[-1])
    index += 1
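
and to write a handful of volumes per resolution to disk instead of plotting them, something like this (same model layout and latent size as above; the output directory and sample count are arbitrary):

from pathlib import Path

import nibabel as nib
import numpy as np
import tensorflow as tf

model_dir = Path("trained-models/neuronets/braingen/0.1.0")
out_dir = Path("generated")  # arbitrary output directory
out_dir.mkdir(exist_ok=True)
n_samples = 10  # e.g. 10-20 volumes per resolution

for model_path in sorted(model_dir.glob("generator_res*"),
                         key=lambda x: int(x.name.split("_")[-1])):
    resolution = model_path.name.split("_")[-1]
    generate = tf.saved_model.load(str(model_path)).signatures["serving_default"]
    for i in range(n_samples):
        latents = tf.random.normal((1, 1024))
        img = np.squeeze(generate(latents)["generated"])
        nifti = nib.Nifti1Image(img.astype(np.uint8), np.eye(4))
        nib.save(nifti, str(out_dir / f"res-{resolution}_sample-{i:02d}.nii.gz"))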
wazeerzulfikar commented 2 years ago

@wanderine

The training seems to run now when the total batch size is divisible by the number of GPUs.

Is this the case if nobrainer.dataset.get_dataset is called within the strategy.scope()?

wanderine commented 2 years ago

The max value 255 seems to be hardcoded in trainer.py, which assumes that the data is uint8; NIfTI files with values larger than 255 will be clipped to 255. In my opinion 255 should not be hard-coded. Rather, the code should use the maximum over the entire dataset, or something like the 99th percentile to make it less sensitive to outliers, or let the user specify a value.
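For illustration, the scaling could be computed along these lines (a rough sketch, not the existing nobrainer code; how volumes are loaded and which percentile to use is up to the implementation):

import numpy as np


def estimate_max_intensity(volumes, percentile=99.0):
    """Estimate an intensity ceiling from the data instead of assuming uint8 / 255."""
    values = np.concatenate([np.asarray(v).ravel() for v in volumes])
    return float(np.percentile(values, percentile))


def normalize(volume, max_intensity):
    """Clip to the estimated ceiling and rescale to [0, 1]."""
    return np.clip(volume, 0, max_intensity) / max_intensity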

satra commented 2 years ago

@wanderine - if you want to try a simpler api, you could use the ongoing work in the new api branch.

specifically take a look at this notebook