wanderine opened this issue 2 years ago
@wanderine - indeed alas yes, at the moment that is the case. we haven't moved the higher level abstractions yet. we were hoping to get a few more models in first before seeing what the top level abstractions would be. the easiest at this point is to save the notebook as a python file and remove the installation and data getting code, and adjust the parameters. so in essence the notebook is the training script for now.
let's keep this issue open so we can start building these abstractions, perhaps in a scikit-learn/nilearn-like way.
Not sure if I should start a new issue but I started a training that crashed after some time, any idea what the problem is?
https://github.com/wanderine/ASSIST/blob/main/3DGANsegmentation/train_multigpu.py
Resolution : 8
Transition phase
9/12500 [..............................] - ETA: 6:16 - d_loss: -0.1399 - g_loss: 0.0138
2022-01-10 21:23:16.103202: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
12500/12500 [==============================] - 1969s 154ms/step - d_loss: -3.3371 - g_loss: 8.6324
Resolution phase
12500/12500 [==============================] - 1920s 154ms/step - d_loss: -3.3643 - g_loss: 8.8337
16
Resolution : 16
Transition phase
25000/25000 [==============================] - 5079s 202ms/step - d_loss: -18.1538 - g_loss: 58.2782
Resolution phase
25000/25000 [==============================] - 5186s 207ms/step - d_loss: -18.4769 - g_loss: 59.4532
32
Resolution : 32
Transition phase
50000/50000 [==============================] - 12971s 259ms/step - d_loss: -105.3842 - g_loss: 390.0997
Resolution phase
50000/50000 [==============================] - 13257s 265ms/step - d_loss: -108.4810 - g_loss: 401.6665
2022-01-11 08:36:35.887431: F tensorflow/stream_executor/cuda/cuda_dnn.cc:570] Check failed: cudnnSetTensorNdDescriptor(handle.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)
batch_descriptor: {count: 0 feature_map_count: 256 spatial: 4 4 4 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
@alicebizeul and @wazeerzulfikar - any ideas?
Since the first dataset did not include volumes at 4 x 4 x 4 resolution (my mistake), I re-created the dataset to include 4 x 4 x 4, but now the training does not start:
8
Resolution : 4
Transition phase
Traceback (most recent call last):
  File "/scratch/local/nobrainer/train_multigpu.py", line 93, in <module>
  File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/training.py", line 878, in train_function *
    return step_function(self, iterator)
  File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/training.py", line 867, in step_function **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/training.py", line 860, in run_step **
    outputs = model.train_step(data)
  File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/training.py", line 86, in train_step
    reals_pred, labels_pred_real = self.discriminator([reals, alpha])
  File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
ValueError: Exception encountered when calling layer "discriminator" (type Discriminator).

in user code:

    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/models/progressivegan.py", line 374, in call *
        return self.discriminator_head(x)
    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/nobrainer/models/progressivegan.py", line 320, in discriminator_head *
        x = self.HeadDense1(x)
    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler **
        raise e.with_traceback(filtered_tb) from None
    File "/proj/assist/users/x_anekl/MITPGAN3DCUDA11_env/lib/python3.9/site-packages/keras/engine/input_spec.py", line 247, in assert_input_compatibility
        raise ValueError(

    ValueError: Input 0 of layer "dense_1" is incompatible with the layer: expected axis -1 of input shape to have value 2048, but received input with shape (8, 256)

Call arguments received:
  • inputs=['tf.Tensor(shape=(8, 4, 4, 4, 1), dtype=float32)', 'tf.Tensor(shape=(1,), dtype=float32)']
@wanderine you were correct earlier: the training starts from 8x8x8 resolution by default (the generator and discriminator start at 4x4x4, but we add a resolution step to 8 as the first step in training). For now, starting at 8x8x8 is what we would recommend.
With respect to a single train.py vs. a notebook guide, the idea was to provide a tunable, Keras-like API that is flexible while still incorporating Keras-level features such as the multi-GPU strategy and mixed-precision training. But as @satra mentioned, we can build higher-level, scikit-learn-like abstractions on top of this API.
As for the original error, I am looking into what may be wrong and will update here if I get ideas.
Does it work on a single GPU? Does it work with the default cross_device_ops (NcclAllReduce)?
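For reference, a minimal sketch of how the cross-device ops can be set explicitly when building the strategy (this uses the standard TensorFlow API; the choice of reduction is the only thing being illustrated):

import tensorflow as tf

# Default cross-device communication: NCCL all-reduce across the GPUs.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

# An alternative worth trying if NCCL itself is the problem:
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())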
Actually, I think it may be because of the relationship between the batch size and the number of GPUs. Either:
a) nobrainer.dataset.get_dataset should be called within the strategy.scope(), or
b) the batch size needs to be greater than, and a multiple of, the number of GPUs (the comment there may be incorrect).
A sketch of both suggestions follows below.
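A minimal sketch of both points, assuming a MirroredStrategy; the get_dataset keyword names and the file pattern here are illustrative, so adjust them to match the actual call in train_multigpu.py:

import tensorflow as tf
import nobrainer

strategy = tf.distribute.MirroredStrategy()
n_gpus = strategy.num_replicas_in_sync

# (b) keep the global batch size a positive multiple of the number of GPUs,
# so every replica receives a non-empty shard of each batch.
batch_size = 2 * n_gpus

# (a) build the dataset inside the strategy scope.
with strategy.scope():
    dataset = nobrainer.dataset.get_dataset(
        file_pattern="data/tfrecords/*res-008*.tfrec",  # illustrative pattern
        n_classes=1,                                    # illustrative
        batch_size=batch_size,
        volume_shape=(8, 8, 8),
    )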
It breaks when starting at 64³, where the batch size is set to 4, and you cannot divide 4 across 8 cards. But the batch size is 1 for 128³ and 256³, so those can then only run on 1 card?
I will try with 4 cards
The training seems to run now when the total batch size is divisible by the number of GPUs.
But the generated folder is empty: the weights are saved, but no volumes. What do I need to change in the code to get 10-20 volumes per resolution?
https://github.com/wanderine/ASSIST/blob/main/3DGANsegmentation/train_multigpu.py
something like this; update model_dir and latents to match how they were set in your training.
from pathlib import Path

import matplotlib.pyplot as plt
import nibabel as nib
import numpy as np
import tensorflow as tf
from nilearn import plotting

# Point this at your model directory and match the latent size used in training.
model_dir = Path("trained-models/neuronets/braingen/0.1.0")
latents = tf.random.normal((1, 1024))
model_paths = model_dir.glob("generator_res*")

fig, ax = plt.subplots(6, 1, figsize=(18, 30))
index = 0

# Since each generator continues training, the same latent will give rise to
# different fake brains for each generator.
for model_path in sorted(model_paths, key=lambda x: int(x.name.split("_")[-1])):
    generator = tf.saved_model.load(str(model_path))
    generate = generator.signatures["serving_default"]
    img = generate(latents)["generated"]
    img = np.squeeze(img)
    img = nib.Nifti1Image(img.astype(np.uint8), np.eye(4))
    plotting.plot_anat(anat_img=img, figure=fig, axes=ax[index],
                       draw_cross=False,
                       title=model_path.name.split("_")[-1])
    index += 1
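Since the question was about getting 10-20 volumes per resolution into the generated folder, here is a variant of the same loop that writes NIfTI files to disk instead of plotting them; the output directory, the count of 10, and the latent size are illustrative and should be adjusted to your training run:

from pathlib import Path

import nibabel as nib
import numpy as np
import tensorflow as tf

model_dir = Path("trained-models/neuronets/braingen/0.1.0")  # adjust to your run
out_dir = Path("generated")
out_dir.mkdir(exist_ok=True)

for model_path in sorted(model_dir.glob("generator_res*"),
                         key=lambda x: int(x.name.split("_")[-1])):
    resolution = model_path.name.split("_")[-1]
    generate = tf.saved_model.load(str(model_path)).signatures["serving_default"]
    for i in range(10):  # 10 volumes per resolution
        latents = tf.random.normal((1, 1024))  # must match the training latent size
        img = np.squeeze(generate(latents)["generated"])
        nib.save(nib.Nifti1Image(img.astype(np.uint8), np.eye(4)),
                 str(out_dir / f"fake_res{resolution}_{i}.nii.gz"))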
@wanderine

> The training seems to run now when the total batch size is divisible by the number of GPUs.

Is this the case if nobrainer.dataset.get_dataset is called within the strategy.scope()?
The max value 255 seems to be hardcoded in trainer.py, thereby assuming that the data is uint8; NIfTI files with values larger than 255 will be clipped to 255. In my opinion 255 should not be hardcoded. Rather, the code should use the maximum over the entire dataset, or something like the 99th percentile to make it less sensitive to outliers, or let the user specify a value.
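A minimal sketch of that idea (the function name and percentile default are mine, not trainer.py's):

import numpy as np

def scale_intensities(volume, vmax=None, percentile=99.0):
    """Scale a volume to [0, 1] without assuming uint8 data.

    vmax: user-specified maximum; when None, fall back to a robust
    dataset statistic instead of a hard-coded 255.
    """
    if vmax is None:
        vmax = np.percentile(volume, percentile)  # robust to outliers
    return np.clip(volume / vmax, 0.0, 1.0)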
@wanderine - if you want to try a simpler api, you could use the ongoing work in the new api branch.
specifically, take a look at this notebook
I'm a bit confused: for the "previous" repository, progressivegan3d, there was a clear way to train a GAN using a single Python call. Now I get the impression that I need to copy a rather large chunk of code from a Jupyter notebook and make my own "trainer"?