neuronets / nobrainer

A framework for developing neural network models for 3D image processing.

Kernel Crash During Transfer Learning #81

Open hebbianloop opened 4 years ago

hebbianloop commented 4 years ago

I am attempting to perform transfer learning on the existing nobrainer model weights by synthesizing a training dataset compiled from manual edits to the brain mask.

My first attempt made it through to epoch 4/5 before the kernel crashed. I've tried rerunning the code multiple times with smaller datasets and different learning rates, but I keep getting the same error message:

Train for 1296 steps, validate for 80 steps
2019-12-06 10:47:11.520055: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 1 of 10
2019-12-06 10:47:12.065779: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
Killed: 9
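
(Killed: 9 means the process received SIGKILL, which on macOS usually indicates the OS reclaimed memory under pressure. Each uncompressed 256³ float32 volume is about 67 MB, so the input pipeline can hold several GB at once; a rough back-of-the-envelope estimate, assuming float32 features and labels:)

import numpy as np
vol_mb = np.prod((256, 256, 256)) * 4 / 1e6  # one float32 volume: ~67 MB
# shuffle buffer of 10 plus 24 parallel reads, features and labels each:
print(f"~{(10 + 24) * 2 * vol_mb / 1e3:.1f} GB buffered")  # ~4.6 GB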

My code is below; any help or suggestions would be appreciated.

# Transfer Learning to ADS FreeSurfer Brain Masks
import nobrainer
# initialize
csv_of_filepaths = './nobrainer/code/nobrainer_fs-SkullStripped_trainingdata.csv'
filepaths = nobrainer.io.read_csv(csv_of_filepaths)
# split into train and evaluate
train_paths = filepaths[:324]
evaluate_paths = filepaths[324:]
# convert images to tensorflow records
nobrainer.io.convert(
    train_paths,
    tfrecords_template='./nobrainer/processed/data-train_shard-{shard:03d}.tfrecords',
    volumes_per_shard=3,
    num_parallel_calls=24)
nobrainer.io.convert(
    evaluate_paths,
    tfrecords_template='./nobrainer/processed/data-evaluate_shard-{shard:03d}.tfrecords',
    volumes_per_shard=3,
    num_parallel_calls=24)
#### parameters for train/evaluate datasets
n_classes = 1
batch_size = 2
volume_shape = (256, 256, 256)
block_shape = (128, 128, 128)
n_epochs = None
augment = False
shuffle_buffer_size = 10
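# note: the shuffle buffer holds decoded examples in memory; if each element
# is a full 256^3 float32 volume (~67 MB), 10 of them is ~0.7 GB per tensor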
num_parallel_calls = 24
# train object
dataset_train = nobrainer.volume.get_dataset(
    file_pattern='./nobrainer/processed/data-train_shard-*.tfrecords',
    n_classes=n_classes,
    batch_size=batch_size,
    volume_shape=volume_shape,
    block_shape=block_shape,
    n_epochs=n_epochs,
    augment=augment,
    shuffle_buffer_size=shuffle_buffer_size,
    num_parallel_calls=num_parallel_calls,
)
# evaluate object
dataset_evaluate = nobrainer.volume.get_dataset(
    file_pattern='./nobrainer/processed/data-evaluate_shard-*.tfrecords',
    n_classes=n_classes,
    batch_size=batch_size,
    volume_shape=volume_shape,
    block_shape=block_shape,
    n_epochs=1,
    augment=False,
    shuffle_buffer_size=None,
    num_parallel_calls=1,
)
##################################################
# TRANSFER LEARNING
### get existing model for transfer learning
##################################################
import tensorflow as tf
model_path = tf.keras.utils.get_file(
    fname='brain-extraction-unet-128iso-model.h5',
    origin='https://github.com/neuronets/nobrainer-models/releases/download/0.1/brain-extraction-unet-128iso-model.h5')
model = tf.keras.models.load_model(model_path, compile=False)
model.summary()
# set L2 regularization for layers
# (caveat: assigning a regularizer to an already-built layer may not take
# effect, since Keras registers regularization losses when a layer is built)
for layer in model.layers:
    if hasattr(layer, 'kernel_regularizer'):
        layer.kernel_regularizer = tf.keras.regularizers.l2(0.01)
# set learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-05)
# compile model
model.compile(
    optimizer=optimizer,
    loss=nobrainer.losses.jaccard,
    metrics=[nobrainer.metrics.dice],
)
# compute steps given sizes
steps_per_epoch = nobrainer.volume.get_steps_per_epoch(
    n_volumes=len(train_paths),
    volume_shape=volume_shape,
    block_shape=block_shape,
    batch_size=batch_size)
validation_steps = nobrainer.volume.get_steps_per_epoch(
    n_volumes=len(evaluate_paths),
    volume_shape=volume_shape,
    block_shape=block_shape,
    batch_size=batch_size)
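# sanity check: (256/128)^3 = 8 blocks per volume, so
#   train: 324 volumes * 8 blocks / batch of 2 = 1296 steps
#   evaluate: 20 volumes * 8 blocks / batch of 2 = 80 steps
# (matches the "Train for 1296 steps, validate for 80 steps" log line)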
## TRAIN MODEL!!!
model.fit(
    dataset_train,
    epochs=1,
    verbose=1,
    steps_per_epoch=steps_per_epoch, 
    validation_data=dataset_evaluate, 
    validation_steps=validation_steps,
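    # (use_multiprocessing/workers apply to Python generator/Sequence inputs;
    # they have no effect when the input is a tf.data dataset)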
    use_multiprocessing=True,
    workers=24)
model.save('./nobrainer/nobrainer-models/ads-transfer-learning_manual-edits_brain-extraction-unet-128iso-model.h5', 
            save_format='h5')
model.save_weights('./nobrainer/nobrainer-models/ads-transfer-learning_manual-edits_brain-extraction-unet-128iso-weights.h5', 
            save_format='h5')
hebbianloop commented 4 years ago

Update: not sure how this happened, but I get this error message when attempting to call nobrainer from the command line:

  File "/Users/admin/Anvil/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'cloudpickle==1.1.1' distribution was not found and is required by tensorflow-probability

Update: I performed a reinstall and this message mysteriously disappeared. The originally reported behavior is still present.

kaczmarj commented 4 years ago

hi @seldamat - sorry for the delay, and thanks for the report. can you update nobrainer and try the above code again?

pip install -U --no-cache-dir https://github.com/neuronets/nobrainer/tarball/master

that might update dependencies too, so it's best done in a virtual environment or conda environment, for example (environment name is just an example):
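
conda create -n nobrainer-env python=3.7
conda activate nobrainer-env
pip install -U --no-cache-dir https://github.com/neuronets/nobrainer/tarball/master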

kaczmarj commented 4 years ago

FYI i have enhanced the tfrecords writing and reading functionality in #79. can you please refer to https://github.com/neuronets/nobrainer/blob/master/guide/transfer_learning.ipynb for how to write and read tfrecords in the new format?
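
in case it helps, the write/read calls in that guide look roughly like this (a sketch; check the notebook for the exact signatures):

import nobrainer

# write volumes to tfrecords with the new API
nobrainer.tfrecord.write(
    features_labels=train_paths,
    filename_template='data/data-train_shard-{shard:03d}.tfrec',
    examples_per_shard=3)

# build a dataset from the new-format shards
dataset_train = nobrainer.dataset.get_dataset(
    file_pattern='data/data-train_shard-*.tfrec',
    n_classes=1,
    batch_size=2,
    volume_shape=(256, 256, 256),
    block_shape=(128, 128, 128))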

kaczmarj commented 4 years ago

@seldamat - let me know if you have any updates