lishen / end2end-all-conv

Deep Learning to Improve Breast Cancer Detection on Screening Mammography
367 stars · 126 forks

Support MultiGPU training? #4

Open 804609 opened 6 years ago

804609 commented 6 years ago

Hi, does your code support multi-GPU training? It seems there is no response after the following output:

Create generator for train set
Found 3775 images belonging to 3 classes.
Create generator for val set
Found 501 images belonging to 3 classes.
Start model training on the last dense layer only
Epoch 1/1

lishen commented 6 years ago

The multigpu part is buggy. No guarantee it will work. Are you interested in contributing?


zccoder commented 6 years ago

@804609 I found that it can. However, I wonder how to get the dataset?

804609 commented 6 years ago

Hi, Lishen: I tried upgrading to Keras v2.0.9, which supports multi-GPU training through the new multi_gpu_model() function. However, the code then fails with the following error:

Create generator for train set
Found 3768 images belonging to 3 classes.
Create generator for val set
Found 496 images belonging to 3 classes.
Start model training on the last dense layer only
Epoch 1/1
Traceback (most recent call last):
  File "patch_clf_train.py", line 309, in <module>
    run(args.train_dir, args.val_dir, args.test_dir, *run_opts)
  File "patch_clf_train.py", line 151, in run
    hidden_dropout2=hidden_dropout2)
  File "/breast_cancer/end2end-all-conv/ddsm_train/dm_keras_ext.py", line 204, in do_3stage_training
    verbose=2)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2046, in fit_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 518, in get
    raise StopIteration(e)
StopIteration: can't pickle generator objects

I traced the code but can't find the reason. Do you know how to fix this error?

804609 commented 6 years ago

@zccoder Lishen's paper has that information.

lishen commented 6 years ago

Not clear to me. I have already made multi gpu work in private repo. But it’s somewhat unstable at the moment.

Would you post more details of your implementation? Thanks!

zccoder commented 6 years ago

@804609 So which Python files did you run, and how did you configure them? Can you put the files here? Thank you!

804609 commented 6 years ago

@lishen, I can now run in parallel on an 8-GPU platform. Please see the code I modified. You said it's somewhat unstable; can you explain in more detail? How much speedup do you get on 8 GPUs compared to 1 GPU?

# ================= Model creation ============== #
if gpu_count > 1:
    # Build the model on the CPU so the master weights stay in host
    # memory, then replicate it across GPUs with make_parallel.
    with tf.device('/cpu:0'):
        model, preprocess_input, top_layer_nb = get_dl_model(
            net, nb_class=len(class_list), use_pretrained=use_pretrained,
            resume_from=resume_from, img_size=img_size, top_layer_nb=top_layer_nb,
            weight_decay=weight_decay, hidden_dropout=hidden_dropout,
            nb_init_filter=nb_init_filter, init_filter_size=init_filter_size,
            init_conv_stride=init_conv_stride, pool_size=pool_size,
            pool_stride=pool_stride, alpha=alpha, l1_ratio=l1_ratio,
            inp_dropout=inp_dropout)
    model, org_model = make_parallel(model, gpu_count)
else:
    # Single-GPU (or CPU) path: build the model directly.
    model, preprocess_input, top_layer_nb = get_dl_model(
        net, nb_class=len(class_list), use_pretrained=use_pretrained,
        resume_from=resume_from, img_size=img_size, top_layer_nb=top_layer_nb,
        weight_decay=weight_decay, hidden_dropout=hidden_dropout,
        nb_init_filter=nb_init_filter, init_filter_size=init_filter_size,
        init_conv_stride=init_conv_stride, pool_size=pool_size,
        pool_stride=pool_stride, alpha=alpha, l1_ratio=l1_ratio,
        inp_dropout=inp_dropout)
    org_model = model
# With feature-wise centering, normalization is handled elsewhere, so
# per-network preprocessing is disabled.
if featurewise_center:
    preprocess_input = None
lishen commented 6 years ago

@804609 ,

Your code doesn't make use of the new multi_gpu_model API. It uses the make_parallel function, which is a "monkey patch" for multi-GPU support. You should change it to the new function.

I found that Keras' new function works, but sometimes it blows up the GPU with a "resource exhausted" error even though the same code ran successfully before. I'm not sure what the reason was.
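For reference, a minimal sketch of what switching from make_parallel to the new API might look like. This is an untested sketch, not the repo's actual patch; it assumes Keras >= 2.0.9 on a multi-GPU machine, reuses get_dl_model from this repo (with most arguments elided), and gpu_count is a placeholder:

```python
import tensorflow as tf
from keras.utils import multi_gpu_model  # added in Keras 2.0.9

gpu_count = 8  # placeholder: number of GPUs on the machine

# Build the template model on the CPU so the master weights live in
# host memory; multi_gpu_model then replicates it onto each GPU.
with tf.device('/cpu:0'):
    org_model, preprocess_input, top_layer_nb = get_dl_model(
        net, nb_class=len(class_list), use_pretrained=use_pretrained,
        resume_from=resume_from, img_size=img_size)

model = multi_gpu_model(org_model, gpus=gpu_count)
# Compile and fit `model`, but save/load weights through `org_model`,
# since the parallel wrapper has a different layer structure.
```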

804609 commented 6 years ago

@lishen, I can't get the code to run with the new version or the new function. I got the error "StopIteration: can't pickle generator objects" when I upgraded to 2.0.9 with the same code, but it works fine with 2.0.8. I found some discussion in fchollet/keras/issues/8368: an update to fit_generator() makes it use OrderedEnqueuer instead of GeneratorEnqueuer when the underlying generator is a Sequence, which can break existing code.

Is your code perhaps passing the wrong class or generator to fit_generator()? How did you patch your code?
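The "can't pickle generator objects" failure above can be reproduced in plain Python: the multiprocessing-based enqueuer has to pickle the data source to send it to worker processes, and plain generators are not picklable, while a class-based sequence (in the spirit of keras.utils.Sequence; BatchSequence here is an illustrative stand-in, not a Keras class) is:

```python
import pickle

def batch_generator():
    # A plain Python generator: holds live frame state, so it
    # cannot be pickled for use in worker processes.
    while True:
        yield [1, 2, 3]

class BatchSequence:
    # A class-based sequence (like keras.utils.Sequence): plain
    # attributes and indexed access, so it pickles cleanly.
    def __init__(self, batches):
        self.batches = batches

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        return self.batches[idx]

# Pickling a generator object raises TypeError.
gen = batch_generator()
try:
    pickle.dumps(gen)
    gen_picklable = True
except TypeError:
    gen_picklable = False

# Pickling the sequence round-trips without loss.
seq = BatchSequence([[1, 2, 3], [4, 5, 6]])
seq_roundtrip = pickle.loads(pickle.dumps(seq))

print(gen_picklable)     # False
print(seq_roundtrip[0])  # [1, 2, 3]
```

This is why a generator that worked with GeneratorEnqueuer (threads, no pickling) breaks once fit_generator() routes it through a multiprocessing enqueuer.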