keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Training a keras application doesn't work with tensorflow backend, but it does work with pytorch. #19061

Open odinsbane opened 8 months ago

odinsbane commented 8 months ago

When I train a model built from keras.applications with the tensorflow backend, it never finishes a batch. With pytorch as the backend it trains fine.

Here is a working example:

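    import numpy
    import keras

    # Pretrained MobileNet; an intermediate layer's output becomes the new base.
    mdl = keras.applications.MobileNet()
    op = mdl.layers[85]
    # Transposed convolution: 1 filter, 32x32 kernel, stride 32.
    op2 = keras.layers.Conv2DTranspose(1, (32, 32), (32, 32))(op.output)
    model2 = keras.models.Model(inputs=mdl.inputs, outputs=op2)

    # Freeze every layer except the one named "train_me".
    for layer in model2.layers:
        if layer.name == "train_me":
            print("training")
        else:
            layer.trainable = False

    # Random data, just enough to run a single batch.
    x = numpy.random.random((4, 224, 224, 3))
    y = numpy.random.random((4, 224, 224, 1))

    model2.compile(optimizer=keras.optimizers.Adam(0.0001), loss="mean_squared_error")

    model2.fit(x, y)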

If I run this with a tensorflow backend, then it never finishes. If I run it with a pytorch backend then it finishes very quickly, less than 1 second. I haven't seen the tensorflow version finish yet.
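For anyone reproducing the comparison: Keras 3 reads its backend from the `KERAS_BACKEND` environment variable, and the variable has to be set before `keras` is imported for the first time. A minimal sketch, assuming only the three backends listed here are of interest; `select_backend` is a hypothetical helper name, not part of Keras.

```python
import os

# Backends of interest for this comparison (assumption: Keras may accept
# other names too; these are the ones relevant to training here).
VALID_BACKENDS = ("tensorflow", "torch", "jax")


def select_backend(name):
    """Set KERAS_BACKEND. Call this before the first `import keras`."""
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {name!r}")
    os.environ["KERAS_BACKEND"] = name
    return name


# Note that the PyTorch backend is spelled "torch", not "pytorch".
select_backend("torch")
```

After this runs, `import keras` in the same process should pick up the chosen backend. Changing the variable after Keras has already been imported has no effect on that process.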

This is a warning I get from tensorflow:

2024-01-16 10:44:20.910445: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng27{k2=0,k12=-1,k13=2,k14=3,k15=0,k17=171,k18=1,k23=0} for conv (f32[1,4,224,224]{3,2,1,0}, u8[0]{0}) custom-call(f32[1,1024,32,32]{3,2,1,0}, f32[4,1024,193,193]{3,2,1,0}, f32[4]{0}), window={size=193x193 pad=192_192x192_192 rhs_reversal=1x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0} is taking a while...

odinsbane commented 8 months ago

Actually, this simple example did finally finish with tensorflow; it took 9 minutes. If I hide the GPU with export CUDA_VISIBLE_DEVICES="", it finishes in 5 seconds.

If I train just the MobileNet, tensorflow seems to work. The slowdown only appears when I do transfer learning and use a different output.
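The GPU versus CPU comparison above can also be made from inside a script. The sketch below hides the CUDA devices through the same `CUDA_VISIBLE_DEVICES` variable and times an arbitrary call; `hide_gpus` and `timed` are hypothetical helper names used only for illustration.

```python
import os
import time


def hide_gpus():
    """Hide all CUDA devices from this process.

    Call this before importing tensorflow or torch, because the variable
    is read when the framework initializes CUDA.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


hide_gpus()
# With the model from the original post this would be used as:
#     _, seconds = timed(model2.fit, x, y)
_, seconds = timed(sum, range(1000))
print(f"call took {seconds:.6f} s")
```

Running the same script once with and once without the `hide_gpus()` call gives the two timings being compared in this thread.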

sachinprasadhs commented 8 months ago

I was able to run the code successfully with TensorFlow within a few seconds. Here is the Gist for reference: https://gist.github.com/sachinprasadhs/d8667509dd1ad6d22d88336eab821a0f

odinsbane commented 8 months ago

The notebook you provided says the model has no trainable weights. When I ran it, the last layer was trainable. Does that make a difference?


sachinprasadhs commented 8 months ago

In your code, you set trainable = False on every layer that is not named "train_me", and no layer has that name, so the model has 0 trainable parameters.

    if layer.name == "train_me":
        print("training")
    else:
        layer.trainable = False

    Total params: 4,277,441 (16.32 MB)
    Trainable params: 0 (0.00 B)
    Non-trainable params: 4,277,441 (16.32 MB)

odinsbane commented 8 months ago

I give the layer I add the name train_me, so it has trainable weights. I am not sure how that got removed from the example.
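A missing layer name fails silently here: the loop freezes everything and the model ends up with 0 trainable parameters. The sketch below shows one way to make that case loud. `freeze_all_except` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for Keras layers (anything with `name` and `trainable` attributes) so the example runs without Keras installed.

```python
from types import SimpleNamespace


def freeze_all_except(layers, trainable_name):
    """Freeze every layer except the one called `trainable_name`.

    Returns the names of the layers left trainable and raises if nothing
    matched, so a missing name cannot silently freeze the whole model.
    """
    kept = []
    for layer in layers:
        layer.trainable = layer.name == trainable_name
        if layer.trainable:
            kept.append(layer.name)
    if not kept:
        raise ValueError(f"no layer named {trainable_name!r}")
    return kept


# Stand-in layers; the names are illustrative only.
layers = [
    SimpleNamespace(name="conv_pw_13", trainable=True),
    SimpleNamespace(name="train_me", trainable=True),
]
print(freeze_all_except(layers, "train_me"))
```

With the model from the original post, the added layer would be created with `name="train_me"` passed to `keras.layers.Conv2DTranspose`, and `model2.layers` would be passed to the helper.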


odinsbane commented 8 months ago

I checked again. Even when there are no trainable weights, this takes a long time on my computer with the tensorflow backend. It is the same on both WSL2 and Linux for me.

Is the notebook you created using the GPU?

odinsbane commented 8 months ago

A little more debugging info: if I use the keras 2.15 that was installed with tensorflow, it works as expected. The main difference I see is these errors from ptxas:

Unsupported .version 7.8; current version is '7.5'
ptxas fatal : Ptx assembly aborted due to errors

It seems like my cuda/tensorflow drivers might not be correct for keras 3. I installed them using the recommended instructions from the keras website, e.g. pip install tensorflow[and-cuda] then pip install --upgrade keras.
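When chasing a version mismatch like this, it helps to post the installed package versions and which `ptxas` binary is first on the PATH. The sketch below collects both using only the standard library; `environment_report` is a hypothetical helper, and the idea that a system-wide `ptxas` is involved is an assumption, not something confirmed in this thread.

```python
import shutil
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("keras", "tensorflow", "torch")):
    """Collect package versions plus the path of the first ptxas on PATH."""
    report = {name: installed_version(name) for name in packages}
    report["ptxas"] = shutil.which("ptxas")  # None if not found on PATH
    return report


for name, value in environment_report().items():
    print(f"{name}: {value}")
```

A value of `None` for a package means it is not installed in the current environment; `None` for `ptxas` means no such binary was found on the PATH.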