keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Keras 3.7 Broke My Code #20568

Open das-apratim opened 2 days ago

das-apratim commented 2 days ago

Hello Devs,

I am trying to implement the Keras DeepLabV3 segmentation guide https://keras.io/keras_hub/guides/semantic_segmentation_deeplab_v3/ on a custom dataset

with the following changes:

  1. Classes: 2
  2. Image Size (1024,1024)

In Keras 3.6 there were no issues while training, but since the latest release (Keras 3.7) I start getting loss: nan after 107 steps of the first epoch. As soon as I revert to 3.6, everything works again.

To resolve the issue with 3.7 I investigated multiple possibilities (a sketch of the NaN-data check is shown after this list):

  1. Exploding gradients
  2. NaN data points
  3. Different optimizers
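
A minimal way to scan a `tf.data` pipeline for non-finite values and to stop a run as soon as the loss turns NaN might look like this (an illustrative sketch, not the actual debugging code used here; `train_dataset` is assumed to yield `(image, mask)` batches):

```python
import tensorflow as tf
import keras

# Scan a sample of batches for NaN/Inf values in images and masks.
for step, (images, masks) in enumerate(train_dataset.take(100)):
    if not bool(tf.reduce_all(tf.math.is_finite(images))):
        print(f"Non-finite image values in batch {step}")
    if not bool(tf.reduce_all(tf.math.is_finite(tf.cast(masks, tf.float32)))):
        print(f"Non-finite mask values in batch {step}")

# Abort training immediately if the loss ever becomes NaN.
nan_guard = keras.callbacks.TerminateOnNaN()
# model.fit(train_dataset, callbacks=[nan_guard, ...], ...)
```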

But the issue still remains. I also noticed a new warning in the new version:

**" UserWarning: The structure ofinputsdoesn't match the expected structure: ['keras_tensor_265']. Received: the structure of inputs=(2,1024,1024,3) warnings.warn( "**

I am a novice, so it would be great if anyone could guide me through how to resolve this. The following is the code snippet used to create the model:

```python
import keras
import keras_hub
import tensorflow as tf

NUM_CLASSES = 2
IMAGE_SIZE = 1024
INITIAL_LR = 0.007
BATCH_SIZE = 16
EPOCHS = 20

learning_rate = keras.optimizers.schedules.CosineDecay(
    INITIAL_LR,
    decay_steps=EPOCHS * 2124,
)

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

with strategy.scope():
    image_converter = keras_hub.layers.DeepLabV3ImageConverter(
        image_size=(IMAGE_SIZE, IMAGE_SIZE),
        interpolation="bilinear",
        data_format="channels_last",
    )
    preprocessor = keras_hub.models.DeepLabV3ImageSegmenterPreprocessor(image_converter)
    image_encoder = keras_hub.models.ResNetBackbone.from_preset("resnet_50_imagenet")

    deeplab_backbone = keras_hub.models.DeepLabV3Backbone(
        image_encoder=image_encoder,
        low_level_feature_key="P2",
        spatial_pyramid_pooling_key="P5",
        dilation_rates=[6, 12, 18],
        upsampling_size=8,
    )

    model = keras_hub.models.DeepLabV3ImageSegmenter(
        backbone=deeplab_backbone,
        num_classes=NUM_CLASSES,
        activation="sigmoid",
        # activation="relu",
        preprocessor=preprocessor,
    )

    model.load_weights("/kaggle/working/DeepLab.weights.h5")

    loss = keras.losses.CategoricalCrossentropy(from_logits=False)  # requires one-hot encoded masks
    # loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)  # does not require one-hot encoding

    model.compile(
        optimizer=keras.optimizers.Adam(
            learning_rate=learning_rate, weight_decay=0.0001, global_clipnorm=1.0
        ),
        loss=loss,
        metrics=[
            keras.metrics.MeanIoU(num_classes=2, name="iou"),
            # keras.metrics.IoU(num_classes=2, target_class_ids=(0, 1), sparse_y_pred=True, name="iou"),
            # keras.metrics.CategoricalAccuracy(name="cat_acc", dtype=None),
        ],
    )

    checkpointing = keras.callbacks.ModelCheckpoint(
        "/kaggle/working/DeepLab.weights.h5",
        verbose=1,
        save_weights_only=True,
        monitor="iou",
        save_best_only=True,
        mode="auto",
    )
    early_stopping = keras.callbacks.EarlyStopping(monitor="iou", patience=10)

    history = model.fit(
        train_dataset,
        callbacks=[checkpointing, early_stopping],
        shuffle=True,
        validation_data=val_dataset,
        epochs=EPOCHS,
    )
```
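
For context on the one-hot comment above: `CategoricalCrossentropy` expects one-hot encoded masks, so the dataset pipeline would need a mapping step along these lines (an illustrative sketch with a hypothetical `one_hot_mask` helper, not the actual pipeline used here):

```python
import tensorflow as tf

NUM_CLASSES = 2

def one_hot_mask(image, mask):
    # mask holds integer class IDs with shape (H, W) or (H, W, 1).
    mask = tf.cast(mask, tf.int32)
    if mask.shape.rank == 3:
        mask = tf.squeeze(mask, axis=-1)  # drop the trailing channel dimension
    return image, tf.one_hot(mask, depth=NUM_CLASSES)

# train_dataset = train_dataset.map(one_hot_mask, num_parallel_calls=tf.data.AUTOTUNE)
```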

I am running this notebook on Kaggle using 2 x T4 GPUs.

fchollet commented 2 days ago

Are you able to bisect the exact commit or range of commits that broke your code with a NaN?

With 146 commits it should take you at most 7 runs to identify the particular commit responsible, via bisection.

das-apratim commented 1 day ago

Hi,

So I followed your instructions. The code works up to 5bec656bf00bce3272516fc60136be4caa8aa7bd and starts failing at commit 28d39c0cc766767f4db54edc8b8ce68d3a05d4b4, made on 23 Nov 2024.


fchollet commented 1 day ago

@james77777778 @hertschuh do you guys have any thoughts as to why this commit would cause a NaN loss?

Thoughts:

  • `LossScaleOptimizer` isn't used here, so the `aggregation="none"` there would not have any impact.
  • metrics use `aggregation="sum"`, but they cannot interact with the loss or the weights to cause a NaN loss.
  • This only leaves `iterations` on the base optimizer having `aggregation="only_first_replica"` and int type. Note that it could likely be switched to `"none"`.

@das-apratim I would also recommend that you try the JAX backend (which is a better fit anyway since you are training on TPU), with keras.distribution.DataParallel().

fchollet commented 1 day ago

> This only leaves `iterations` on the base optimizer having `aggregation="only_first_replica"` and int type. Note that it could likely be switched to `"none"`.

@das-apratim you can try changing `aggregation="only_first_replica",` (line 163) to `aggregation="none",` in `base_optimizer.py`, reinstall Keras, and see if that works.

das-apratim commented 1 day ago

Nope, this didn't resolve the issue... same NaN after 107 steps.

```
Number of devices: 2
/kaggle/working/keras/keras/src/saving/saving_lib.py:757: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 369 variables.
  saveable.load_own_variables(weights_store.get(inner_path))
/kaggle/working/keras/keras/src/models/functional.py:229: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=type(Tensor("Placeholder:0", shape=(2, 1024, 1024, 3), dtype=float32))
  warnings.warn(
Epoch 1/20
/kaggle/working/keras/keras/src/models/functional.py:229: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=type(Tensor("data:0", shape=(2, 1024, 1024, 3), dtype=float32))
  warnings.warn(
/kaggle/working/keras/keras/src/models/functional.py:229: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=type(Tensor("data_1:0", shape=(2, 1024, 1024, 3), dtype=float32))
  warnings.warn(
 115/6740 ━━━━━━━━━━━━━━━━━━━━ 2:01:30 1s/step - iou: nan - loss: 0.0234
```

das-apratim commented 1 day ago

Also, the code does make an attempt to run on TPU, but it fails with the following error.

> @james77777778 @hertschuh do you guys have any thoughts as to why this commit would cause a NaN loss?
>
> Thoughts:
>
> • LossScaleOptimizer isn't used here, so the aggregation="none" there would not have any impact.
> • metrics use aggregation="sum", but they cannot interact with the loss or the weights to cause a NaN loss.
> • This only leaves iterations on the base optimizer having aggregation="only_first_replica" and int type. Note that it could likely be switched to "none".
>
> @das-apratim I would also recommend that you try the JAX backend (which is a better fit anyway since you are training on TPU), with keras.distribution.DataParallel().

As I am a newbie, can you give me an implementation example of how to do this? Also, my code fails to run on TPU... it gives an XLA compilation error...

```
I0000 00:00:1733061621.485612      13 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
/usr/local/lib/python3.10/site-packages/keras/src/saving/saving_lib.py:719: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 369 variables.
  saveable.load_own_variables(weights_store.get(inner_path))
/usr/local/lib/python3.10/site-packages/keras/src/models/functional.py:225: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=*
  warnings.warn(
Epoch 1/50
2024-12-01 14:01:05.766890: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node StatefulPartitionedCall.
I0000 00:00:1733061668.587183     861 tpu_compilation_cache_interface.cc:441] TPU host compilation cache miss: cache_key(4d9dd4c2c786de36:0:0), session_name()
I0000 00:00:1733061671.728617     861 tpu_compile_op_common.cc:507] Found 0 programs. Skip fingerprint registration.
I0000 00:00:1733061671.740874     861 tpu_compile_op_common.cc:245] Compilation of 4d9dd4c2c786de36:0:0 with session name took 3.153639535s and failed
E0000 00:00:1733061671.741692     861 tpu_compilation_cache_external.cc:112] Input 0 to node `StatefulPartitionedCall/BroadcastArgs` with op BroadcastArgs must be a compile-time constant.

XLA compilation requires that operator arguments that represent shapes or dimensions be evaluated to concrete values at compile time. This error means that a shape or dimension argument could not be evaluated at compile time, usually because the value of the argument depends on a parameter to the computation, on a variable, or on a stateful operation such as a random number generator.

Stack trace for op definition:
dummy_file_name:10:dummy_function_name

	 [[{{function_node __inference_one_step_on_data_49343}}{{node BroadcastArgs}}]]

2024-12-01 14:01:11.741717: F tensorflow/core/tpu/kernels/tpu_program_group.cc:90] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7b3bf8678e3c,7b3bf862a04f,59b087896eaf,59b087896eaf&map=
SIGABRT received by PID 13 (TID 861) on cpu 40 from PID 13; stack trace:
PC: @     0x7b3bf8678e3c  (unknown)  (unknown)
    @     0x7b3afce90387        928  (unknown)
    @     0x7b3bf862a050      13648  (unknown)
    @     0x59b087896eb0  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7b3bf8678e3c,7b3afce90386,7b3bf862a04f,59b087896eaf&map=
E1201 14:01:11.756366     861 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E1201 14:01:11.756380     861 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E1201 14:01:11.756384     861 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E1201 14:01:11.756412     861 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E1201 14:01:11.756424     861 coredump_hook.cc:598] RAW: Dumping core locally.
```

And the session crashes... any clues as to why this is happening?

fchollet commented 1 day ago

How to do this:

1. Delete the 3 TF distribution related lines:

```python
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

with strategy.scope():
```

2. Add the JAX distribution lines:

```python
from keras import distribution

distribution.set_distribution(distribution.DataParallel())
```

3. Set JAX as your backend by editing ~/.keras/keras.json (you can also achieve this by adding the following code at the very beginning, before you import Keras):

```python
import os
os.environ["KERAS_BACKEND"] = "jax"
```
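
Putting the three steps together, the top of the notebook might look roughly like this (a sketch; it assumes the model-building, compile, and fit code from the original snippet stays unchanged, just without the `strategy.scope()` block):

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # must be set before keras is imported

import keras
import keras_hub
from keras import distribution

# Shard each batch across all available accelerators.
distribution.set_distribution(distribution.DataParallel())

# ...build the DeepLabV3 model, compile it, and call model.fit() as before,
# without tf.distribute.MirroredStrategy() or the strategy.scope() indentation.
```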
fchollet commented 1 day ago

> 1s/step - iou: nan - loss: 0.0234

According to your logs you don't have a NaN loss! You have a NaN metric (iou). This is entirely harmless. You can just reset the metric or something.

das-apratim commented 1 day ago

I did try for 20 epochs; the metrics were NaN and the losses also turned to NaN after some more steps. So let me post more logs...

james77777778 commented 22 hours ago

> @james77777778 @hertschuh do you guys have any thoughts as to why this commit would cause a NaN loss?
>
> Thoughts:
>
> * `LossScaleOptimizer` isn't used here, so the `aggregation="none"` there would not have any impact.
> * metrics use `aggregation="sum"`, but they cannot interact with the loss or the weights to cause a NaN loss.
> * This only leaves `iterations` on the base optimizer having `aggregation="only_first_replica"` and int type. Note that it could likely be switched to `"none"`.

Since there is no reproducible script for debugging, this is a random guess: Before https://github.com/keras-team/keras/commit/28d39c0cc766767f4db54edc8b8ce68d3a05d4b4, the aggregation behavior might have been broken due to incorrect propagation of the aggregation attr to the variables. Essentially, the training would be an aggregation=None setting (the default value for tf.Variable), which is likely incorrect.

@das-apratim could you first try training the model without using tf.distribute.MirroredStrategy() to check if any NaNs occur?

If the training runs without issues, try adding back tf.distribute.MirroredStrategy() and modifying _map_aggregation in keras/src/backend/tensorflow/core.py as follows:

        mapping = {
            "none": tf.VariableAggregation.NONE,
            "sum": tf.VariableAggregation.NONE,
            "mean": tf.VariableAggregation.NONE,
            "only_first_replica": tf.VariableAggregation.NONE,
        }

This adjustment reflects the behavior in Keras 3.6. See if the training runs well with this change.

If it does, incrementally restore the original mapping, key by key, to identify which one is causing the issue.

Please report back your findings so we can pinpoint the root cause.
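
For illustration, re-enabling one key per run is one way to do this incrementally (my sketch, assuming the original mapping restores each key to its matching `tf.VariableAggregation` value):

```python
import tensorflow as tf

# Run 1: restore only "sum"; keep every other key mapped to NONE.
mapping = {
    "none": tf.VariableAggregation.NONE,
    "sum": tf.VariableAggregation.SUM,
    "mean": tf.VariableAggregation.NONE,
    "only_first_replica": tf.VariableAggregation.NONE,
}

# Run 2: additionally restore "mean" -> tf.VariableAggregation.MEAN.
# Run 3: restore "only_first_replica" -> tf.VariableAggregation.ONLY_FIRST_REPLICA.
# The run that reintroduces the NaN points at the problematic aggregation mode.
```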