das-apratim opened this issue 2 days ago
Are you able to bisect the exact commit or range of commits that broke your code with a NaN?

`git checkout {commit_hash}` followed by `python pip_build.py --install`

With 146 commits it should take you at most 7 runs to identify the particular commit responsible, via bisection.
Hi,

So I followed your instructions: the code works up to commit 5bec656bf00bce3272516fc60136be4caa8aa7bd and starts failing at commit 28d39c0cc766767f4db54edc8b8ce68d3a05d4b4, made on 23 Nov 2024.
@james77777778 @hertschuh do you guys have any thoughts as to why this commit would cause a NaN loss?

Thoughts:

* `LossScaleOptimizer` isn't used here, so the `aggregation="none"` there would not have any impact.
* Metrics use `aggregation="sum"`, but they cannot interact with the loss or the weights to cause a NaN loss.
* This only leaves `iterations` on the base optimizer having `aggregation="only_first_replica"` and int type. Note that it could likely be switched to `"none"`.

@das-apratim I would also recommend that you try the JAX backend (which is a better fit anyway since you are training on TPU), with `keras.distribution.DataParallel()`.
> This only leaves `iterations` on the base optimizer having `aggregation="only_first_replica"` and int type. Note that it could likely be switched to `"none"`.

@das-apratim you can try changing `aggregation="only_first_replica",` (line 163) to `aggregation="none",` in `base_optimizer.py`, reinstall Keras, and see if that works.
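For reference, a sketch of roughly what that line looks like in `keras/src/optimizers/base_optimizer.py` (the exact surrounding code may differ in your checkout, so treat this as an illustration rather than an exact diff):

```python
# Inside BaseOptimizer.__init__: the optimizer's step counter.
self._iterations = backend.Variable(
    0,
    name="iteration",
    dtype="int",
    trainable=False,
    aggregation="none",  # changed from "only_first_replica" for this experiment
)
```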
Nope, this didn't resolve the issue... same NaN after 107 steps.
```
Number of devices: 2
/kaggle/working/keras/keras/src/saving/saving_lib.py:757: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 369 variables.
  saveable.load_own_variables(weights_store.get(inner_path))
/kaggle/working/keras/keras/src/models/functional.py:229: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=type(Tensor("Placeholder:0", shape=(2, 1024, 1024, 3), dtype=float32))
  warnings.warn(
Epoch 1/20
/kaggle/working/keras/keras/src/models/functional.py:229: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=type(Tensor("data:0", shape=(2, 1024, 1024, 3), dtype=float32))
  warnings.warn(
/kaggle/working/keras/keras/src/models/functional.py:229: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=type(Tensor("data_1:0", shape=(2, 1024, 1024, 3), dtype=float32))
  warnings.warn(
 115/6740 ━━━━━━━━━━━━━━━━━━━━ 2:01:30 1s/step - iou: nan - loss: 0.0234
```
Also, the code does make an attempt to run on TPU, but it fails with an error (logs below).
> @das-apratim I would also recommend that you try the JAX backend (which is a better fit anyway since you are training on TPU), with `keras.distribution.DataParallel()`.
As I am a newbie, can you give me an implementation example of how to do this? Also, when my code does run on TPU, it gives an XLA compilation error:
```
I0000 00:00:1733061621.485612      13 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
/usr/local/lib/python3.10/site-packages/keras/src/saving/saving_lib.py:719: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 369 variables.
  saveable.load_own_variables(weights_store.get(inner_path))
/usr/local/lib/python3.10/site-packages/keras/src/models/functional.py:225: UserWarning: The structure of `inputs` doesn't match the expected structure: ['keras_tensor']. Received: the structure of inputs=*
  warnings.warn(
Epoch 1/50
2024-12-01 14:01:05.766890: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node StatefulPartitionedCall.
I0000 00:00:1733061668.587183     861 tpu_compilation_cache_interface.cc:441] TPU host compilation cache miss: cache_key(4d9dd4c2c786de36:0:0), session_name()
I0000 00:00:1733061671.728617     861 tpu_compile_op_common.cc:507] Found 0 programs. Skip fingerprint registration.
I0000 00:00:1733061671.740874     861 tpu_compile_op_common.cc:245] Compilation of 4d9dd4c2c786de36:0:0 with session name took 3.153639535s and failed
E0000 00:00:1733061671.741692     861 tpu_compilation_cache_external.cc:112] Input 0 to node `StatefulPartitionedCall/BroadcastArgs` with op BroadcastArgs must be a compile-time constant.
XLA compilation requires that operator arguments that represent shapes or dimensions be evaluated to concrete values at compile time. This error means that a shape or dimension argument could not be evaluated at compile time, usually because the value of the argument depends on a parameter to the computation, on a variable, or on a stateful operation such as a random number generator.
Stack trace for op definition: dummy_file_name:10:dummy_function_name
	 [[{{function_node __inference_one_step_on_data_49343}}{{node BroadcastArgs}}]]
2024-12-01 14:01:11.741717: F tensorflow/core/tpu/kernels/tpu_program_group.cc:90] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7b3bf8678e3c,7b3bf862a04f,59b087896eaf,59b087896eaf&map=
SIGABRT received by PID 13 (TID 861) on cpu 40 from PID 13; stack trace:
PC: @     0x7b3bf8678e3c  (unknown)  (unknown)
    @     0x7b3afce90387        928  (unknown)
    @     0x7b3bf862a050      13648  (unknown)
    @     0x59b087896eb0  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7b3bf8678e3c,7b3afce90386,7b3bf862a04f,59b087896eaf&map=
E1201 14:01:11.756366     861 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E1201 14:01:11.756380     861 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E1201 14:01:11.756384     861 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E1201 14:01:11.756412     861 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E1201 14:01:11.756424     861 coredump_hook.cc:598] RAW: Dumping core locally.
```
And the session crashes... any clues as to why this is happening?
How to do this: instead of

```python
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
with strategy.scope():
    ...
```

use

```python
from keras import distribution

distribution.set_distribution(distribution.DataParallel())
```

and set the backend to `"jax"` in `~/.keras/keras.json` (you can also achieve this by adding the following code at the very beginning, before you import keras):

```python
import os

os.environ["KERAS_BACKEND"] = "jax"
```
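A minimal end-to-end sketch of that setup (`build_model` and `train_ds` below are placeholders for your own model-building code and data pipeline, not names from your script):

```python
import os

# The backend must be selected before keras is imported.
os.environ["KERAS_BACKEND"] = "jax"

import keras
from keras import distribution

# Replicate the model on every available device (TPU cores or GPUs)
# and shard each input batch across them.
distribution.set_distribution(distribution.DataParallel())

model = build_model()  # placeholder: your DeepLabV3 model construction
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(train_ds, epochs=20)  # placeholder: your tf.data pipeline
```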
> 1s/step - iou: nan - loss: 0.0234
According to your logs you don't have a NaN loss! You have a NaN metric (iou). This is entirely harmless. You can just reset the metric or something.
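If it helps to confirm which quantity turns NaN first, a small callback along these lines can log it per batch (a rough sketch; `model` and `train_ds` are placeholders for your own objects):

```python
import math

import keras


class NaNWatcher(keras.callbacks.Callback):
    """Report the first training batch at which any logged value becomes NaN."""

    def on_train_batch_end(self, batch, logs=None):
        for name, value in (logs or {}).items():
            try:
                if math.isnan(float(value)):
                    print(f"batch {batch}: {name} became NaN (logs={logs})")
            except (TypeError, ValueError):
                pass  # skip non-numeric log entries


model.fit(train_ds, epochs=1, callbacks=[NaNWatcher()])
```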
I did try for 20 epochs; the metrics were NaN, and the losses also turned to NaN after some more steps. So let me post more logs...
> @james77777778 @hertschuh do you guys have any thoughts as to why this commit would cause a NaN loss?
>
> Thoughts:
>
> * `LossScaleOptimizer` isn't used here, so the `aggregation="none"` there would not have any impact.
> * metrics use `aggregation="sum"`, but they cannot interact with the loss or the weights to cause a NaN loss.
> * This only leaves `iterations` on the base optimizer having `aggregation="only_first_replica"` and int type. Note that it could likely be switched to `"none"`.
Since there is no reproducible script for debugging, this is a random guess:
Before https://github.com/keras-team/keras/commit/28d39c0cc766767f4db54edc8b8ce68d3a05d4b4, the aggregation behavior might have been broken due to incorrect propagation of the aggregation attr to the variables.
Essentially, the training would be running with an `aggregation=None` setting (the default value for `tf.Variable`), which is likely incorrect.
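For context, `aggregation` on a `tf.Variable` tells `tf.distribute` how writes from different replicas are merged. A generic illustration of the attr (not code from Keras itself):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # tf.Variable defaults to aggregation=tf.VariableAggregation.NONE.
    # The optimizer's step counter is meant to use ONLY_FIRST_REPLICA,
    # so cross-replica updates take the value computed on the first replica.
    iterations = tf.Variable(
        0,
        dtype=tf.int64,
        trainable=False,
        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
    )
```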
@das-apratim could you first try training the model without using `tf.distribute.MirroredStrategy()` to check if any NaNs occur?

If the training runs without issues, try adding back `tf.distribute.MirroredStrategy()` and modifying `_map_aggregation` in `keras/src/backend/tensorflow/core.py` as follows:
```python
mapping = {
    "none": tf.VariableAggregation.NONE,
    "sum": tf.VariableAggregation.NONE,
    "mean": tf.VariableAggregation.NONE,
    "only_first_replica": tf.VariableAggregation.NONE,
}
```
This adjustment reflects the behavior in Keras 3.6. See if the training runs well with this change.
If it does, incrementally restore the original mapping to identify which key is causing the issue. Here's a general guideline:
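For example, restore one key at a time and retest after each change (the order below is just one possible choice, not a prescribed one):

```python
# Same file as above: keras/src/backend/tensorflow/core.py
# Step 1: restore "sum" only; "mean" and "only_first_replica" stay disabled.
mapping = {
    "none": tf.VariableAggregation.NONE,
    "sum": tf.VariableAggregation.SUM,
    "mean": tf.VariableAggregation.NONE,
    "only_first_replica": tf.VariableAggregation.NONE,
}

# Step 2: if training is still healthy, also restore "mean" to
#         tf.VariableAggregation.MEAN and retest.
# Step 3: finally restore "only_first_replica" to
#         tf.VariableAggregation.ONLY_FIRST_REPLICA and retest.
# Whichever step reintroduces the NaN points at the problematic key.
```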
Please report back your findings so we can pinpoint the root cause.
Hello Devs,
I am trying to implement the Keras DeepLabV3 segmentation guide (https://keras.io/keras_hub/guides/semantic_segmentation_deeplab_v3/) on a custom dataset, with the following changes:
In Keras 3.6 there were no issues while training, but since the latest release, i.e. Keras 3.7, I started getting `loss: nan` after 107 steps in the first epoch; as soon as I reverted back to 3.6 all was good.

To resolve the issue with 3.7 I tried multiple approaches, but the issue still remains. Also, I noted a new warning in the new version:
**" UserWarning: The structure of
inputsdoesn't match the expected structure: ['keras_tensor_265']. Received: the structure of inputs=(2,1024,1024,3) warnings.warn( "**
I am a novice, but it would be great if anyone could guide me through this and how to resolve it. Following is the code snippet to create the model:
```python
INITIAL_LR = 0.007 * BATCH_SIZE / 16
EPOCHS = 20
learning_rate = keras.optimizers.schedules.CosineDecay(
    INITIAL_LR,
    decay_steps=EPOCHS * 2124,
)
IMAGE_SIZE = 1024
```
I am running this notebook on Kaggle using 2 x T4 GPUs.