natbprice opened this issue 6 months ago
Here is a Colab with a reproducible example:
https://colab.research.google.com/drive/1aCJVUNfro68fek-o0i_7Iojl6Qtix6NK?usp=sharing
I can reproduce the error using just Keras, so maybe I should open an issue there? Or maybe it should be fixed in TensorFlow? But the documentation for tensorflow.python.data.experimental.ops.distribute.compute_batch_size() describes its limitations, so I am not sure it is technically a bug in TensorFlow.
Hi @SuryanarayanaY, the related ticket in Keras was closed with the recommendation that this be fixed in keras-nlp. Per @hertschuh: "One should simply apply batch_size after the map and not in _convert_inputs_to_dataset".
I can't quite figure out the best way for this to work with the keras-nlp API. In particular, there are several combinations to handle of (1) distribution strategy, (2) input type (e.g., tf.data.Dataset, NumPy arrays), and (3) batching (e.g., pre-batched dataset, explicit batch_size).
Currently, _convert_inputs_to_dataset raises an error if you attempt to pass a tf.data.Dataset together with an explicit batch_size argument. There also appears to be error handling meant to prevent passing unbatched inputs, but the string matching on the error message may be outdated and no longer functioning.
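As a rough sketch of what I understand that suggestion to mean (the function name and signature below are illustrative, not the actual keras-nlp internals): run the preprocessor via map() over the unbatched dataset, and only batch afterwards.

```python
import tensorflow as tf

def convert_inputs_to_dataset_sketch(x, y, preprocessor, batch_size):
    # Hypothetical helper: apply the preprocessor via map() on unbatched
    # elements, then batch afterwards so batching is the last step.
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```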
@natbprice ,
Sorry for the delay, I'm still working on this. It turned out to be more complex to fix than I expected.
@natbprice ,
I experimented with a few things, but I could not find a fix in keras-nlp that would work in all cases.
However, I do have an easy workaround: ds.batch(8, drop_remainder=True). By doing this, the dataset knows that the first dimension, the batch size, is always 8. Then, it can infer the first dimension of the result of other operations like map.
If you don't do drop_remainder=True, it thinks the last batch may be incomplete. And while you can still retrieve the batch size right after batching, it doesn't propagate through other operations like map.
If you're concerned about not using the last few examples, you can shuffle or repeat the dataset before batching.
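For example, a quick sketch (not keras-nlp specific) of how the static shape behaves:

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.range(100))

# Without drop_remainder, the last batch may be smaller than 8, so the
# static first dimension is None and stays None after map().
dynamic = ds.batch(8).map(lambda x: x * 2)
print(dynamic.element_spec)  # TensorSpec(shape=(None,), dtype=tf.int32, name=None)

# With drop_remainder=True, every batch is exactly 8, so the static
# shape (8,) is known and propagates through map().
static = ds.batch(8, drop_remainder=True).map(lambda x: x * 2)
print(static.element_spec)   # TensorSpec(shape=(8,), dtype=tf.int32, name=None)
```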
@hertschuh do you have a working example? This doesn't seem to work for me.
Edit:
It seems like the workaround works outside of keras-nlp. Maybe there is something specific to keras-nlp that still needs to be resolved?
https://colab.research.google.com/drive/1IxVNDcNoIK4SiX2wuDQKfqR_6Or9P40I#scrollTo=0Hf6qJOxXsqI
@SuryanarayanaY can we reopen this issue please?
Describe the bug
This is an issue I am having with keras-nlp, but I am not sure if it can be solved here or whether it should be reported under Keras or TensorFlow.
Currently, the batch size is not calculated correctly when performing multi-worker distributed training with the JAX backend:
To Reproduce
Run (multi-worker?) distributed training with the JAX backend.
The issue seems to stem from https://github.com/keras-team/keras-nlp/blob/778ccd72fe5d74e8eedc7d38dfb57561821b7851/keras_nlp/src/utils/pipeline_model.py#L181 where mapping a preprocessor over the dataset leads to failure at https://github.com/keras-team/keras/blob/3105247028bb0a7e6d2f05f5daa44c9cfafd3e67/keras/src/distribution/distribution_lib.py#L465
Here is a minimal example where tensorflow.python.data.experimental.ops.distribute.compute_batch_size() returns -1 after mapping:
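A sketch of such a pipeline (the feature names are placeholders, and the commented value reflects the behavior reported in this issue):

```python
import tensorflow as tf
from tensorflow.python.data.experimental.ops import distribute

# Illustrative stand-in for a keras-nlp preprocessor: mapping any function
# over the already-batched dataset triggers the problem described above.
def preprocess(features, labels):
    return features, labels

features = {"text": tf.constant(["a", "b", "c", "d"])}
labels = tf.constant([0, 1, 0, 1])

ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)
ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

# Reported behavior: prints -1, i.e. the batch size can no longer be inferred.
print(distribute.compute_batch_size(ds).numpy())
```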
Expected behavior
A batched tf.data.Dataset object is recognized as being batched.