keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Intermittent test failure on TensorFlow GPU environment #20027

Open shashaka opened 1 month ago

shashaka commented 1 month ago

In keras/src/trainers/data_adapters/generator_data_adapter_test.py, I found an intermittent test failure on a TensorFlow GPU environment. It is related to the test_basic_flow method of that test case, so I put together the following reproduction code on my local machine.

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import math

import jax
import numpy as np
import tensorflow as tf
import torch
from absl.testing import parameterized
from jax import numpy as jnp

from keras.src import backend
from keras.src import testing
from keras.src.trainers.data_adapters import generator_data_adapter


def example_generator(x, y, sample_weight=None, batch_size=32):
    def make():
        for i in range(math.ceil(len(x) / batch_size)):
            low = i * batch_size
            high = min(low + batch_size, len(x))
            batch_x = x[low:high]
            batch_y = y[low:high]
            if sample_weight is not None:
                yield batch_x, batch_y, sample_weight[low:high]
            else:
                yield batch_x, batch_y

    return make

class TestCase(testing.TestCase, parameterized.TestCase):

    def test_basic_flow(self, use_sample_weight, generator_type):
        x = np.random.random((34, 4)).astype("float32")
        y = np.array([[i, i] for i in range(34)], dtype="float32")
        sw = np.random.random((34,)).astype("float32")
        if generator_type == "tf":
            x, y, sw = tf.constant(x), tf.constant(y), tf.constant(sw)
        elif generator_type == "jax":
            x, y, sw = jnp.array(x), jnp.array(y), jnp.array(sw)
        elif generator_type == "torch":
            x, y, sw = (
                torch.as_tensor(x),
                torch.as_tensor(y),
                torch.as_tensor(sw),
            )
        if not use_sample_weight:
            sw = None
        make_generator = example_generator(
            x,
            y,
            sample_weight=sw,
            batch_size=16,
        )

        adapter = generator_data_adapter.GeneratorDataAdapter(make_generator())
        if backend.backend() == "numpy":
            it = adapter.get_numpy_iterator()
            expected_class = np.ndarray
        elif backend.backend() == "tensorflow":
            it = adapter.get_tf_dataset()
            expected_class = tf.Tensor
        elif backend.backend() == "jax":
            it = adapter.get_jax_iterator()
            expected_class = (
                jax.Array if generator_type == "jax" else np.ndarray
            )
        elif backend.backend() == "torch":
            it = adapter.get_torch_dataloader()
            expected_class = torch.Tensor

        sample_order = []
        for i, batch in enumerate(it):
            if use_sample_weight:
                self.assertEqual(len(batch), 3)
                bx, by, bsw = batch
            else:
                self.assertEqual(len(batch), 2)
                bx, by = batch
            self.assertIsInstance(bx, expected_class)
            self.assertIsInstance(by, expected_class)
            self.assertEqual(bx.dtype, by.dtype)
            self.assertContainsExactSubsequence(str(bx.dtype), "float32")
            if i < 2:
                self.assertEqual(bx.shape, (16, 4))
                self.assertEqual(by.shape, (16, 2))
            else:
                self.assertEqual(bx.shape, (2, 4))
                self.assertEqual(by.shape, (2, 2))
            if use_sample_weight:
                self.assertIsInstance(bsw, expected_class)
            for i in range(by.shape[0]):
                sample_order.append(by[i, 0])
        self.assertAllClose(sample_order, list(range(34)))

        print(f"*" * 50)

for _ in range(1000):
    TestCase().test_basic_flow(True, 'tf')
    print("All passed!")

Running this, I got the error below. Most of the runs succeed, but some fail.

InvalidArgumentError                      Traceback (most recent call last)
Cell In[2], line 85
     81         print(f"*" * 50)
     84 for _ in range(1000):
---> 85     TestCase().test_basic_flow(True, 'tf')
     86     print("All passed!")

Cell In[2], line 78, in TestCase.test_basic_flow(self, use_sample_weight, generator_type)
     76         self.assertIsInstance(bsw, expected_class)
     77     for i in range(by.shape[0]):
---> 78         sample_order.append(by[i, 0])
     79 self.assertAllClose(sample_order, list(range(34)))
     81 print(f"*" * 50)

File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/framework/ops.py:5983, in raise_from_not_ok_status(e, name)
   5981 def raise_from_not_ok_status(e, name) -> NoReturn:
   5982   e.message += (" name: " + str(name if name is not None else ""))
-> 5983   raise core._status_to_exception(e) from None

InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Expected begin, end, and strides to be 1D equal size tensors, but got shapes [2], [1], and [1] instead. [Op:StridedSlice] name: strided_slice/

So, can anyone confirm whether this is a bug or not?
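For context, the failing line by[i, 0] indexes a 2-D tf.Tensor and is lowered to a StridedSlice op, and the error says that op received begin, end, and strides vectors of different lengths ([2] vs. [1] and [1]). Below is a minimal sketch of the same indexing pattern in isolation; it does not reproduce the intermittent failure on its own, it only shows which op is involved:

import numpy as np
import tensorflow as tf

# Same pattern as the failing test line: scalar reads from a 2-D tensor.
# by[i, 0] is compiled into a single StridedSlice op.
by = tf.constant(np.arange(8, dtype="float32").reshape(4, 2))
for i in range(int(by.shape[0])):
    print(float(by[i, 0]))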

sachinprasadhs commented 1 month ago

I tried the code you provided in a Colab GPU runtime, and all 1000 runs printed the "All passed!" message. Attaching the Gist here for reference.

shashaka commented 1 month ago

@sachinprasadhs When I install Keras from source (the GitHub master branch), the issue reproduces. Can you check the Colab notebook below?

https://colab.sandbox.google.com/gist/sachinprasadhs/e73e2c7428f44ccc0d2ef486bed047c6/20027.ipynb

grasskin commented 1 month ago

Hi @shashaka, could we try and get a pared-down Colab of this issue? Please remove anything not relevant to TensorFlow and to this reproduction. Please also add keras.config.disable_traceback_filtering() so we can get the full error trace.
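For reference, disabling the filtering is just a call near the top of the reproduction script (a minimal sketch, assuming keras itself is importable in the environment):

import keras

# Show full, unfiltered tracebacks from inside Keras internals.
keras.config.disable_traceback_filtering()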

ghsanti commented 1 month ago

Here is a simplified gist (it shows the error), with traceback filtering disabled.

It happens with both GPU and CPU, though only some of the time!

PS: This might be obvious, but without the test environment there seems to be no error (gist).

shashaka commented 1 month ago

I also updated my gist based on @ghsanti's. It seems that this error occurs when slicing the data in the data generator.

https://colab.research.google.com/gist/shashaka/71e1e97d1459498c0bcca1fb4fc084d8/20027.ipynb
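As a quick way to separate the generator's own slicing from the adapter's pipeline, here is a minimal sketch that runs the same slice pattern as example_generator on a tf.constant directly (this is not the adapter's internal code, just the slicing the generator performs):

import math

import numpy as np
import tensorflow as tf

# The same batching slices example_generator takes, run outside the adapter.
x = tf.constant(np.random.random((34, 4)).astype("float32"))
batch_size = 16
for i in range(math.ceil(int(x.shape[0]) / batch_size)):
    low = i * batch_size
    high = min(low + batch_size, int(x.shape[0]))
    batch_x = x[low:high]  # a basic one-axis slice (also a StridedSlice op)
    print(i, batch_x.shape)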

grasskin commented 1 month ago

Thank you @shashaka and @ghsanti. Unless this shows up in our own testing environment (internally or in GitHub CI), we are unlikely to have the bandwidth to dive deeper into what is happening, since it might be environment specific. If you take a closer look and find the code pointer responsible, we'd be happy to support any PRs. Leaving this open for now!