tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0
186.6k stars 74.34k forks source link

Remove no_cuda_on_cpu_tap tag and fix tests #46425

Open Flamefire opened 3 years ago

Flamefire commented 3 years ago

This is using TF 2.4.0

Describe the problem

When building and running tests I stumbled over the seemingly required test tag no_cuda_on_cpu_tap which is only documented in one place:

    # Running cuda on cpu will trigger tests guarded by GOOGLE_CUDA but NCHW
    # won't be available, which result in test failures. So disable that.

This sounds like a workaround for a problem when writing the tests. The most obvious incarnation for this is e.g. //tensorflow/core/common_runtime:ring_reducer_test. That is added with tf_cuda_cc_test which adds 2 tests: One with _gpu suffix gpu tags and one without. However as the test code uses GOOGLE_CUDA to decide whether to test on CPU or GPU the CPU test version cannot be run when TF is built with CUDA, which is (AFAIK) the regular case.

Hence my suggestion would be to "fix" the tests to run only on GPU if it is tagged with gpu and pass that to the test code as a define (e.g. TF_TEST_USE_GPU)

SuryanarayanaY commented 1 year ago

Hi @Flamefire,

Apologies for the long delay.Assignee seems not working with TF team now and hence not sure of the status yet.

Could you please let us know is this still an issue in latest TF versions ?

Flamefire commented 1 year ago

Yes the same issue as described is still present in latest master.

See https://github.com/tensorflow/tensorflow/blob/a2ee21765553ae782439d833b7130dcba5897b6e/tensorflow/core/grappler/optimizers/BUILD#L137-L139 for the comment on the tag, https://github.com/tensorflow/tensorflow/blob/a2ee21765553ae782439d833b7130dcba5897b6e/tensorflow/core/common_runtime/BUILD#L2512-L2513 for the definition of the ring_reducer_test and the erronous check for GOOGLE_CUDA in that test and https://github.com/tensorflow/tensorflow/blob/a2ee21765553ae782439d833b7130dcba5897b6e/tensorflow/tensorflow.bzl#L453 for the definition of GOOGLE_CUDA depending on if TF is built with CUDA support

Venkat6871 commented 2 months ago

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant with the current state of the code base.

The Tensorflow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, with all the debugging information which could help us investigate.

Please follow the release notes to stay up to date with the latest developments which are happening in the Tensorflow space.

Flamefire commented 2 months ago

@Venkat6871 I double checked everything and the code is unchaged in current master, so the issue is still present.