openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

[Nvidia GPU] Add mechanism to detect nccl timeout and return error status #14897

Open Tixxx opened 3 months ago

Tixxx commented 3 months ago

The current behavior crashes the program whenever a NCCL async error has occurred, and timeout errors are not detected for async events at all. This PR adds a mechanism to:

  1. poll the statuses of async events and return a timeout error if a status stays pending for too long;
  2. return the NCCL async event status as an XLA status so that a proper Python exception can be thrown.
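
Roughly, the idea can be sketched as follows (illustrative only, not the PR's actual code: the function name WatchNcclAsyncStatus, the done flag, and the poll interval are placeholders):

#include <atomic>

#include "absl/status/status.h"
#include "absl/strings/str_cat.h"
#include "absl/time/clock.h"
#include "absl/time/time.h"
#include "third_party/nccl/nccl.h"  // or <nccl.h>, depending on the build

absl::Status WatchNcclAsyncStatus(ncclComm_t comm,
                                  const std::atomic<bool>& done,
                                  absl::Duration timeout) {
  const absl::Time deadline = absl::Now() + timeout;
  while (!done.load()) {
    ncclResult_t async_err = ncclSuccess;
    // Reports errors raised by NCCL's background progress for this communicator.
    if (ncclCommGetAsyncError(comm, &async_err) != ncclSuccess) {
      return absl::InternalError("ncclCommGetAsyncError failed");
    }
    if (async_err != ncclSuccess && async_err != ncclInProgress) {
      // Surface the async error as a status so it can become a proper Python
      // exception instead of crashing the process.
      return absl::InternalError(
          absl::StrCat("NCCL async error: ", ncclGetErrorString(async_err)));
    }
    if (absl::Now() > deadline) {
      return absl::DeadlineExceededError("NCCL async event timed out");
    }
    absl::SleepFor(absl::Milliseconds(100));
  }
  return absl::OkStatus();
}
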
thomasjoerg commented 2 months ago

Assigning new reviewer, because Levon is OOO now.

golechwierowicz commented 2 months ago

@frgossen can you take a look?

Tixxx commented 2 months ago

@frgossen Could you take another look at this PR? Thanks

Tixxx commented 2 months ago

Do you think we can add a test case for this? We could trigger a NCCL deadlock in a test case

Artificially triggering a collective deadlock might not be easy with a simple test case. I have been trying to get an HLO with send/recv to deadlock, but I'm hitting a bunch of errors in the HLO verifier; it looks like such programs have to go through some specific passes that add attributes so the compiler doesn't complain. I will try some more.
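
As a rough illustration only (not the actual test HLO from this PR; shapes, the channel_id, and any frontend attributes the verifier or GPU passes require are omitted), the kind of program that should hang is a recv with no matching send:

// Illustrative C++ constant holding a sketch of such an HLO module.
constexpr char kDeadlockHloSketch[] = R"(
HloModule deadlock_repro

ENTRY main {
  after-all = token[] after-all()
  recv = (f32[2], u32[], token[]) recv(after-all), channel_id=1
  recv-done = (f32[2], token[]) recv-done(recv), channel_id=1
  ROOT data = f32[2] get-tuple-element(recv-done), index=0
})";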

sgerrard commented 2 months ago

@Tixxx Do you have an updated status?

Tixxx commented 2 months ago

@Tixxx Do you have an updated status?

I was able to repro a deadlock using a recv program and have added it to the e2e test suite. @frgossen, could you take another look?

Tixxx commented 2 months ago

@frgossen I've addressed your comments above; can you take another look?

Tixxx commented 1 month ago

Could you please help take a look at CI failures?

Also, when I ran the CollectiveOpsTestE2E.NcclErrorPropagation test internally, I got SIGSEGV (general-protection fault). Does your test require a specific driver version, etc?

I'm using driver 535.161.07 with CUDA 12.5, and I'm running this on Hopper. On which platform and CUDA version did you observe the SIGSEGV?

Tixxx commented 1 month ago

Could you please help take a look at CI failures?

Also, when I ran the CollectiveOpsTestE2E.NcclErrorPropagation test internally, I got SIGSEGV (general-protection fault). Does your test require a specific driver version, etc?

@penpornk I finally got a repro with a different NCCL version. I think the issue is that when the GPU executable is shutting down, NCCL hasn't been fully torn down yet. I added a small delay before exiting the GPU executable. The test now passes on all platforms for me; can you give it another try? Also set XLA_FLAGS=--xla_gpu_nccl_termination_timeout_seconds=5 so Bazel doesn't time out.

Tixxx commented 1 month ago

@frgossen @penpornk I have addressed the SIGSEGV issue for the test. Can you give it a try and take another look at the PR? There hasn't been any progress for the past few days.

frgossen commented 1 month ago

I added a small delay before exiting the GPU executable

I don't think we should rely on a small delay. This will almost certainly lead to flaky tests. Can you try to find out why exactly the issue appears?

Tixxx commented 1 month ago

I added a small delay before exiting the GPU executable

I don't think we should rely on a small delay. This will almost certainly lead to flaky tests. Can you try to find out why exactly the issue appears?

@frgossen The root cause is that we abort the NCCL communicators in a separate monitor thread, so when the main thread shuts down the GPU executable, the communicators are not all cleaned up yet. Since the test infrastructure runs the same test multiple times for different platforms back to back, subsequent runs crash because the previous run's communicators have not been cleaned up in time. We could refactor this to move the communicator-aborting logic into the GPU executable and wait for the NCCL abort call to finish before exiting.
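
A minimal sketch of that refactoring (hypothetical names, not code from this PR): the abort path signals completion, and the executable's shutdown waits on that signal with a deadline instead of sleeping:

#include <vector>

#include "absl/synchronization/notification.h"
#include "absl/time/time.h"
#include "third_party/nccl/nccl.h"

struct NcclAbortTracker {
  absl::Notification all_comms_aborted;
};

// Runs on the thread that tears down the communicators.
void AbortAllComms(const std::vector<ncclComm_t>& comms,
                   NcclAbortTracker& tracker) {
  for (ncclComm_t comm : comms) {
    ncclCommAbort(comm);  // synchronous; tears down this communicator
  }
  tracker.all_comms_aborted.Notify();
}

// Called from the GPU executable's shutdown path; returns true if the abort
// finished before the deadline.
bool WaitForNcclAbort(NcclAbortTracker& tracker, absl::Duration timeout) {
  return tracker.all_comms_aborted.WaitForNotificationWithTimeout(timeout);
}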

akuegel commented 1 month ago

It seems there is now a merge conflict. Can you please resolve this?

Tixxx commented 1 month ago

It seems there is now a merge conflict. Can you please resolve this?

yes, just rebased

frgossen commented 1 month ago

@frgossen The root cause is that we abort the NCCL communicators in a separate monitor thread, so when the main thread shuts down the GPU executable, the communicators are not all cleaned up yet. Since the test infrastructure runs the same test multiple times for different platforms back to back, subsequent runs crash because the previous run's communicators have not been cleaned up in time. We could refactor this to move the communicator-aborting logic into the GPU executable and wait for the NCCL abort call to finish before exiting.

Sounds much cleaner indeed. Let's do that instead of waiting

Tixxx commented 1 month ago

@frgossen The root cause is that we abort the NCCL communicators in a separate monitor thread, so when the main thread shuts down the GPU executable, the communicators are not all cleaned up yet. Since the test infrastructure runs the same test multiple times for different platforms back to back, subsequent runs crash because the previous run's communicators have not been cleaned up in time. We could refactor this to move the communicator-aborting logic into the GPU executable and wait for the NCCL abort call to finish before exiting.

Sounds much cleaner indeed. Let's do that instead of waiting

@frgossen I went with a slightly different change than a full refactoring: I'm using a boolean to indicate the status of commAbort across all communicators. The flag is used in the GPU executable to wait for all communicators to be aborted before exiting, so we don't have to rely on a fixed delay. This is a bit simpler than refactoring the whole checkComm logic.
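
A compact sketch of that flag-based handshake (names are illustrative guesses, not necessarily the PR's; the executable-side polling corresponds to the gpu_executable.cc excerpt quoted later in this thread):

#include <atomic>
#include <vector>

#include "third_party/nccl/nccl.h"

struct NcclAbortFlag {
  std::atomic<bool> is_all_comms_aborted{false};
};

// Monitor thread: abort every communicator, then flip the flag so the GPU
// executable knows it is safe to exit.
void AbortCommsAndSignal(const std::vector<ncclComm_t>& comms,
                         NcclAbortFlag& flag) {
  for (ncclComm_t comm : comms) {
    ncclCommAbort(comm);  // synchronous per-communicator abort
  }
  // Set only after every abort has returned, so a reader that observes true
  // knows all communicators have been torn down.
  flag.is_all_comms_aborted.store(true);
}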

Tixxx commented 1 month ago

@frgossen can we expedite the review on this one? The StreamExecutor APIs keep changing, and I have to update a number of places in the CUDA driver and executor to keep this PR working as intended.

frgossen commented 1 month ago

@frgossen can we expedite the review on this one? The StreamExecutor APIs keep changing, and I have to update a number of places in the CUDA driver and executor to keep this PR working as intended.

I am having a hard time finding the relevant changes in this PR. I guess it is really a GitHub issue, but can you somehow remove all those unrelated formatting changes?

slightly different change than a full refactoring: I'm using a boolean to indicate the status

Wouldn't the right way to do this be some kind of a lock?

Tixxx commented 4 weeks ago

@frgossen can we expedite the review on this one? The StreamExecutor APIs keep changing, and I have to update a number of places in the CUDA driver and executor to keep this PR working as intended.

I am having a hard time finding the relevant changes in this PR. I guess it is really a GitHub issue, but can you somehow remove all those unrelated formatting changes?

Sure, I have removed the formatting changes; however, clang-format would then fail on GitHub, so I guess they need to be fixed during import.

slightly different change than a full refactoring: I'm using a boolean to indicate the status

Wouldn't the right way to do this be some kind of a lock?

The flag is set to true after all communicators are aborted, and NcclCommAbort is a synchronous call, so using a lightweight flag should be sufficient here. Or do you see other circumstances where this would fail?
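
One way to get a lock-like guarantee without an actual mutex (sketch only, not the PR's code) is to make the flag an atomic with release/acquire ordering, which gives any reader that observes true a happens-before edge back to the comm aborts:

#include <atomic>

std::atomic<bool> is_all_comms_aborted{false};

// Monitor thread, after all ncclCommAbort calls have returned.
void SignalAllCommsAborted() {
  is_all_comms_aborted.store(true, std::memory_order_release);
}

// GPU executable's wait loop.
bool AllCommsAborted() {
  return is_all_comms_aborted.load(std::memory_order_acquire);
}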

thomasjoerg commented 3 weeks ago

@frgossen Please have another look.

derdrdirk commented 6 days ago

Hi @Tixxx, I saw that you added an #if GOOGLE_CUDA || TENSORFLOW_USE_ROCM macro. The XLA team is actively working to reduce the number of these macros in the codebase, as they hurt maintainability.

Would it be possible to remove the macro?


// gpu_executable.cc
#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
    while (!se::gpu::AsGpuStream(stream_to_sync)->IsIdle()) {
      if (!async_status.async_op_status.ok()) {
        // NCCL error has occurred, wait for all communicators
        // to be aborted before returning.
        while (async_status.is_all_comms_aborted == false) {
          // Need to rendezvous while waiting for the signal in case
          // any rank is stuck when aborting.
          TF_RETURN_IF_ERROR(
              RendezvousAfterInitialization(run_options, debug_options));
        }

        return async_status.async_op_status;
      }
    }
#endif  // GOOGLE_CUDA || TENSORFLOW_USE_ROCM