kubernetes-client / java

Official Java client library for kubernetes
http://kubernetes.io/
Apache License 2.0

Copy.copyFileToPod() hangs waiting for process to complete #1822

Closed PredatorVI closed 2 years ago

PredatorVI commented 3 years ago

Client Version: 13.0.0
Kubernetes (Server) Version: 1.19.12-gke.2100
Java Version: 1.8.0_291

I have copyFileFromPod() working, but copyFileToPod() hangs at Copy.java line 459, in proc.waitFor(). I've pulled the source, but I don't understand the websockets well enough to know what is going on. A copy command from the command line works fine.

I've verified that the 'tar' and 'base64' executables exist in the container:

> root@test-gcp-dev-gke-guse4a-0:/tmp# which tar
> /usr/bin/tar
> root@test-gcp-dev-gke-guse4a-0:/tmp# which base64
> /usr/bin/base64
> root@test-gcp-dev-gke-guse4a-0:/tmp# cat /etc/issue
> Ubuntu 20.04.1 LTS \n \l

I've not changed the timeouts as I'd expect a simple text file copy would work within the defaults.

It never seems to timeout or throw an error so I don't have any stack traces.

Also, I tried API version 12.0.1 (throws SocketTimeout almost immediately) and 11.0.2 (hangs just like 13.0.0).
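
For reference, a minimal sketch of the call that hangs (the namespace, pod name, and paths are placeholders; assumes a default ApiClient configured from kubeconfig):

    import io.kubernetes.client.Copy;
    import io.kubernetes.client.openapi.ApiClient;
    import io.kubernetes.client.openapi.Configuration;
    import io.kubernetes.client.util.Config;
    import java.nio.file.Paths;

    public class CopyToPodRepro {
        public static void main(String[] args) throws Exception {
            ApiClient client = Config.defaultClient();
            Configuration.setDefaultApiClient(client);

            // Signature: copyFileToPod(namespace, pod, container, srcPath, destPath).
            // This call never returns; it blocks in proc.waitFor().
            new Copy().copyFileToPod(
                "default", "test-gcp-dev-gke-guse4a-0", null,
                Paths.get("C:/tmp/adduser.conf"), Paths.get("/tmp/adduser.conf"));
        }
    }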

brendandburns commented 3 years ago

For some reason the untar command on the other side isn't completing. It's hard to know why. I don't think this is a WebSockets issue; I think it is an issue with the process in the container that is unpacking the copied files (or, alternately, with the input data not being closed properly).

How are you supplying the data that is getting copied? Byte array? Local file path? Is it one file or many?

I will try to repro locally.

PredatorVI commented 3 years ago

I've started testing using a simple text file.

The test I'm doing is to first call copyFileFromPod() to grab a config file from the pod (POD:/etc/adduser.conf --> LOCAL:C:/tmp/adduser.conf). I then turn around and copy that same file using copyFileToPod() (LOCAL:C:/tmp/adduser.conf --> POD:/tmp/adduser.conf).

I am using the java.nio.file.Path variant of the method:

public void copyFileToPod(String namespace, String pod, String container, Path srcPath, Path destPath) throws ApiException, IOException

I'm open to suggestions on how to narrow this down further or debug it.

Thanks!!

PredatorVI commented 3 years ago

Running 'ps -ef | grep tar' in the container, I noticed that the path was not right. I am setting destPath (in the remote Ubuntu container) to Paths.get("/tmp/adduser.conf"), but it seems to be using the Windows path delimiter.

root 1344130 0 0 16:33 ? 00:00:00 sh -c base64 -d | tar -xmf - -C \tmp
root 1344136 1344130 0 16:33 ? 00:00:00 tar -xmf - -C tmp

I pulled the source and added a parentPath = parentPath.replace("\\", "/") call just to see if that fixed it. It still seems to hang, but the paths look better:

root 1214170 0 0 16:33 ? 00:00:00 sh -c base64 -d | tar -xmf - -C /tmp
root 12144136 1214170 0 16:33 ? 00:00:00 tar -xmf - -C /tmp

There appear to be no variants of copyFileToPod() that take a String for the remote destPath the way copyFileFromPod() does for the remote srcPath, so there is a potential issue when using native Path/File references where the source and destination systems use different path delimiters.
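
A minimal sketch of that normalization (toUnixPath is a hypothetical helper, not part of the library; remote container paths are POSIX-style regardless of the client OS):

    import java.nio.file.Path;
    import java.nio.file.Paths;

    final class RemotePaths {
        // Path.toString() uses the client OS separator ('\' on Windows),
        // but the remote Linux container expects '/'.
        static String toUnixPath(Path p) {
            return p.toString().replace('\\', '/');
        }

        public static void main(String[] args) {
            System.out.println(toUnixPath(Paths.get("\\tmp\\adduser.conf"))); // /tmp/adduser.conf
        }
    }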

I also noticed that once I kill the client process, the copy seems to finish and the file does show up. I didn't check the content, but I will once I get the pod back online.

PredatorVI commented 3 years ago

File content appears to be written correctly when I stop/kill the client-side process. This is where my unfamiliarity with WebSockets isn't helping. I don't know how the EOF/end of input for the remote process (sh -c base64 -d | tar -xmf - -C /tmp) is triggered.

I tried grasping at straws:

brendandburns commented 3 years ago

Thanks for investigating! Basically, the web socket should close when the process on the client side ends. In this case that process will only end when the stdin it is reading from closes (at least, that's what I think should happen).

If you're willing/able to investigate further, one thing you might try is removing the base64 encoding. I don't actually think that it is necessary, and it's possible that it's causing the problem. Then instead of sh -c ... you could just run tar -xmf - -C /tmp

It's possible that spawning the extra shell (via sh -c ...) is what is causing stdin to hang open.

PredatorVI commented 3 years ago

Still the same behavior. Here is the process output showing it isn't calling 'sh -c'. Hopefully the format is correct.

root@test-gcp-dev-gke-guse4a-0:/tmp# ps -ef | grep tar
root 141926 0 0 16:05 ? 00:00:00 tar -xmf - -C /tmp

Here is the updated code:

    public Future<Integer> copyFileToPodAsync(
            String namespace, String pod, String container, Path srcPath, Path destPath)
            throws ApiException, IOException {
        // Run decoding and extracting processes
        final Process proc = execCopyToPod(namespace, pod, container, destPath);

        // Send encoded archive output stream
        File srcFile = new File(srcPath.toUri());
        try (
                ArchiveOutputStream archiveOutputStream = new TarArchiveOutputStream(proc.getOutputStream());
                FileInputStream input = new FileInputStream(srcFile)) {
            ArchiveEntry tarEntry = new TarArchiveEntry(srcFile, destPath.getFileName().toString());

            archiveOutputStream.putArchiveEntry(tarEntry);
            Streams.copy(input, archiveOutputStream);
            archiveOutputStream.closeArchiveEntry();
            archiveOutputStream.finish();

            return new ProcessFuture(proc);
        }
    }

    private Process execCopyToPod(String namespace, String pod, String container, Path destPath)
            throws ApiException, IOException {
        String parentPath = destPath.getParent() != null ? destPath.getParent().toString() : ".";
        parentPath = parentPath.replace("\\", "/");
        return this.exec(
                namespace,
                pod,
                new String[]{"tar", "-xmf", "-", "-C " + parentPath},
                container,
                true,
                false);
    }

When I kill my client process now, however, the file does not get created as it did before. I reverted to using 'sh -c' but left off the base64 encoding/decoding steps; the behavior goes back to hanging, but the file does get created when I kill my client process.

PredatorVI commented 3 years ago

I wrote this hack method to test it in a more 'synchronous' way hoping to be able to debug better.

My first attempt was to explicitly read proc.getInputStream() and proc.getErrorStream(), thinking they needed to be consumed/flushed before the process could complete, but that didn't change the behavior, so I pulled that code since it didn't seem to help.

I then split the TAR creation out to write a local temporary archive, thinking maybe the TarArchiveOutputStream was causing the hang-up for some reason. That didn't help either.

The code below basically works by doing a proc.destroy() after the try-with-resources streams are closed, and it always returns '0', since closing the websockets appears to allow the file creation to complete.

I have not yet found the right combination of flush()/close() calls that allows the process to exit normally. I think I've exhausted the limits of my understanding of the Copy.copyFileToPod() methods. Maybe there is an issue in the WebSocket handling? I'm just starting to look down that road.

    private int copyFileToPodBruteForce(
            String namespace, String pod, String container, Path srcPath, Path destPath)
            throws ApiException, IOException {
        // Run decoding and extracting processes
        final Process proc = execCopyToPod(namespace, pod, container, destPath);

        // Send encoded archive output stream
        File srcFile = new File(srcPath.toUri());
        try (ArchiveOutputStream archiveOutputStream
                = new TarArchiveOutputStream(proc.getOutputStream());
                FileInputStream input = new FileInputStream(srcFile)) {
            ArchiveEntry tarEntry = new TarArchiveEntry(srcFile, destPath.getFileName().toString());

            archiveOutputStream.putArchiveEntry(tarEntry);
            Streams.copy(input, archiveOutputStream);
            archiveOutputStream.closeArchiveEntry();
            archiveOutputStream.flush();
            archiveOutputStream.finish();
        }
        // Destroying the process closes the WebSocket, which is what finally
        // lets the remote tar exit and the file appear.
        proc.destroy();
        return 0;
    }

PredatorVI commented 3 years ago

The changes merged from Pull Request #1835 will allow me to use my work-around copyFileToPodBruteForce() method (see previous comment) to successfully copy files to the pod. However, the current copyFileToPod() methods still don't work (for me) as currently implemented.

I'm curious whether others have issues using these methods. I've tried copyFileToPod() against both Ubuntu 20.04 and Alpine Linux 3.12 images running in our Google GKE 1.19.12-gke.2100 cluster.

Currently, the method copyFileToPod() has the line

int exit = copyFileToPodAsync(namespace, pod, container, srcPath, destPath).get();

which is effectively

int exit = ProcessFuture<Integer>.get();

In the get() method, it calls proc.waitFor(), which ultimately gets stuck on this.latch.await() in the ExecProcess class, waiting for the latch counter to decrement.

My working theory is that the only way it will exit is if this.latch.countDown() is called, and the only times that happens are when the following are called:

  • Exec.ExecProcess.streamHandler.handleMessage() // called for stream id=3 (remote exec completes/returns status)
  • Exec.ExecProcess.streamHandler.failure() // via low-level exception handling?
  • Exec.ExecProcess.streamHandler.close() // closes the stream handler

Without an explicit proc.destroy() (ignoring the failure() case), the remote process will never exit and send a message via stream=3. There does not appear to be any other way to cause latch.countDown() to be called.

So barring some other mechanism to signal to the remote exec process that the input stream is done/closed (equivalent of CTRL-D?), calling destroy() seems to be the only way for this to work.

I did try sending the character equivalent of CTRL-D and closing the OutputStream, without success. The only thing that ultimately works is to close the actual socket, which only happens if destroy() is called.
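
A simplified model of the latch behavior described above (illustrative only; not the library's actual ExecProcess code):

    import java.util.concurrent.CountDownLatch;

    // Illustrative model of how waitFor() can block forever.
    class ExecProcessModel {
        private final CountDownLatch latch = new CountDownLatch(1);
        private volatile int exitCode = -1;

        // The only paths that ever count the latch down:
        void onStatusMessage(int code) { exitCode = code; latch.countDown(); } // stream id=3
        void onFailure(Throwable t) { latch.countDown(); }  // low-level exception
        void onClose() { latch.countDown(); }               // socket/stream handler closed

        int waitFor() throws InterruptedException {
            latch.await(); // hangs forever if none of the above ever fires
            return exitCode;
        }
    }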

mindcrime commented 3 years ago

FWIW, I am having this same issue with copyFileToPod(String, String, String, Path, Path) in the Java API, version 13.0.1-SNAPSHOT.

As soon as the client code hits the copyFileToPod line, it hangs there apparently indefinitely. Only when I kill the client side process does the file get written in the container. The container in question is the only container in the pod and is running the fedora:latest image. The Kubernetes cluster in question is an Azure AKS cluster.

OTOH, copyFileToPodAsync() has no problem. I just took the Future it returned, wrapped it in a while(true) loop with a check on isDone(), and everything worked as expected.
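
A sketch of that polling workaround, assuming the copyFileToPodAsync() overload shown earlier in the thread (pod name and paths are placeholders; a deadline is added since the Future may never complete on its own):

    import io.kubernetes.client.Copy;
    import io.kubernetes.client.openapi.ApiException;
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.concurrent.Future;

    public class PollingCopy {
        public static void main(String[] args)
                throws ApiException, IOException, InterruptedException {
            Future<Integer> f = new Copy().copyFileToPodAsync(
                    "default", "my-pod", null,
                    Paths.get("/tmp/local.txt"), Paths.get("/tmp/remote.txt"));

            // Poll isDone() instead of blocking in get(); give up after 30s.
            long deadline = System.currentTimeMillis() + 30_000;
            while (!f.isDone() && System.currentTimeMillis() < deadline) {
                Thread.sleep(100);
            }
            System.out.println("done=" + f.isDone());
        }
    }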

BenderRodrigez commented 3 years ago

I have encountered the same issue in my application. I was trying to build a simple tool that would watch a local directory and copy changed content into a specific pod, but unfortunately all attempts to use the Copy mechanism failed because waitFor() was stuck indefinitely.

I also tried a self-written exec analog using the same tar mechanism, but the result was the same.

My last attempt was simply to pull the file from the server using wget (it's nastier, but at least there's no messing with streaming content through sockets). The result was also quite disappointing:

java.net.SocketException: Connection or outbound has been closed
        at java.base/sun.security.ssl.SSLSocketOutputRecord.deliver(SSLSocketOutputRecord.java:267)
        at java.base/sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1224)
        at okio.OutputStreamSink.write(JvmOkio.kt:53)
        at okio.AsyncTimeout$sink$1.write(AsyncTimeout.kt:103)
        at okio.RealBufferedSink.flush(RealBufferedSink.kt:267)
        at okhttp3.internal.ws.WebSocketWriter.writeControlFrame(WebSocketWriter.kt:142)
        at okhttp3.internal.ws.WebSocketWriter.writeClose(WebSocketWriter.kt:102)
        at okhttp3.internal.ws.RealWebSocket.writeOneFrame$okhttp(RealWebSocket.kt:533)
        at okhttp3.internal.ws.RealWebSocket$WriterTask.runOnce(RealWebSocket.kt:620)
        at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:116)
        at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.kt:42)
        at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:65)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

This is a list of the processes that were not able to terminate, left over after a few attempts to run a simple wget command. Sending kill -9 didn't kill them, and they remain in a zombie state.

/usr/local/tomcat/webapps # ps -ef
PID   USER     TIME  COMMAND
    1 root     23:44 /usr/lib/jvm/default-jvm/bin/java
  709 root      0:00 sh
 1323 root      0:00 sh
 1993 root      0:00 [ssl_client]
 1995 root      0:00 [ssl_client]
 2239 root      0:00 [wget]
 2240 root      0:00 [ssl_client]
 2253 root      0:00 [wget]
 2254 root      0:00 [ssl_client]
 2270 root      0:00 [ssl_client]
 2348 root      0:00 [ssl_client]
 2354 root      0:00 [ssl_client]
 2377 root      0:00 [ssl_client]
 2390 root      0:00 [ssl_client]
 2476 root      0:00 [ssl_client]
 2486 root      0:00 [ssl_client]
 2495 root      0:00 [ssl_client]
 5853 root      0:00 [ssl_client]
 6032 root      0:00 sh
 6102 root      0:00 [ssl_client]
 8112 root      0:00 sh
 8186 root      0:00 [ssl_client]
 8253 root      0:00 [ssl_client]
 8261 root      0:00 [ssl_client]
 8610 root      0:00 ps -ef

My cluster is AWS EKS with Kubernetes version 1.19, and I tested on the latest version of the client at the moment, 13.0.1.

pflueras commented 3 years ago

The tar command works just fine locally:

    Path destPath = Paths.get("/tmp/fromFile");
    final String[] tarCommand = {"sh", "-c", "tar xmf - -C " + destPath.getParent().toString()};
    final Process tarProcess = new ProcessBuilder(tarCommand).start();

    File srcFile = new File(Paths.get("/tmp/toFile").toUri());
    try (OutputStream tarOutputStream = tarProcess.getOutputStream();
         ArchiveOutputStream archiveOutputStream = new TarArchiveOutputStream(tarOutputStream);
         FileInputStream inputStream = new FileInputStream(srcFile)) {
      ArchiveEntry tarEntry = new TarArchiveEntry(srcFile, destPath.getFileName().toString());
      archiveOutputStream.putArchiveEntry(tarEntry);
      IOUtils.copy(inputStream, archiveOutputStream);
      archiveOutputStream.closeArchiveEntry();
      archiveOutputStream.finish();
    }
    // Locally, closing the streams delivers EOF, so tar exits promptly.
    int exit = tarProcess.waitFor();

I think the issue is https://github.com/kubernetes/kubernetes/issues/89899: basically, the remote exec command does not detect the end of STDIN.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-client/java/issues/1822#issuecomment-1094141623):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

imilos commented 2 years ago

I can confirm that this issue still exists in the newest version of client-java, 16.0.0. I am using a local installation of K8s 1.22.

AlexMAS commented 1 year ago

Yes, this issue still exists in 2023 :)

The reason is that the tar process is waiting for the end of its input stream (EOF), and it waits forever because it never receives EOF.

When we start the (un)tar process manually, we do something like this:

tar -xmf - -C . < archive.tar

which is effectively the same as:

cat archive.tar | tar -xmf - -C .

As you can see, here we have a pipeline: one process (the OS itself, or cat) provides data, and another (tar) consumes it. As soon as the provider completes, the pipe closes, and the consumer gets EOF and terminates.

In our case the copyFileToPod() method never finishes because it never closes the pipe (i.e. never closes the WS connection?!). Thus we are stuck infinitely in proc.waitFor()...

Destroying the process after copying doesn't work - the tar process lives on remotely until the client app terminates.
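
The EOF semantics are easy to see with a purely local equivalent; closing stdin is the signal that lets the consumer exit, and it is exactly this signal that never reaches the remote tar (a minimal sketch):

    import java.io.OutputStream;

    public class EofDemo {
        public static void main(String[] args) throws Exception {
            // 'cat' reads stdin until EOF, then exits.
            Process p = new ProcessBuilder("cat").start();
            try (OutputStream stdin = p.getOutputStream()) {
                stdin.write("hello\n".getBytes());
            } // closing the stream delivers EOF to the child
            // Returns promptly with exit code 0 because EOF was delivered.
            System.out.println("exit=" + p.waitFor());
        }
    }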

swzaaaaaaa commented 5 months ago

In 2024, has this problem still not been solved? Does anyone have a good solution?

swzaaaaaaa commented 5 months ago

> The changes merged from Pull Request #1835 will allow me to use my work-around copyFileToPodBruteForce() method (see previous comment) to successfully copy files to the pod. However, the current copyFileToPod() methods still don't work (for me) as currently implemented. […] The only thing that seems to work ultimately is to close the actual socket, which only happens if destroy() is called.

(Quoting @PredatorVI's earlier comment in full.)

Excuse me, is there a good solution now?

swzaaaaaaa commented 5 months ago

> The tar command works just fine locally: […] I think the issue is kubernetes/kubernetes#89899. Basically the remote exec command does not detect the end of STDIN.

(Quoting @pflueras's earlier comment in full.)

Excuse me, is there a good solution now?

danmoldo commented 4 months ago

This issue is still present in version 20.0.1. Has anyone got a workaround?

guillaume-delalondre commented 2 months ago

> This issue is still present in version 20.0.1. Has anyone got a workaround?

As a workaround, I've copied the method from 1.0.1 into my class:

    private void copyFileToPod(String namespace, String pod, String container, Path srcPath, Path destPath)
            throws ApiException, IOException {
        // Run decoding and extracting processes
        final Process proc = execCopyToPod(namespace, pod, container, destPath);

        // Send encoded archive output stream (Base64OutputStream with line
        // length 0 encodes without inserting line breaks)
        File srcFile = new File(srcPath.toUri());
        try (ArchiveOutputStream archiveOutputStream = new TarArchiveOutputStream(
                new Base64OutputStream(proc.getOutputStream(), true, 0, null));
                FileInputStream input = new FileInputStream(srcFile)) {
            ArchiveEntry tarEntry = new TarArchiveEntry(srcFile, destPath.getFileName().toString());

            archiveOutputStream.putArchiveEntry(tarEntry);
            ByteStreams.copy(input, archiveOutputStream);
            archiveOutputStream.closeArchiveEntry();
        } finally {
            // Closing the WebSocket via destroy() is what lets the remote
            // tar process see end-of-input and finish writing the file.
            proc.destroy();
        }
    }
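
Note that this mirrors the brute-force approach earlier in the thread: the copy only completes because proc.destroy() in the finally block tears down the WebSocket, standing in for the EOF that the remote exec never delivers.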