paketo-buildpacks / health-checker

A Cloud Native Buildpack that provides a health check binary compatible with Docker health checks
Apache License 2.0

Failed health checks leave zombie processes #136

Open azatoth opened 4 months ago

azatoth commented 4 months ago

When the health check has failed, we notice defunct zombie processes are left behind

Expected Behavior

That the parent would clean up the process table

Current Behavior

Every time the health check fails, a zombie is left behind.

root@ip-10-0-0-39:/workspace# THC_PORT=8080 THC_PATH=/actuator/health /layers/paketo-buildpacks_health-checker/thc/bin/thc
Error:
request error: http://localhost:8080/actuator/health: Network Error: Network Error: Error encountered in the status line: timed out reading response
root@ip-10-0-0-39:/workspace# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
cnb          1     0  7 07:41 ?        00:01:04 java org.springframework.boot.loader.launch.JarLauncher
root        42     0  0 07:41 ?        00:00:00 /managed-agents/execute-command/amazon-ssm-agent
root       112    42  0 07:41 ?        00:00:00 /managed-agents/execute-command/ssm-agent-worker
cnb        185     1  0 07:42 ?        00:00:00 [thc] <defunct>
root       186   112  0 07:43 ?        00:00:01 /managed-agents/execute-command/ssm-session-worker ecs-execute-command-s44anltuwjhhfu3vtvoadatnyu
root       194   186  0 07:43 pts/0    00:00:00 sh
cnb        201     1  0 07:43 ?        00:00:00 [thc] <defunct>
root       203   194  0 07:43 pts/0    00:00:00 bash
cnb        214     1  0 07:43 ?        00:00:00 [thc] <defunct>
cnb        224     1  0 07:43 ?        00:00:00 [thc] <defunct>
cnb        232     1  0 07:44 ?        00:00:00 [thc] <defunct>
cnb        239     1  0 07:44 ?        00:00:00 [thc] <defunct>
cnb        248     1  0 07:44 ?        00:00:00 [thc] <defunct>
cnb        256     1  0 07:45 ?        00:00:00 [thc] <defunct>
cnb        265     1  0 07:45 ?        00:00:00 [thc] <defunct>
cnb        274     1  0 07:45 ?        00:00:00 [thc] <defunct>
cnb        281     1  0 07:46 ?        00:00:00 [thc] <defunct>
cnb        481     1  0 07:53 ?        00:00:00 [thc] <defunct>
cnb        491     1  0 07:53 ?        00:00:00 [thc] <defunct>
cnb        527     1  0 07:54 ?        00:00:00 [thc] <defunct>
cnb        538     1  0 07:55 ?        00:00:00 [thc] <defunct>
cnb        546     1  0 07:55 ?        00:00:00 [thc] <defunct>
cnb        560     1  0 07:55 ?        00:00:00 [thc] <defunct>
cnb        568     1  0 07:56 ?        00:00:00 /layers/paketo-buildpacks_health-checker/thc/bin/thc
root       569   203  0 07:56 pts/0    00:00:00 ps -ef

dmikusa commented 4 months ago

The JVM is running as PID1, and it's also the parent PID for the health check processes being run (I'm guessing just because it's PID1). At the same time, I doubt the JVM is set up to handle the responsibilities of PID1. PID1 is special: it has to handle signal propagation and reap zombie processes.
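To illustrate what "reaping" means here, the following is a minimal sketch, not how tini or the JVM is implemented. It assumes a POSIX system and Python 3.9+, and the function name `reap_children` is hypothetical. A process acting as PID1 would typically run something like this whenever it receives SIGCHLD, so that exited children are removed from the process table instead of lingering as `<defunct>` entries. Signal forwarding, the other PID1 duty mentioned above, is left out.

```python
import os


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) tuples. Calling waitpid() on a
    terminated child is what removes its <defunct> entry from the
    process table.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return
            # immediately if no child has exited yet.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        # waitstatus_to_exitcode requires Python 3.9+ (assumption).
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

A parent that never makes a call like this, as appears to be the case for the JVM in the `ps` output above, leaves each finished `thc` process behind as a zombie.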

The only option I know of would be to insert a process, like tini, that would handle the PID1 responsibilities. That's something that would have to be coordinated with the Java buildpack, because that is what sets the start command here, and it would need a way for other buildpacks to signal that it should include a process like tini. If it exposed an option like that, then we could make the health checker buildpack tell it to include tini.

Can you elaborate on the impact here & can you share the docker health check options you're using?

I was also thinking that if the health check fails, the container would usually be restarted. Just trying to understand the specifics of your setup. Thanks

azatoth commented 4 months ago

So the impact is pretty small since, as you said, the container would usually be restarted. I noticed it during the initial grace period, which we have set to 10m, and as it didn't look right I thought it best to report it.

dmikusa commented 4 months ago

Thanks for the report, much appreciated.

Given the impact seems to be low and the effort to resolve this would be high, I'm going to leave this as is for now. I will leave this issue open though. If others are having this issue and the impact is higher, please reach out.