timreimherr opened this issue 4 years ago
Pid files are deleted automatically on successful and on failed runs, and when the tap is stopped by `SIGINT` or `SIGTERM`. Pid files should remain only if the tap is stopped by `SIGKILL` (`kill -9`). Do you know how and by what the running taps were stopped?
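To illustrate the distinction, here is a generic Python sketch of pid-file handling (not pipelinewise's actual code; the path is hypothetical). Cleanup hooks fire on normal exits and on catchable signals, but `SIGKILL` can never be caught, so the file survives:

```python
import atexit
import os
import signal
import sys

PID_FILE = "/tmp/example-tap.pid"  # hypothetical location

def remove_pid_file():
    if os.path.exists(PID_FILE):
        os.remove(PID_FILE)

# Runs on normal exit, on unhandled exceptions, and on the catchable
# signals below -- but never on SIGKILL (kill -9), which the kernel
# delivers without giving the process any chance to clean up.
atexit.register(remove_pid_file)
signal.signal(signal.SIGINT, lambda *_: sys.exit(1))
signal.signal(signal.SIGTERM, lambda *_: sys.exit(1))

with open(PID_FILE, "w") as f:
    f.write(str(os.getpid()))

# ... the tap would do its work here ...
```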
How would the `--cleanup` option work? Would it go through every directory in `~/.pipelinewise` and delete every pid file? My concern with a cleanup option is that if a tap really is running, it could remove valid pid files as well, making it possible to start the same tap in multiple instances, which should never happen.
If we add a `--cleanup` option, how would you avoid running the same tap in multiple instances?
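For what it's worth, one common way to square cleanup with the single-instance guarantee (a sketch, not anything pipelinewise actually implements) is to delete a pid file only when its recorded pid no longer maps to a live process:

```python
import os

def cleanup_if_stale(pid_file: str) -> None:
    """Remove pid_file only if its owning process is provably gone."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return  # no file, or unreadable contents: nothing safe to do
    try:
        os.kill(pid, 0)  # signal 0 checks existence without side effects
    except ProcessLookupError:
        os.remove(pid_file)  # process is dead: the pid file is stale
    except PermissionError:
        pass  # a process with that pid exists under another user: leave it
```

Inside containers this check can still be fooled, since pid namespaces restart from scratch and pids get reused across runs, which is partly why the log-activity approach described below is attractive.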
@timreimherr did you ever come upon a solution for this?
It happens rarely enough that I haven't added a solution, but I think it happens when a preemptible node is killed by GKE.
Due to the nature of GKE and preemptible nodes, once a node is scheduled for deletion it receives a SIGTERM, but the underlying pods never know they're going to die until they are actually terminated. Again, for stateless services this causes no concern, since GKE simply spins up new nodes for the pods to be scheduled on.
My thinking is to wrap the pipelinewise execution in a script which checks for the presence of the specific files associated with that tap/target combination, and if it finds `*.pid` or `*.running` files it sleeps for twice the typical execution duration (10 mins in total). Then it checks that the `.running` log file is the same size as it was 10 mins earlier and, if so, deletes/renames the files and continues to execute the normal pipelinewise command. With debug logs and small batch sizes, writes to the log file happen very frequently, so an unchanged file is a strong signal that the previous run is dead.
This is likely to cause issues only in a few edge cases.
Perhaps a version of this logic would be robust enough to add as a built-in command-line option, I don't know.
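A minimal sketch of that wrapper in Python, assuming the default `~/.pipelinewise` layout; the tap/target names, file locations, and the 10-minute figure are illustrative, not pipelinewise internals:

```python
import glob
import os
import subprocess
import time

# Assumed layout and timings -- adjust to the real installation.
PIPELINEWISE_DIR = os.path.expanduser("~/.pipelinewise")
TAP, TARGET = "salesforce", "snowflake"  # hypothetical tap/target pair
WAIT_SECS = 10 * 60                      # ~2x the typical execution duration

tap_dir = os.path.join(PIPELINEWISE_DIR, TARGET, TAP)
leftovers = (glob.glob(os.path.join(tap_dir, "*.pid"))
             + glob.glob(os.path.join(tap_dir, "log", "*.running")))

if leftovers:
    # Snapshot file sizes, wait, then compare. An active run writes to
    # its .running log very frequently, so an unchanged size strongly
    # suggests the previous process is gone.
    sizes = {f: os.path.getsize(f) for f in leftovers}
    time.sleep(WAIT_SECS)
    unchanged = all(
        os.path.exists(f) and os.path.getsize(f) == size
        for f, size in sizes.items()
    )
    if unchanged:
        for f in leftovers:
            os.remove(f)

# Continue with the normal pipelinewise command.
subprocess.run(
    ["pipelinewise", "run_tap", "--tap", TAP, "--target", TARGET],
    check=True,
)
```

If the files did grow during the wait, the leftovers are left in place and pipelinewise itself will refuse to start a second instance, which is the safe failure mode.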
This may also be an option: https://github.com/GoogleCloudPlatform/k8s-node-termination-handler
Also just to mention, this issue may be resolved with this recent patch to make kubelet handle node shutdown gracefully.
> Pid files are deleted automatically on successful and on failed runs, and when the tap is stopped by `SIGINT` or `SIGTERM`. Pid files should remain only if the tap is stopped by `SIGKILL` (`kill -9`).
My observation is that on `SIGTERM` the log files are cleaned up (`Stopping gracefully...` appears in the logs), but the `.pid` file remains.
Subject
Pipelinewise fails to run in containerized environments due to leftover pid files from previous executions. When using pipelinewise in containerized environments you need volumes to persist data from the `import` process, which is then used in the `run_tap` process. However, leftover pid files from the `import` process are saved in the volume, which causes pipelinewise to think that a process is already running; it logs the message `logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Tap Salesforce is currently running` and the container dies.

Could you add a `--cleanup` flag to remove pid files before a process completes?

Your environment
Steps to reproduce

1. Create a Docker image that contains pipelinewise
2. Create a volume to persist data from the `import` process
3. Run the `import` process in Kubernetes using the image and volume
4. Run the `run_tap` process in Kubernetes using the image and volume

Expected behaviour
The data import is successful.
Actual behaviour
We get the message `logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Tap Salesforce is currently running` and the container dies.