timreimherr opened this issue 4 years ago
Pid files are deleted automatically on successful and on failed runs, and when the tap is stopped by `SIGINT` or `SIGTERM`. Pid files should remain only if the tap is stopped by `SIGKILL` (`kill -9`). Do you know how and by what the running taps were stopped?
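To illustrate the distinction, here is a generic Python sketch of pid-file handling (not pipelinewise's actual code; the path is hypothetical). Cleanup hooks fire on normal exits and on catchable signals, but `SIGKILL` can never be caught, so the file survives:

```python
import atexit
import os
import signal
import sys

PID_FILE = "/tmp/example-tap.pid"  # hypothetical location

def remove_pid_file():
    if os.path.exists(PID_FILE):
        os.remove(PID_FILE)

# Runs on normal exit, on unhandled exceptions, and on the catchable
# signals below -- but never on SIGKILL (kill -9), which the kernel
# delivers without giving the process any chance to clean up.
atexit.register(remove_pid_file)
signal.signal(signal.SIGINT, lambda *_: sys.exit(1))
signal.signal(signal.SIGTERM, lambda *_: sys.exit(1))

with open(PID_FILE, "w") as f:
    f.write(str(os.getpid()))

# ... the tap would do its work here ...
```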
How would the `--cleanup` option work? Would it go through every directory in `~/.pipelinewise` and delete every pid file? My concern with a cleanup option is that if a tap really is running, it could remove valid pid files as well, making it possible to start the same tap in multiple instances, which should never happen.
If we add a `--cleanup` option, how would you avoid running the same tap in multiple instances?
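For what it's worth, one common way to square cleanup with the single-instance guarantee (a sketch, not anything pipelinewise actually implements) is to delete a pid file only when its recorded pid no longer maps to a live process:

```python
import os

def cleanup_if_stale(pid_file: str) -> None:
    """Remove pid_file only if its owning process is provably gone."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return  # no file, or unreadable contents: nothing safe to do
    try:
        os.kill(pid, 0)  # signal 0 checks existence without side effects
    except ProcessLookupError:
        os.remove(pid_file)  # process is dead: the pid file is stale
    except PermissionError:
        pass  # a process with that pid exists under another user: leave it
```

Inside containers this check can still be fooled, since pid namespaces restart from scratch and pids get reused across runs, which is partly why the log-activity approach described below is attractive.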
@timreimherr did you ever come upon a solution for this?
It happens rarely enough that I haven't added a solution, but I think it happens when a preemptible node is killed by GKE.
Due to the nature of GKE and preemptible nodes, once a node is scheduled for deletion it receives a SIGTERM, but the underlying pods never know they're going to die until they are actually terminated. Again, for stateless services this causes no concern, since GKE simply spins up new nodes for the pods to be scheduled on.
My thinking is to wrap the pipelinewise execution in a script which checks for the presence of the specific files associated with that tap/target combination, and if it finds `*.pid` or `*.running` files it sleeps for twice the typical execution duration (10 mins in total). Then it checks that the `.running` log file is the same size as it was 10 mins earlier and, if so, deletes/renames the files and continues to execute the normal pipelinewise command. With debug logs and small batch sizes, writes to the log file happen very frequently, so an unchanged file is a strong signal that the previous run is dead.
This is likely to cause issues only in a few edge cases.
Perhaps a version of this logic would be robust enough to add as a built-in command-line option, I don't know.
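A minimal sketch of that wrapper in Python, assuming the default `~/.pipelinewise` layout; the tap/target names, file locations, and the 10-minute figure are illustrative, not pipelinewise internals:

```python
import glob
import os
import subprocess
import time

# Assumed layout and timings -- adjust to the real installation.
PIPELINEWISE_DIR = os.path.expanduser("~/.pipelinewise")
TAP, TARGET = "salesforce", "snowflake"  # hypothetical tap/target pair
WAIT_SECS = 10 * 60                      # ~2x the typical execution duration

tap_dir = os.path.join(PIPELINEWISE_DIR, TARGET, TAP)
leftovers = (glob.glob(os.path.join(tap_dir, "*.pid"))
             + glob.glob(os.path.join(tap_dir, "log", "*.running")))

if leftovers:
    # Snapshot file sizes, wait, then compare. An active run writes to
    # its .running log very frequently, so an unchanged size strongly
    # suggests the previous process is gone.
    sizes = {f: os.path.getsize(f) for f in leftovers}
    time.sleep(WAIT_SECS)
    unchanged = all(
        os.path.exists(f) and os.path.getsize(f) == size
        for f, size in sizes.items()
    )
    if unchanged:
        for f in leftovers:
            os.remove(f)

# Continue with the normal pipelinewise command.
subprocess.run(
    ["pipelinewise", "run_tap", "--tap", TAP, "--target", TARGET],
    check=True,
)
```

If the files did grow during the wait, the leftovers are left in place and pipelinewise itself will refuse to start a second instance, which is the safe failure mode.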
This may also be an option: https://github.com/GoogleCloudPlatform/k8s-node-termination-handler
Also just to mention, this issue may be resolved with this recent patch to make kubelet handle node shutdown gracefully.
> Pid files are deleted automatically on successful and on failed runs, and when the tap is stopped by `SIGINT` or `SIGTERM`. Pid files should remain only if the tap is stopped by `SIGKILL` (`kill -9`).
My observation is that on `SIGTERM` the log files are cleaned up (`Stopping gracefully...` appears in the logs), but the `.pid` file remains.
Subject
Pipelinewise fails to run in containerized environments due to leftover pid files from previous executions. When using pipelinewise in containerized environments you need volumes to persist data from the `import` process, which is then used in the `run_tap` process. However, leftover pid files from the `import` process are saved in the volume, which causes pipelinewise to think that a process is already running; it logs the message `logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Tap Salesforce is currently running` and the container dies.

Could you add a `--cleanup` flag to remove pid files before a process completes?

Your environment
Steps to reproduce

1. Create a Docker image that contains pipelinewise
2. Create a volume to persist data from the `import` process
3. Run the `import` process in Kubernetes using the image and volume
4. Run the `run_tap` process in Kubernetes using the image and volume

Expected behaviour
The data import is successful.
Actual behaviour
We get the message `logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Tap Salesforce is currently running` and the container dies.