Documentation request for diagnosing/restarting/monitoring pipelines

cfhammill commented 5 years ago

Hi all,

I think it would be helpful if some documentation could be provided for how to monitor/diagnose/restart an ongoing pipeline, particularly in the Redis case.

Currently I connect to the redis coordinator with redis-cli and track job_running and jobs_queue keys to get a sense of what's happening.
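For reference, that kind of check can be wrapped in a small shell function. This is only a sketch: the key names are the ones mentioned in this issue, `REDIS_CLI` is a hypothetical override variable (not something funflow defines), and it assumes the keys live in the database that `redis-cli` connects to by default. It reports each key's type and size without trying to decode the entries.

```shell
#!/bin/sh
# Sketch: summarise the two coordinator keys mentioned in this issue.
# REDIS_CLI is a hypothetical override, e.g. REDIS_CLI="redis-cli -h some-host".

queue_status() {
    for key in job_running jobs_queue; do
        # TYPE prints list, set, string, none, ... for the given key
        kind=$(${REDIS_CLI:-redis-cli} TYPE "$key")
        case $kind in
            list) count=$(${REDIS_CLI:-redis-cli} LLEN "$key") ;;
            set)  count=$(${REDIS_CLI:-redis-cli} SCARD "$key") ;;
            none) count=0 ;;      # key does not exist
            *)    count="?" ;;    # some other type; size not computed here
        esac
        printf '%s\t%s\t%s\n' "$key" "$kind" "$count"
    done
}
```

Calling `queue_status` repeatedly (for example in a loop with `sleep`) gives a rough view of how the two keys change over a run.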

If things break I try to clean restart with:

chmod +w store #give write access to the store
rm store/lock/* #remove all locks
ls -d store/pending* | parallel -j<ncores> 'chmod -R +w {}; rm -r {}' #remove pending jobs which can prevent running
redis-cli LTRIM "job_running" 0 0; redis-cli LPOP "job_running" #empty the "job_running" list. As an aside, I should probably use DEL or FLUSHALL instead; this is just my Redis inexperience. If I don't do this, the "job_running" key grows with each run, which can trigger the same job to run multiple times and has caused failures.
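The steps above can be collected into a script along these lines. It is a sketch of this workaround, not a supported funflow procedure. The store path is passed in as an argument, `REDIS_CLI` is a hypothetical override variable, and a plain loop stands in for `parallel`. Since Redis deletes a list key once its last element is removed, `DEL job_running` has the same effect as the LTRIM/LPOP pair. `FLUSHALL` would wipe every key in every database, so it is only appropriate if the coordinator holds nothing else worth keeping.

```shell
#!/bin/sh
# Sketch of the clean-restart workaround described in this issue.
# Nothing is executed until the functions are called explicitly.

# Remove locks and pending entries from the given store directory.
clean_store() {
    store=$1
    chmod +w "$store"                  # give write access to the store
    rm -f "$store"/lock/*              # remove all locks
    for d in "$store"/pending*; do     # remove pending jobs, one at a time
        [ -e "$d" ] || continue        # no pending entries: glob did not match
        chmod -R +w "$d"
        rm -r "$d"
    done
}

# Drop the job_running key so it does not grow across runs.
# REDIS_CLI is a hypothetical override for the redis-cli command line.
reset_running() {
    ${REDIS_CLI:-redis-cli} DEL job_running
}
```

With the pipeline stopped, a restart would then be `clean_store store && reset_running`.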

This strategy was acquired through somewhat painful trial and error.

Documentation that would help me (and I suspect others):

How to inspect the contents of "job_running". GET returns some encoded data that I don't know how to decode.
Understanding what "jobs_queue" is for; everything seems to go to "job_running" after startup.
An explanation of how logging works. My STDOUT and STDERR end up in my cluster log files, not in store/metadata/hash-<hash>/{stdout,stderr}, although this may just be a Torque cluster idiosyncrasy.
What's in metadata.db. I'm happy to poke at the SQLite tables if that's what's required. Alternatively, a pointer to a human-readable set of stages, preferably divided into queued, running, completed, and failed.
Garbage collection and caching examples. I don't really have disk space for multiple copies of my pipeline. Alternatively I could avoid caching certain steps, but how to do so isn't immediately clear.
Tips for avoiding unnecessary re-runs. I've had to run the same pipeline on multiple sets of hardware, and it seems that this has triggered re-runs (or it was my store-manipulation tomfoolery; I'm not sure).

cfhammill commented 5 years ago

Also, a convenient way to issue interrupts to executors would be handy.

dorranh commented 3 years ago

Thanks for raising this issue! We are no longer using external-executor, but I think the points here are important to keep in mind if we end up implementing distributed execution in future versions of funflow. As such I'll tag this issue and leave it up for reference.

tweag / funflow

Documentation request for diagnosing/restarting/monitoring pipelines #133