Open cfhammill opened 5 years ago
Also a convenient way to issue interrupts to executors would be handy
Thanks for raising this issue! We are no longer using external-executor, but I think the points here are important to keep in mind if we end up implementing distributed execution in future versions of funflow. As such, I'll tag this issue and leave it up for reference.
Hi all,
I think it would be helpful if some documentation could be provided for how to monitor/diagnose/restart an ongoing pipeline, particularly in the Redis case.
Currently I connect to the Redis coordinator with redis-cli and track the job_running and jobs_queue keys to get a sense of what's happening. If things break I try to clean restart with:
chmod +w store #give write access to the store
rm store/lock/* #remove all locks
ls -d store/pending* | parallel -j<ncores> 'chmod -R +w {}; rm -r {}' #remove pending jobs which can prevent running
redis-cli LTRIM "job_running" 0 0; redis-cli LPOP "job_running"
As an aside, I think I should probably use DEL or FLUSHALL instead of the LTRIM/LPOP pair; this is just my Redis inexperience. If I don't do this, the "job_running" key grows with each run, potentially triggering the same job to run multiple times, which has caused failures.
This strategy was acquired through somewhat painful trial and error.
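For anyone wanting to repeat the restart steps above without retyping them, here is a rough sketch that wraps them in one shell function. It assumes the store layout (lock/, pending*) and the "job_running" key name are exactly as described in this post; none of this is confirmed funflow behaviour. The REDIS_CLI variable is a hypothetical override introduced here for illustration, and the sketch uses DEL rather than LTRIM followed by LPOP, and a plain loop rather than GNU parallel.

```shell
#!/bin/sh
# Sketch of the clean-restart steps from the post, wrapped in one function.
# Assumptions: store layout and Redis key name are as described in the post.
# REDIS_CLI is a hypothetical override, not something funflow provides.

clean_restart() {
    local store="$1"                      # path to the funflow store directory
    local cli="${REDIS_CLI:-redis-cli}"   # command used to talk to Redis
    local f

    if [ ! -d "$store" ]; then
        echo "no such store: $store" >&2
        return 1
    fi

    chmod +w "$store"                     # give write access to the store

    # Remove all locks; the glob may match nothing, so check each entry.
    for f in "$store"/lock/*; do
        if [ -e "$f" ]; then
            rm -f -- "$f"
        fi
    done

    # Remove pending jobs, which can prevent re-running.
    for f in "$store"/pending*; do
        if [ -e "$f" ]; then
            chmod -R +w "$f"
            rm -r -- "$f"
        fi
    done

    # Drop the running-jobs key in one step.
    "$cli" DEL job_running
}
```

Usage would be something like `clean_restart store`, run from the directory that contains the store while no executors are active.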
Documentation that would help me (and I suspect others):
GET returns some encoded data that I don't know how to decode.
store/metadata/hash-<hash>/{stdout,stderr}. Although this may be just a torque cluster idiosyncrasy.
metadata.db, I'm happy to poke at the SQLite tables if that's what's required. Alternatively, a pointer to a human-readable set of stages, preferably divided into queued, running, completed, and failed.
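In the meantime, a minimal sketch of the kind of inspection helpers I have in mind. It does not decode the stored values; it only reports each key's type and size using standard Redis commands (TYPE, LLEN, SCARD, ZCARD, HLEN, STRLEN), and lists non-empty stderr files under the metadata path mentioned above. The function names and the REDIS_CLI variable are hypothetical, and the stderr location is an assumption that may only hold on my cluster.

```shell
#!/bin/sh
# Inspection helpers (sketch). REDIS_CLI is a hypothetical override.

# Print "<key> <type> <size>", picking the size command from the key's type.
key_summary() {
    local key="$1"
    local cli="${REDIS_CLI:-redis-cli}"
    local ktype size
    ktype=$("$cli" TYPE "$key")
    case "$ktype" in
        list)   size=$("$cli" LLEN "$key") ;;
        set)    size=$("$cli" SCARD "$key") ;;
        zset)   size=$("$cli" ZCARD "$key") ;;
        hash)   size=$("$cli" HLEN "$key") ;;
        string) size=$("$cli" STRLEN "$key") ;;
        *)      size="-" ;;   # e.g. "none" when the key does not exist
    esac
    echo "$key $ktype $size"
}

# List stderr files with content under <store>/metadata/hash-*/.
# A non-empty stderr is only a hint, not proof that the job failed.
nonempty_stderr() {
    local store="$1"
    local f
    for f in "$store"/metadata/hash-*/stderr; do
        if [ -s "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

For example, `for k in job_running jobs_queue; do key_summary "$k"; done` gives a one-line status per key, and `nonempty_stderr store` points at logs worth reading first.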