This morning, I noticed the NeSI file system was very slow - it took about 20 seconds to get "ls" output.
SLURM was also terribly slow: squeue took quite a while to return its output.
Coincidentally and interestingly, I had a few instances of run_cybershake running. All of them stopped, complaining "database is locked".
We have revisited this many times without fully understanding the root cause. I am convinced that it is just an unfortunate drop in IO speed, which makes sqlite3 believe the DB is locked, as if other threads were blocking its access. There aren't any.
We run our workflow on a NeSI login node, which typically has many people connected to it and is occasionally abused.
The timeout could be longer, but 50 seconds should be reasonable.
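
For reference, here is a minimal sketch of how a longer timeout can be passed to Python's sqlite3 module. The helper name `connect_with_timeout` and the constant `DB_TIMEOUT_SECONDS` are hypothetical and only for illustration; the actual place where the workflow opens its DB may look different.

```python
import sqlite3

# 50 seconds is the value proposed above; sqlite3's own default is 5 seconds.
DB_TIMEOUT_SECONDS = 50


def connect_with_timeout(db_path, timeout=DB_TIMEOUT_SECONDS):
    """Open a connection that waits up to `timeout` seconds for a lock to clear.

    Hypothetical helper: if the lock is still held after `timeout` seconds,
    sqlite3 raises OperationalError("database is locked").
    """
    return sqlite3.connect(db_path, timeout=timeout)
```

With a longer timeout, a connection keeps retrying while the database is busy instead of giving up after the default 5 seconds, which should help when the file system is temporarily slow.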