ucgmsim / slurm_gm_workflow

Porting the GM workflow to run on new NeSI HPC (Maintainer: Jonney)
MIT License

Revisit to the *DB is locked* issue #529

Closed: sungeunbae closed this issue 2 months ago

sungeunbae commented 2 months ago

This morning, I noticed that the NeSI file system was very slow: it took about 20 seconds to get "ls" output. SLURM was also terribly slow, and it took quite a while to get the squeue output.

Coincidentally and interestingly, I had a few instances of run_cybershake running. All of them stopped, complaining "database is locked".

We have revisited this many times without fully understanding the root cause. I am now convinced that it is just an unfortunate drop in IO speed, which makes sqlite3 believe the DB is locked, as if other threads were blocking its access. There aren't any.

We run our workflow on a NeSI login node, which typically has many people connected to it and is occasionally abused.

The timeout could be longer, but 50 seconds should be reasonable.
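As a minimal sketch of what a longer timeout looks like, assuming the workflow opens its DB through Python's standard sqlite3 module (whose default timeout is 5 seconds): the helper name `connect_db` and the constant `DB_TIMEOUT_SECONDS` are hypothetical and are not taken from the workflow code.

```python
import sqlite3

# Timeout value suggested in this issue; Python's sqlite3 default is 5.0 seconds.
DB_TIMEOUT_SECONDS = 50


def connect_db(db_path, timeout=DB_TIMEOUT_SECONDS):
    """Open a SQLite connection that waits up to `timeout` seconds
    for a lock to clear before raising "database is locked".

    Hypothetical helper for illustration, not the workflow's own function.
    """
    return sqlite3.connect(db_path, timeout=timeout)
```

With this, a slow file system has up to 50 seconds to release the lock before sqlite3 gives up, instead of the default 5.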