jakesylvestre / cing

Automatically exported from code.google.com/p/cing
0 stars 0 forks source link

keepLockFresh background process ends prematurely #331

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
The keepLockFresh background process ends prematurely and often before the 
command in the token has finished. The background process should 'refresh' a 
topos token lock. After a slave thread has picked up a token from the topos 
token pool, the keepLockFresh process is started to prevent other slave threads 
(that can exist anywhere) from picking up the token again and start doing the 
same task.

The bug is not specific for entries and is not correlated to a specific time 
(and probably also not to timeperiods). The problem has occurred in all of the 
VMs I have started up.

An example of a vCing slave thread log file (in VM$tmp/cingTmp/) is shown 
below. I have inserted comments to indicate what is happening.

================================================================================
======================
| CING: Common Interface for NMR structure Generation version 0.95 (r1194)      
AW,JFD,GWV 2004-2012 |
================================================================================
======================
User: i          on: localhost (linux2.6.38.8-tweak/32bit/8cores/py2.7.1+) at:  
(2009) Fri Nov 23 15:02:00 2012
Starting with <Instance of Vcing, address 205696428:
    name lockTimeOut=300
    name max_time_to_wait=31536000
    name max_time_to_wait_per_job=172800
    name time_sleep_when_no_token=300
>

Going to destination: runSlaveThread with(out) arguments []
Now starting runSlaveThread
DEBUG: In nextTokenWithLock doing [<TOPOSPROG> nextTokenWithLock <TOPOSPOOL> 
300]
###
### A token is found and background process 1 is created to keep the lock fresh 
every 151 seconds (lockTimeOut/2 +1)
###
DEBUG: Created a background process [2055] keeping the lock
DEBUG: Time is 2012-11-23 15:02:04.269737
DEBUG: refreshLock will now sleep for: 151
DEBUG: In getToken doing [<TOPOSPROG> getToken <TOPOSPOOL> 22867642]
DEBUG: Got token content on try: 1: [refineEntryNrg 2krk 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1]
In 0/1/1 (finished/total/iterations) got token 22867642 with lock: 
5ae3c102-3576-11e2-bc82-52540082c309
Found tokenContent: refineEntryNrg 2krk 9 http://<NMR>/NRG-CING/data 
i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 auto 0 0 1
DEBUG: Time is 2012-11-23 15:02:06.377634
Running payload cd /home/i/tmp/cingTmp; <SCRIPTS>refineEntry.py 2krk 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 > 22867642_refineEntryNrg_2krk_uQdZnF.log 2>&1
DEBUG: Time is 2012-11-23 15:02:06.378276
Doing 1 new process(es)
Doing command: cd /home/i/tmp/cingTmp; <SCRIPTS>refineEntry.py 2krk 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 > 22867642_refineEntryNrg_2krk_uQdZnF.log 2>&1
DEBUG: keepLockFresh doing a refreshLock
DEBUG: Time is 2012-11-23 15:04:43.861784
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
5ae3c102-3576-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
DEBUG: keepLockFresh doing a refreshLock
DEBUG: Time is 2012-11-23 15:07:37.314687
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
5ae3c102-3576-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
...
### 
### doing keepLockFresh doing refreshLocks
###
...
DEBUG: Time is 2012-11-23 19:59:59.524134
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
5ae3c102-3576-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
DEBUG: keepLockFresh doing a refreshLock
DEBUG: Time is 2012-11-23 20:02:30.756131
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
5ae3c102-3576-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 768 and result (if any) []
###
### After 5 hours, background process 1 finishes??? 
### The lock is lost, so that other slave threads can pick it up if they are 
not busy
###
In keepLockFresh got status: 768. This indicates that the token process 
finished.
###
### However, the calculation continues normally (more than 5.5 hours in this 
case), 
### PROVIDED another slave thread in the same VM picks up the same token and 
### start messing around with the same 2krk files!
###
Finished 1 out of the 1 processes successfully
Payload returned with status: 0 and log file size 198681
DEBUG: Time is 2012-11-24 01:38:29.329974
DEBUG: cmd: ssh i@<NMR> mkdir -p <D><SDIR><TOPOSPOOL>/log
DEBUG: cmd: rsync  -ave ssh 22867642_refineEntryNrg_2krk_uQdZnF.log 
i@<NMR>:<D><SDIR><TOPOSPOOL>/log/
DEBUG: Ending with slave MSG for token: 22867642 containing refineEntryNrg 2krk 
9 http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 with message: command_finished
DEBUG: In deleteToken doing [<TOPOSPROG> deleteToken <TOPOSPOOL> 22867642]
DEBUG: cmd: ssh i@<NMR> mkdir -p <D><SDIR><TOPOSPOOL>/log2
DEBUG: cmd: rsync  -ave ssh 22867642_refineEntryNrg_2krk_QFgNEJ.log 
i@<NMR>:<D><SDIR><TOPOSPOOL>/log2/
DEBUG: Time is 2012-11-24 01:38:54.932133, finished token
###
### Again a token is found and background process 2 is created to keep the lock 
fresh
###
DEBUG: In nextTokenWithLock doing [<TOPOSPROG> nextTokenWithLock <TOPOSPOOL> 
300]
DEBUG: Created a background process [4391] keeping the lock
DEBUG: Time is 2012-11-24 01:38:59.140133
DEBUG: refreshLock will now sleep for: 151
DEBUG: In getToken doing [<TOPOSPROG> getToken <TOPOSPOOL> 22867867]
DEBUG: Got token content on try: 1: [refineEntryNrg 2lqo 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1]
In 1/2/2 (finished/total/iterations) got token 22867867 with lock: 
54ba595e-35cf-11e2-bc82-52540082c309
Found tokenContent: refineEntryNrg 2lqo 9 http://<NMR>/NRG-CING/data 
i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 auto 0 0 1
DEBUG: Time is 2012-11-24 01:39:01.200132
Running payload cd /home/i/tmp/cingTmp; <SCRIPTS>refineEntry.py 2lqo 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 > 22867867_refineEntryNrg_2lqo_7srNK5.log 2>&1
DEBUG: Time is 2012-11-24 01:39:01.200132
Doing 1 new process(es)
Doing command: cd /home/i/tmp/cingTmp; <SCRIPTS>refineEntry.py 2lqo 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 > 22867867_refineEntryNrg_2lqo_7srNK5.log 2>&1
DEBUG: keepLockFresh doing a refreshLock
DEBUG: Time is 2012-11-24 01:41:30.242240
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
54ba595e-35cf-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
DEBUG: keepLockFresh doing a refreshLock
... 
###
### doing keepLockFresh doing refreshLocks
###
...
DEBUG: Time is 2012-11-24 08:32:22.632395
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
54ba595e-35cf-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
###
### Before we got status 768, the calculation finishes in time
###
Finished 1 out of the 1 processes successfully
Payload returned with status: 0 and log file size 64668
DEBUG: Time is 2012-11-24 08:34:46.868133
DEBUG: cmd: ssh i@<NMR> mkdir -p <D><SDIR><TOPOSPOOL>/log
DEBUG: cmd: rsync  -ave ssh 22867867_refineEntryNrg_2lqo_7srNK5.log 
i@<NMR>:<D><SDIR><TOPOSPOOL>/log/
DEBUG: keepLockFresh doing a refreshLock
DEBUG: Time is 2012-11-24 08:34:53.816241
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
54ba595e-35cf-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
DEBUG: Ending with slave MSG for token: 22867867 containing refineEntryNrg 2lqo 
9 http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 with message: command_finished
###
### The token is deleted
###
DEBUG: In deleteToken doing [<TOPOSPROG> deleteToken <TOPOSPOOL> 22867867]
DEBUG: cmd: ssh i@<NMR> mkdir -p <D><SDIR><TOPOSPOOL>/log2
DEBUG: cmd: rsync  -ave ssh 22867867_refineEntryNrg_2lqo_NJZQd4.log 
i@<NMR>:<D><SDIR><TOPOSPOOL>/log2/
DEBUG: Time is 2012-11-24 08:35:11.572131, finished token
###
### Again a token is found and background process 3 is created to keep the lock 
fresh
###
DEBUG: In nextTokenWithLock doing [<TOPOSPROG> nextTokenWithLock <TOPOSPOOL> 
300]
DEBUG: Created a background process [13820] keeping the lock
DEBUG: Time is 2012-11-24 08:35:15.720131
DEBUG: refreshLock will now sleep for: 151
DEBUG: In getToken doing [<TOPOSPROG> getToken <TOPOSPOOL> 22867910]
DEBUG: Got token content on try: 1: [refineEntryNrg 2rrn 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1]
In 2/3/3 (finished/total/iterations) got token 22867910 with lock: 
7bf36b0c-3609-11e2-bc82-52540082c309
Found tokenContent: refineEntryNrg 2rrn 9 http://<NMR>/NRG-CING/data 
i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 auto 0 0 1
DEBUG: Time is 2012-11-24 08:35:17.768133
Running payload cd /home/i/tmp/cingTmp; <SCRIPTS>refineEntry.py 2rrn 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 > 22867910_refineEntryNrg_2rrn_cjH8ZQ.log 2>&1
DEBUG: Time is 2012-11-24 08:35:17.768133
Doing 1 new process(es)
Doing command: cd /home/i/tmp/cingTmp; <SCRIPTS>refineEntry.py 2rrn 9 
http://<NMR>/NRG-CING/data i@<NMR>:<D>/NMR_REDO . . BY_CH23_BY_ENTRY CING 0 
auto 0 0 1 > 22867910_refineEntryNrg_2rrn_cjH8ZQ.log 2>&1
DEBUG: keepLockFresh doing a refreshLock
###
### After 7 hours, background process 2 still runs even though the token has 
been deleted.
###
DEBUG: Time is 2012-11-24 08:37:24.982822
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
54ba595e-35cf-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 768 and result (if any) []
In keepLockFresh got status: 768. This indicates that the token process 
finished.
###
### Process 3 also does refreshLock
###
DEBUG: keepLockFresh doing a refreshLock
DEBUG: Time is 2012-11-24 08:37:46.824134
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
7bf36b0c-3609-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 0 and result (if any) []
DEBUG: refreshLock will now sleep for: 151
DEBUG: keepLockFresh doing a refreshLock
... 
### doing keepLockFresh doing refreshLocks
...

Original issue reported on code.google.com by WGTouw on 24 Nov 2012 at 5:01

GoogleCodeExporter commented 8 years ago
Looks indeed like this is a topos related issue.
The lines you show at:

DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL> 
5ae3c102-3576-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 768 and result (if any) []
###

are really at one of the lowest levels and do don not depend on the other 
timings set in VC. At this point of debugging I suggest you verify with topos 
guys about this limit of 5 hours. I have done longer jobs than this before 
using Topos. Let me cc Victor on this as well as he might be using it currently 
too.

Victor, 
The question is if topos nowadays has an upper time limit itself for not 
allowing tokens to be locked beyond the 5 hours we see now.

Original comment by jurge...@gmail.com on 24 Nov 2012 at 7:51

GoogleCodeExporter commented 8 years ago
-- In vCing.py
        cmdTopos = ' '.join([self.toposProg, 'refreshLock', self.toposPool, lockname, repr(lockTimeOut)])
        nTdebug("In refreshLock doing [%s]" % cmdTopos)
        status, result = commands.getstatusoutput(cmdTopos)
        nTdebug("In refreshLock got status: %s and result (if any) [%s]" % (status, result))
-- In $C/scripts/vcing/topos/topos
EXIT_NOCOMMAND=-1
EXIT_MISSINGPARAM=1
EXIT_FILENOTFOUND=2
EXIT_CURLERROR=3

Most likely the value you (Wouter) sees is 768 / 256 (from 
commands.getstatusoutput translation) -> 3
A curl error so likely in the topos script:
  ${CURL} --request HEAD "${TOPOS_URL}pools/${poolName}/locks/${lockName}?timeout=${timeout}"
  if [ "$?" != "0" ]; then
    exit $EXIT_CURLERROR

Now a curl error could still be anything like bad network connection. 
Temporarily unavailable topos resource....

Original comment by jurge...@gmail.com on 24 Nov 2012 at 8:11

GoogleCodeExporter commented 8 years ago
Wouter, ask around at Sara if they are having more people with topos issues. 
Good luck.

Original comment by jurge...@gmail.com on 24 Nov 2012 at 8:15

GoogleCodeExporter commented 8 years ago
I'll fire up this discussion on tuesday when I have a meeting with people
from SARA.

Victor

Original comment by victor.d...@gmail.com on 24 Nov 2012 at 8:53

GoogleCodeExporter commented 8 years ago
Meanwhile Wouter might be able to check that the overcommitting that I put in 
place just before handing this over to Wouter doesn't cause it. Note that on a 
VM with 8 cores we issue 8 threads that EACH think they have 8 cores. The 
overcommit clearly stresses resources such as the network connections.

Original comment by jurge...@gmail.com on 25 Nov 2012 at 3:51

GoogleCodeExporter commented 8 years ago
Before I documented the current issue I had already disabled the overcommitting 
as this caused the system to start killing processes at random as it was out of 
memory. Thus, the current problem has a different cause.

Original comment by WGTouw on 25 Nov 2012 at 4:02

GoogleCodeExporter commented 8 years ago
The thread shows 8 cores. You only start one thread?

>linux2.6.38.8-tweak/32bit/8cores/py2.7.1+) 

Original comment by jurge...@gmail.com on 25 Nov 2012 at 4:20

GoogleCodeExporter commented 8 years ago
No, one thread per core. I.e. no overcommitting nor undercommitting.

Original comment by WGTouw on 25 Nov 2012 at 4:22

GoogleCodeExporter commented 8 years ago
The problem seems to have been fixed. SARA turned off a daily back-up. Initial 
test runs don't show any premature keepLockFresh stops.

Original comment by WGTouw on 28 Nov 2012 at 9:39

GoogleCodeExporter commented 8 years ago

Original comment by WGTouw on 28 Nov 2012 at 9:41

GoogleCodeExporter commented 8 years ago
Nice.
I hope they keep them off or find a way to do it no-blocking.

Original comment by jurge...@gmail.com on 28 Nov 2012 at 1:23