Looks indeed like this is a topos related issue.
The lines you show:
DEBUG: In refreshLock doing [<TOPOSPROG> refreshLock <TOPOSPOOL>
5ae3c102-3576-11e2-bc82-52540082c309 300]
DEBUG: In refreshLock got status: 768 and result (if any) []
###
are really at one of the lowest levels and do not depend on the other
timings set in VC. At this point in the debugging I suggest you check with the
topos people whether there really is a limit of 5 hours. I have run longer jobs
than this with topos before. Let me cc Victor on this as well, as he might be
using it currently too.
Victor,
The question is whether topos nowadays has an upper time limit of its own that
prevents tokens from being locked beyond the 5 hours we see now.
Original comment by jurge...@gmail.com
on 24 Nov 2012 at 7:51
-- In vCing.py
cmdTopos = ' '.join([self.toposProg, 'refreshLock', self.toposPool, lockname, repr(lockTimeOut)])
nTdebug("In refreshLock doing [%s]" % cmdTopos)
status, result = commands.getstatusoutput(cmdTopos)
nTdebug("In refreshLock got status: %s and result (if any) [%s]" % (status, result))
-- In $C/scripts/vcing/topos/topos
EXIT_NOCOMMAND=-1
EXIT_MISSINGPARAM=1
EXIT_FILENOTFOUND=2
EXIT_CURLERROR=3
Most likely the value you (Wouter) see, 768, translates to exit code 768 / 256 = 3
(commands.getstatusoutput returns the raw wait status, which is the exit code multiplied by 256).
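As a side note, a minimal sketch of that translation (Python 2, matching the commands module used in vCing.py; the 'exit 3' command here is just a stand-in for the failing topos call):
import commands
import os

status, result = commands.getstatusoutput('exit 3')  # stand-in for the topos refreshLock call
print status                  # 768: getstatusoutput returns the raw wait status, i.e. exit code * 256
print os.WEXITSTATUS(status)  # 3:   the actual exit code, i.e. EXIT_CURLERROR above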
That is EXIT_CURLERROR, so the failure most likely comes from this part of the topos script:
${CURL} --request HEAD "${TOPOS_URL}pools/${poolName}/locks/${lockName}?timeout=${timeout}"
if [ "$?" != "0" ]; then
    exit $EXIT_CURLERROR
fi
Now a curl error could still be caused by almost anything: a bad network connection,
a temporarily unavailable topos resource, and so on.
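If it does turn out to be transient, a simple retry around the refreshLock call might already paper over it. This is only a rough sketch, assuming the call is wrapped in a function that returns the shell exit code (refreshLockOnce below is hypothetical, and the retry count and sleep are arbitrary):
import time

EXIT_CURLERROR = 3

def refreshLockWithRetry(refreshLockOnce, nTries=3, sleepSeconds=10):
    # refreshLockOnce is assumed to run the topos refreshLock command and return its exit code.
    for _ in range(nTries):
        exitCode = refreshLockOnce()
        if exitCode != EXIT_CURLERROR:
            return exitCode       # success, or a non-curl error we should not mask
        time.sleep(sleepSeconds)  # give a flaky network or busy topos server some breathing room
    return exitCode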
Original comment by jurge...@gmail.com
on 24 Nov 2012 at 8:11
Wouter, ask around at SARA whether more people are having topos issues.
Good luck.
Original comment by jurge...@gmail.com
on 24 Nov 2012 at 8:15
I'll bring this discussion up on Tuesday, when I have a meeting with people
from SARA.
Victor
Original comment by victor.d...@gmail.com
on 24 Nov 2012 at 8:53
Meanwhile, Wouter might be able to check whether the overcommitting that I put in
place just before handing this over to him is causing it. Note that on a VM
with 8 cores we start 8 threads that EACH think they have 8 cores to themselves.
The overcommit clearly stresses resources such as the network connections.
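Just to illustrate the arithmetic (this is not the actual vCing code, only a sketch of how sizing every worker by cpu_count() multiplies the load):
import multiprocessing

nCores = multiprocessing.cpu_count()  # 8 on the VM in question
nWorkers = nCores                     # one worker thread per core
jobsPerWorker = nCores                # each worker also thinks it owns all 8 cores
print nWorkers * jobsPerWorker        # 64 concurrent jobs hammering the network instead of 8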
Original comment by jurge...@gmail.com
on 25 Nov 2012 at 3:51
Before I documented the current issue I had already disabled the overcommitting,
because it made the system start killing processes at random once it ran out of
memory. Thus, the current problem has a different cause.
Original comment by WGTouw
on 25 Nov 2012 at 4:02
The thread shows 8 cores. Do you start only one thread?
>linux2.6.38.8-tweak/32bit/8cores/py2.7.1+)
Original comment by jurge...@gmail.com
on 25 Nov 2012 at 4:20
No, one thread per core, i.e. neither overcommitting nor undercommitting.
Original comment by WGTouw
on 25 Nov 2012 at 4:22
The problem seems to have been fixed. SARA turned off a daily back-up. Initial
test runs don't show any premature keepLockFresh stops.
Original comment by WGTouw
on 28 Nov 2012 at 9:39
Nice.
I hope they keep it off, or find a way to run the back-up in a non-blocking fashion.
Original comment by jurge...@gmail.com
on 28 Nov 2012 at 1:23
Original issue reported on code.google.com by
WGTouw
on 24 Nov 2012 at 5:01