vision-dbms / vision

The master repository for the Vision database system.
https://vision-dbms.com
BSD 3-Clause "New" or "Revised" License
27 stars 12 forks source link

Multiple hard restart messages to pool can leave it unintentionally suspended #25

Closed MichaelJCaruso closed 7 years ago

MichaelJCaruso commented 7 years ago

Leslie asked me to take a look at a problem with hard restarts of release 8.1 pools. Specifically, if a succession of hard restart messages are sent to a running pool and those messages arrive faster than the pool can actually be restarted, the pool is inadvertently left in a suspended state instead of being 'restarted' as expected. I have confirmed this (mis)behavior and have tracked it to a fairly straightforward logic bug in a change to the implementation of hard restart in release 8.1.

In release 8.1, hard restart was changed to wrap its restart processing in a suspend/resume pair. While suspending a running pool is atomic and instantaneous, restarting one is not. To deal with that, a callback is set _(Vsa_VEvaluatorPool.cpp#736)_ that performs the needed restart wrap-up and resumes the pool after the restart processing has finished. _There is exactly one slot for remembering that callback (Vsa_VEvaluatorPool.cpp#794)._ Other than the routine embedded in the callback, there is no record of the fact that a resume must be run after the restart has completed. Hard restart messages received before the currently running hard restart has finished destroy that information, resulting in the observed bug.

Hard restart messages received by a pool that is already processing a hard restart follow a different path. Noting that the pool is already suspended, they do not attempt to wrap their restart processing in a suspend/resume pair. Instead, after initiating a restart, they register a callback _(Vsa_VEvaluatorPool.cpp#733)_ that performs the needed restart wrap-up and leaves the pool in the suspended state in which it was found. Note that there is only one slot for remembering that callback and that the callback that would have done the required resume has just been obliterated.

It seems to me that the crux of the problem is an attempt to encode state information in the opaque machinery of which callback must be run instead of adding a new explicit state that unambiguously identifies the internal suspension being performed. Shouldn't be hard - glad to talk about it further.

VCommitter commented 7 years ago

I don't think you mean adding a status to Vsa_VEvaluator.h#L97 and cover all of it's usage in the class hierarchy. m_iStatus lead me down this false path.

I propose a flag such as explicitlySuspended in VEvaluatorPool. The callback acts appropriately and we extend suspend() and resume() to set and unset explictlySuspended (respectively).

The only known unknown is how should the VEvaluatorPool behave if it gets a resume() while suspended due to a hardRestart()? Should it perform the BaseClass::resume() or just unset explicitlySuspended and wait for the callback to resume()?

Right now there is another related bug: if the VEvaluatorPool is suspended due to stacked hardRestart() it won't actually resume() until you send it an explicit suspend(). resume() fails with Error: Operation Failed until the explicit suspend() is sent. I haven't dug into this yet because I'm hoping fixing the first bug obviates the second.

@c-kuhlman how is poolTester looking?

MichaelJCaruso commented 7 years ago

Absent creating a more complex state management system (which doesn't seem like it's needed here), I agree that Vsa::VEvaluator cannot be the keeper of its subclass' states.

In terms of distinguishing internal and external suspensions, what if VEvaluator implements protected subclassSuspend and subclassResume methods for use by its subclasses and private internalSuspend and internalResume methods for its own use in addition to its current notions of external suspend and resume. It can then take responsibility for tracking suspend/resume by source (I'd prefer suspend counts but I think booleans are the best we can do for now). Unless I'm missing something, it may then be as simple of having each suspend/resume variant manage its own flag before calling the internal variant to enforce overall behavior.

VCommitter commented 7 years ago

I think if we override suspend() and resume() at VEvaluatorPool with a variant that takes a parameter (i.e. bool explict) we don't have to touch VEvaluator.

Unless you think explicit suspends are something worth pushing down to VEvaluator I'd prefer to keep them encapsulated within VEvaluatorPool.

MichaelJCaruso commented 7 years ago

As long as it handles the state interactions consistently with a minimum of added spaghetti. For example, a resume during hard restart should not actually resume the pool until the hard restart has finished (your earlier question).

MichaelJCaruso commented 7 years ago

Following up on my previous comment, two bits of information are needed to encode the 4 possible internal/external suspend/resume states possible for any evaluator. A good case can be made for either approach. Either way, VEvaluator's API still needs to provide basic support for maintaining those two bits of information.

VCommitter commented 7 years ago

I took a stab at implementing a fix in branch pool-unintentional-suspension-25 (f25814adb17a3e9ee3672a7bd1e2a076b90133e0). I didn't understand why you thought VEvaluator needed to be involved but rather than try to figure it out in comments I decided to take a stab at my solution and show instead of tell.

The code took about an hour. It took me the rest of the day reproduce this issue outside of our environment. I'm not quite sure why it was so difficult to reproduce but it finally started manifesting itself once I added in the csh wrapper for the batchvision which made timing easier to figure out (and might be the key to not having the worker start up instantaneously from vpool's perspective). Anyway, here is the reproducer and it's output after my fix:

src/master/src> cat reproducer.csh
\rm -rf DELETEME
mkdir DELETEME
setenv LogSwitchOn
echo "sleep 2" > DELETEME/workerSource.csh
echo "backend/batchvision" >> DELETEME/workerSource.csh
setenv WorkerSource "csh DELETEME/workerSource.csh"
setenv WorkerPIDQueryStr "Utility processId"
setenv WorkerStartupQuery 2+2
vcaservicemanager/vcaservicemanager -serverFile=DELETEME/serverFile.txt ./DELETEME/ epool -logFilePath=./DELETEME/poolLog.txt &
sleep 4
echo "THREE HARD RESTARTS (two is insufficient)"
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
echo -n "Status will be 'Suspended': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status
vsaprompt/vsaprompt -serverFile=DELETEME/serverFile.txt <<EOF
2+2
?g
EOF

echo -n "If broken we won't get this far': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stop
cat DELETEME/poolLog.txt

src/master/src> csh reproducer.csh
[1] 16084
THREE HARD RESTARTS (two is insufficient)
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status will be 'Suspended': Current Status                 : Suspended
V> V>      4.00
V> If broken we won't get this far': Current Status                 : Running
Message sent
08/22/2017:19:58:25.673 New offers (connections)...
08/22/2017:19:58:25.673 POOL STARTED
08/22/2017:19:58:25.673 Hiring worker(s) as necessary.
08/22/2017:19:58:25.673 Hiring new worker.
08/22/2017:19:58:27.740 Creating Worker 16109 [1] (Generation 1)
08/22/2017:19:58:27.740 Worker 1 created successfully
08/22/2017:19:58:27.740 Worker 16109[1] (Generation 1) Online: Employed
08/22/2017:19:58:27.740 Hiring worker(s) as necessary.
08/22/2017:19:58:29.676 HARD RESTARTING POOL
08/22/2017:19:58:29.677 SUSPENDING POOL
08/22/2017:19:58:29.677 Flushing Pool Workers
08/22/2017:19:58:29.677 Retiring Worker(1) (Generation 1)
08/22/2017:19:58:29.677 RESTARTING POOL
08/22/2017:19:58:29.677 Hiring worker(s) as necessary.
08/22/2017:19:58:29.677 RESUMING POOL
08/22/2017:19:58:29.677 Hiring worker(s) as necessary.
08/22/2017:19:58:29.677 Hiring new worker.
08/22/2017:19:58:29.679 Zero offers (connections)...
08/22/2017:19:58:29.720 HARD RESTARTING POOL
08/22/2017:19:58:29.720 SUSPENDING POOL
08/22/2017:19:58:29.720 Flushing Pool Workers
08/22/2017:19:58:29.721 Zero offers (connections)...
08/22/2017:19:58:29.760 HARD RESTARTING POOL
08/22/2017:19:58:29.760 Flushing Pool Workers
08/22/2017:19:58:29.761 Zero offers (connections)...
08/22/2017:19:58:29.801 Zero offers (connections)...
08/22/2017:19:58:29.841 Request(0) (With valid generation 2) arrived: 2+2

08/22/2017:19:58:31.742 Creating Worker 16283 [2] (Generation 2)
08/22/2017:19:58:31.742 Worker 2 created successfully
08/22/2017:19:58:31.742 RESTARTING POOL
08/22/2017:19:58:31.742 Hiring worker(s) as necessary.
08/22/2017:19:58:31.742 RESUMING POOL
08/22/2017:19:58:31.742 Hiring worker(s) as necessary.
08/22/2017:19:58:31.742 Hiring new worker.
08/22/2017:19:58:31.742 Retiring Worker(2) (Generation 2)
08/22/2017:19:58:31.743 Worker 16283[2] (Generation 2) Online: Retired
08/22/2017:19:58:31.743 Hiring worker(s) as necessary.
08/22/2017:19:58:33.805 Creating Worker 16324 [3] (Generation 3)
08/22/2017:19:58:33.805 Worker 3 created successfully
08/22/2017:19:58:33.805 Worker 16324[3] (Generation 3) Online: Employed
08/22/2017:19:58:33.805 Hiring worker(s) as necessary.
08/22/2017:19:58:33.805 Worker(3) Generation (3) processing query (0).
08/22/2017:19:58:33.805 Hiring worker(s) as necessary.
08/22/2017:19:58:33.805 Worker(3) (Generation 3) returned:      4.00
08/22/2017:19:58:33.805 Worker 16324[3] (Generation 3) Online: Employed
08/22/2017:19:58:33.806 Hiring worker(s) as necessary.
08/22/2017:19:58:33.806 Zero offers (connections)...
08/22/2017:19:58:33.847 Zero offers (connections)...
08/22/2017:19:58:33.886 Gracefully stopping server.....
08/22/2017:19:58:33.887 Zero offers (connections)...
08/22/2017:19:58:33.887 HARD STOPPING POOL
08/22/2017:19:58:33.887 Retiring Worker(3) (Generation 3)
08/22/2017:19:58:33.889 Pool Application STOP - Completed.

And the test I came up with to verify that suspending explicitly would keep the hardRestart from resuming the pool:

src/master/src> cat explicitly.csh
\rm -rf DELETEME
mkdir DELETEME
setenv LogSwitchOn
echo "sleep 2" > DELETEME/workerSource.csh
echo "backend/batchvision" >> DELETEME/workerSource.csh
setenv WorkerSource "csh DELETEME/workerSource.csh"
setenv WorkerPIDQueryStr "Utility processId"
setenv WorkerStartupQuery 2+2
vcaservicemanager/vcaservicemanager -serverFile=DELETEME/serverFile.txt ./DELETEME/ epool -logFilePath=./DELETEME/poolLog.txt &
sleep 3
echo "HARD RESTARTS AND a suspend"
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -suspend
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
echo -n "Status should be 'Suspended': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

sleep 5
echo -n "Status should be 'Suspended': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -resume
vsaprompt/vsaprompt -serverFile=DELETEME/serverFile.txt <<EOF
2+2
?g
EOF
echo -n "Status should be 'Running': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

echo "HARD RESTARTS and a suspend followed by a resume"
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -suspend
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -resume
vsaprompt/vsaprompt -serverFile=DELETEME/serverFile.txt <<EOF
2+2
?g
EOF
echo -n "Status should be 'Running': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

echo "HARD RESTARTS and suspends followed by resumes"
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -suspend
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -suspend
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -resume
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -suspend
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -resume
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -resume
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vsaprompt/vsaprompt -serverFile=DELETEME/serverFile.txt <<EOF
2+2
?g
EOF
echo -n "Status should be 'Running': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stop
cat DELETEME/poolLog.txt

src/master/src> csh explicitly.csh
[1] 25479
HARD RESTARTS AND a suspend
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status should be 'Suspended': Current Status                 : Suspended
Status should be 'Suspended': Current Status                 : Suspended
Operation Succeeded
V> V>      4.00
V> Status should be 'Running': Current Status                 : Running
HARD RESTARTS and a suspend followed by a resume
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
V> V>      4.00
V> Status should be 'Running': Current Status                 : Running
HARD RESTARTS and suspends followed by resumes
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Error: Operation Failed
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Error: Operation Failed
Operation Succeeded
Operation Succeeded
Operation Succeeded
V> V>      4.00
V> Status should be 'Running': Current Status                 : Running
Message sent

Finally, this script reproduces the original problem and confirms the workaround is working.

src/master/src> cat workaround.csh
\rm -rf DELETEME
mkdir DELETEME
setenv LogSwitchOn
echo "sleep 2" > DELETEME/workerSource.csh
echo "backend/batchvision" >> DELETEME/workerSource.csh
setenv WorkerSource "csh DELETEME/workerSource.csh"
setenv WorkerPIDQueryStr "Utility processId"
setenv WorkerStartupQuery 2+2
vcaservicemanager/vcaservicemanager -serverFile=DELETEME/serverFile.txt ./DELETEME/ epool -logFilePath=./DELETEME/poolLog.txt &
sleep 4
echo "THREE HARD RESTARTS (two is insufficient)"
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -hardRestart
echo -n "Status will be 'Suspended': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

sleep 10
echo -n "If broken Status will be 'Suspended': " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status

echo "RUNNING WORKAROUND"
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -suspend
vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -resume
echo -n "Should be 'Running' now: " && vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stat | grep Status
vsaprompt/vsaprompt -serverFile=DELETEME/serverFile.txt <<EOF
2+2
?g
EOF

vpooladmin/vpooladmin -serverFile=DELETEME/serverFile.txt -stop
cat DELETEME/poolLog.txt
src/master/src> csh workaround.csh
[1] 35147
THREE HARD RESTARTS (two is insufficient)
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status will be 'Suspended': Current Status                 : Suspended
If broken Status will be 'Suspended': Current Status                 : Running
RUNNING WORKAROUND
Operation Succeeded
Operation Succeeded
Should be 'Running' now: Current Status                 : Running
V> V>      4.00
V> Message sent

@c-kuhlman I'm pretty sure this is the kind of test you can put in poolTester but I didn't want to step on your toes while I fumbled around there and tried to figure it out so I just used these simple csh scripts.

@MichaelJCaruso let me know what you think about this attempt at a fix. It's my first Vsa functionality change.

MichaelJCaruso commented 7 years ago

I didn't understand why you thought VEvaluator needed to be involved ...

I just figured out why I didn't understand why you thought it didn't :-).

Digging around a bit, I just noticed that VEvaluatorPool re-implements the entire IEvaluatorControl interface, an interface already implemented by VEvaluator. It should have been written to rely on virtual overrides in some form to guarantee consistent behavior between the methods of VEvaluator and its VEvaluatorPool subclass. That's a significant logic bug - possibly of my own doing in the dim and distant past. Of course, because the only current source of suspend and resume calls are via IEvaluatorControl operations as separately re-implemented by VEvaluatorPool, what you're doing works; however, it doesn't cover all the ways suspend and resume can be called on the classes involved. Covering those bases requires involvement - in some form - of both classes.

VCommitter commented 7 years ago

it doesn't cover all the ways suspend and resume can be called on the classes involved.

That isn't necessarily a problem so long as the other paths through suspend and resume can be considered explicit. Which I'm OK with for now if this fix corrects the bug without introducing incorrect behavior.

It should have been written to rely on virtual overrides in some form to guarantee consistent behavior between the methods of VEvaluator and its VEvaluatorPool subclass.

I am curious though: are you saying VEvaluator::suspend() and VEvaluator::resume() (and the other control methods) should be virtual?

Or are we talking about Vca interface overrides - I've forgotten more than I remember about how those are supposed to work...

MichaelJCaruso commented 7 years ago

We can certainly cover this bug fix with what you've done, but in that case, I suggest we leave a prominent note in the header files (most importantly VEvaluator.h) calling attention to the possibility of inconsistencies when calling the public control methods directly. It might even be a good idea to make those routines protected to prevent their explicit misuse. That might be enough of a note to tie to the can we now know about and kick down the road.

The rest of my concerns deal with interface consistency and the latent bugs it can cause. Leaving the Vca interfaces out of it for a moment, the public methods of VEvaluator and VEvaluatorPool present two interfaces to the same object. Given a pointer pEvaluator to a VEvaluatorPool, pEvaluator->suspend () and static_cast<VEvaluator*>(pEvaluator)->suspend() should produce the same result, ideally in an obviously auditable way (even though the cast is explicit here, it's implicit most of the time). Same goes for the operations exposed via IEvaluatorControl (which we have don't have as much of a problem with given that Vca ensures that VEvaluatorPool's implementation dominates in all cases, at least as seen by remote callers).

Although I wasn't trying to pick an implementation approach, I think it's almost certainly a good idea to make the control methods virtual given that they are implemented differently in VEvaluatorPool. When making suspend and resume virtual, note that they should not be given an optional boolean to indicate their internal/external origin. Doing so changes their meaning in a way that gives public callers the opportunity to masquerade as insiders.

To dot the i's and cross the t's, it probably also makes sense to delete implementations of IEvaluatorControl methods at VEvaluatorPool that are identical to what can be already be inherited from VEvaluator.

I don't know if this fixes all of the latent issues, but I think it covers some useful ground.

VCommitter commented 7 years ago

That sounds like an extensive refactor of some heavily used functionality. I'm certainly not going to advocate for changes at this scale to fix this small but troublesome bug; especially before the poolTester suite is back up and running.

Perhaps the VEvaluator and VEvaluatorPool refactor/cleanup should be spun off into it's own issue?

MichaelJCaruso commented 7 years ago

Sounds like a reasonable plan. If nothing else, we've got a better handle on how this code works and where it's fragile.

c-kuhlman commented 7 years ago

I agree. This has been this way FOREVER. I don't think I was aware of the VEvaluator level implementations until vcatool came out, and I saw that some of the "pool" operations were available (like suspend and resume). I thought it was a bit odd, but it had been that way FOREVER even back then -- so I ignored it.

I am about to issue a couple of pull requests -- one for 8.0 and one for 8.1. They entail a minor bug fix in batchvision. The bug showed up when running the testdeck. I fixed it in 8.0 and then merged it into my 8.1 branch. In 8.1, I also added the -xmemalign compiler option which I mentioned before. It is needed in 8.1 on sparc, but not in 8.0 -- haven't delved into why.

Anyway, the poolTester updates will follow shortly. I have been playing with poolTester today, and will attempt to get a reproducer for the hardRestart issue. There is a test already called fastHardRestart, but it is testing a different issue -- I don't recall what. I will add another aptly named test if I am successful in reproducing it.

MichaelJCaruso commented 7 years ago

Sad but true. The legacy nature of the bug makes it harder to address and almost more of a feature than a bug. Still, it also makes the code more fragile under modification (and use). On that front, I don't mind advocating for theory of operation and clean, consistent interfaces.

In terms of our proposed fix, it looks like there's a problem of a different sort. There's no logic in resume (bool explicit) that detects and respects any internal suspension that might currently be in effect. Assuming there's a good reason for the internal suspension initiated by hard restart, it should be honored until explicitly cleared by its internal originator and not subject to external cancellation. That does not appear to be the case. Am I missing something?

Note that a similar situation exists in suspend (bool explicit); however, in that case, the failure to note that an internal suspension is already in effect results in the issuance of multiple base class suspend calls - a presumably less significant issue.

VCommitter commented 7 years ago

I like that implementation idea; I think it will wind up cleaner. I'll try it out.

VCommitter commented 7 years ago

7ca7cf4ca749883c5687a79f459a60ef3d6b36a4 encapsulates the explicit vs internal logic in the suspend and resume methods.

MichaelJCaruso commented 7 years ago

Not quite there yet. Suppose the following sequence of events occur in a running pool:

Hard Restart (implicit suspend)
Suspend (explicit suspend)
Resume (explicit resume)

Is the pool running or suspended? My read says running:

Hard Restart (implicit suspend) ... suspend pool, leave explicitly == false
Suspend (explicit suspend)      ... set explicitly = true, note pool is already suspended
Resume (explicit resume)        ... set explicitly = false, note explicitly == false, resume pool

It should be suspended.1

I don't think you can do this without two bits of information. The bit recording the presence of an explicit suspend you already have. You need a second bit to record that an internal (implicit) suspend has been done. You can't actually do the resume until both bits are clear (the gist of this previous comment)

1 The order of the hard restart and explicit suspend can be reversed. The outcome will be the same.

VCommitter commented 7 years ago

I see. You're still proposing to change more behavior than I intend to. There is only one change I intend to make in the pool behavior:

Behaviors that I intend to keep are:

I want to keep the explicit as an override of internal suspend because that is how it works up until now AND this behavior was how we were able to get stuck 8.1 pools running after they got stuck. If an explicit suspend()/resume() didn't override an implicit resume those pools could not be un-stuck.

I think this is good to go unless there is another bug in the way Suspended is being used here that I don't have the context to be aware of.

8.0 Reference:

% csh behavior.csh
VISION VERSION: AUDITINFO:Source:git:work-tree=/home/user/tkowalczyk/tmp/vision-open-source,branch=release-8.0,commit=5b9770f154381eb85c544ddaf27adb3a9ee04157,differences=../DELETEME/batchvisionVca.log--../DELETEME/poolLog.txt--../DELETEME/serverFile.txt--../DELETEME/...
[1] 27021

simple hardRestart suspend resume
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status will be 'Running': Current Status                 : Running
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

simple suspend resume hardRestart hardRestart
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status will be 'Suspended': Current Status                 : Suspended
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

suspend hardRestart - should stay suspended
Operation Succeeded
Operation Succeeded
Status SHOULD be 'Suspended': Current Status                 : Running
sleep 5
Status SHOULD be 'Suspended': Current Status                 : Running

resume - no more suspended
Operation Succeeded
Status SHOULD be 'Running': Current Status                 : Running

3 x hardRestart - don't get stuck suspended...
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status SHOULD be 'Suspended': Current Status                 : Suspended
sleep 5
Status SHOULD be 'Running': Current Status                 : Running

3 x hardRestart - and a suspend resume workaround to unstuck the pool
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status SHOULD be 'Running': Current Status                 : Running
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

hardRestart, hardRestart and resume - resume should WIN but never does at first?
Operation Succeeded
Operation Succeeded
Error: Operation Failed
Status SHOULD be 'Running': Current Status                 : Suspended
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

Message sent

pool-unintentional-suspension-25 behavior:

% csh behavior.csh
VISION VERSION: AUDITINFO:Source:git:work-tree=/home/user/tkowalczyk/tmp/vision-open-source,branch=pool-unintentional-suspension-25,commit=7ca7cf4ca749883c5687a79f459a60ef3d6b36a4
[1] 63235

simple hardRestart suspend resume
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status will be 'Running': Current Status                 : Running
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

simple suspend resume hardRestart hardRestart
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status will be 'Suspended': Current Status                 : Suspended
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

suspend hardRestart - should stay suspended
Operation Succeeded
Operation Succeeded
Status SHOULD be 'Suspended': Current Status                 : Suspended
sleep 5
Status SHOULD be 'Suspended': Current Status                 : Suspended

resume - no more suspended
Operation Succeeded
Status SHOULD be 'Running': Current Status                 : Running

3 x hardRestart - don't get stuck suspended...
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status SHOULD be 'Suspended': Current Status                 : Suspended
sleep 5
Status SHOULD be 'Running': Current Status                 : Running

3 x hardRestart - and a suspend resume workaround to unstuck the pool
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Operation Succeeded
Status SHOULD be 'Running': Current Status                 : Running
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

hardRestart, hardRestart and resume - resume should WIN but never does at first?
Operation Succeeded
Operation Succeeded
Error: Operation Failed
Status SHOULD be 'Running': Current Status                 : Suspended
V> V>      4.00
V> Status will be 'Running': Current Status                 : Running

Message sent
MichaelJCaruso commented 7 years ago

You guys have to keep this thing running, but, by way of last word...

ARRRRRGH !!!!

VCommitter commented 7 years ago

Fixed by pull #26