Closed GoogleCodeExporter closed 8 years ago
------- Comment #16 From Bob Marinier 2007-08-21 18:32:09 (-) [reply] -----
--
The problem is not tied to rete-net; an independent init-soar will cause
the
problem. (Also, in case there is still some doubt, we removed all
callbacks
and the problem still persisted).
I found that JavaTOH appears to work fine, but it will also fail if you do
an
init-soar before running (e.g., by clicking the reset button or by
connecting a
debugger, doing an init-soar, and then disconnecting).
Here are some cases (these are much easier to do with a debug build; the
release build finishes too fast):
1) Start JavaTOH, run. Connect debugger. (succeeds)
2) Start JavaTOH, reset, run. Connect debugger. (fails)
3) Start JavaTOH, connect debugger, init-soar from debugger any number of
times. Disconnect debugger, run. Connect debugger. (fails)
For the last case, it doesn't actually matter if you disconnect the
debugger.
Leaving it connected and trying to connect a second debugger also fails.
------- Comment #17 From Douglas Pearson 2007-08-21 18:58:21 (-) [reply] --
-----
Wow, that's a pretty nasty bug and could have been there for a while I
guess as
the normal test is to launch the environment, run the agent and connect the
debugger and it sounds like that still works.
Putting together all of these bits of evidence it sounds to me like there's
a
problem in one of the internal event handlers (i.e. inside the C++) and
after
an init-soar it's causing that event handler to hang during the next run.
I can't think of any easy way to figure out where it's stuck. There's not
a
ton of internal callbacks -- off the top of my head I can think of print
and
XML handlers, run scheduler, input/output callbacks and init-soar
listeners.
Removing all of the callbacks from the environment was a nice step as that
rules out a lot more possible cases.
A couple more questions:
-- Just to be clear, after those init-soar trials, there's no problem
connecting the debugger if the agent's not actively running, right?
-- Can you see if the receiver thread is already locked up after the init-
soar
but before the next run? Also, what if you just do a "run 10" and then
connect
the debugger after that. Is it OK in those cases? I'm assuming the answer
will be yes, which unfortunately makes this much harder to solve.
------- Comment #18 From Bob Marinier 2007-08-21 19:05:33 (-) [reply] -----
--
> A couple more questions:
> -- Just to be clear, after those init-soar trials, there's no problem
> connecting the debugger if the agent's not actively running, right?
Right. In fact, as soon as the agent stops, a "hung" debugger will
successfully connect (even if Windows says it's not responding).
> -- Can you see if the receiver thread is already locked up after the
init-soar
> but before the next run? Also, what if you just do a "run 10" and then
connect
> the debugger after that. Is it OK in those cases? I'm assuming the
answer
> will be yes, which unfortunately makes this much harder to solve.
I'm not sure how to check if the receiver thread is already locked up.
Unfortunately, I'm leaving for the rest of the evening right now, but I'll
check these things either later tonight or first thing in the morning.
------- Comment #19 From Douglas Pearson 2007-08-21 19:15:17 (-) [reply] --
-----
I think if you attach the debugger (Visual C++) and set a break point in
the
ReceiverThread::run() method you should be able to see if it's still
looping or
if that thread is stuck in a callback.
If it's stuck in a callback before the run starts that would be great as we
could figure out where it's stuck. Much harder to do once the run begins
and
there's lots of traffic.
Another test we could try is connecting something simpler through the
remote
path--e.g. a simple C++ test program that just called
CreateRemoteConnection().
If that works it would focus us on the events that the debugger listens to
as
it connects up.
Similarly, we could try connecting a debugger with just a single text
window --
that reduces the number of events involved in the remote connection.
------- Comment #20 From Dave Ray 2007-08-21 20:10:43 (-) [reply] -------
I'm sorry I've caused so much trouble here :)
The funny thing is that this bug doesn't even seem to be affecting us.
Someone
came to me reporting that the Java debugger was locking up when connecting
to a
running agent. I tried it with TankSoar and thought I had reproduced it,
but I
guess I found a different bug. The original bug is intermittent and seems
related to threading or something.
Anyway, when the debugger is frozen you should be able to attach visual
studio
to the java process and hit break to stop all the threads. Then you can
poke
around the stack trace of all the threads and see where it's waiting. I
used
this approach to track down a similar bug (SML call not returning) that
turned
out to be a bizarre Tcl problem.
------- Comment #21 From Bob Marinier 2007-08-22 09:15:02 (-) [reply] -----
--
It appears that, regardless of whether an init-soar has been done or not,
the
receiver thread is not running during a run. Before and after a run it is
running fine (regardless of whether an init-soar has been done). As per
Jon's
and my original analysis, it appears that the run command is executed from
the
receiver thread, and since it is blocking, the receiver thread stops during
the
run.
But then how does the debugger connect at all in some cases? There must be
another thread involved somewhere.
Following Dave Ray's suggestion (I didn't even know there was a threads
window
in Visual Studio), I see that the event thread is also not running (it is
stopped on the line soar_thread::Lock lock(&m_ClientMutex) in
EmbeddedConnectionAsynch::ReceiveMessages. The function
EmbeddedConnectionAsynch::GetResponseForID is also stuck in the while
(wait)
loop.
However, both of these appear to be stuck in the same way even when the
debugger can connect, so I don't think either of these is the problem.
I'm going to put together a minimal example for us to work with.
------- Comment #22 From Bob Marinier 2007-08-22 10:00:43 (-) [reply] -----
--
Ok, here's my minimal example:
#include "sml_Client.h"
using namespace sml;
void main()
{
Kernel* pKernel = Kernel::CreateKernelInNewThread();
Agent* pAgent = pKernel->CreateAgent("test");
pAgent->ExecuteCommandLine("waitsnc -e");
pAgent->InitSoar(); // remove this line to make things work
pKernel->RunAllAgentsForever();
pKernel->Shutdown();
delete pKernel;
}
There are only 4 threads running (in JavaTOH, there were a lot more because
Java itself has a ton of threads):
1) main thread, which is stuck on the line RunAllAgentsForever in the above
(as
expected).
2) ListenerThread, which spends most of it's time on the Sleep line in the
while loop (ListenerThread::Run). This thread shouldn't matter since, even
in
the cases where things hang, we get a new connection.
3) ReceiverThread, which is running the agent. This is looping in the run
scheduler, so the thread itself is stopped and not processing any more
commands.
4) EventThread, which is stuck on the line soar_thread::Lock
lock(&m_ClientMutex) ; in EmbeddedConnectionAsynch::ReceiveMessages. Note
that
this function is also being called repeatedly by the main thread, which is
not
stuck (the function returns successfully when called from the main thread).
So what difference does the init-soar make? Without the init-soar, the
ReceiverThread keeps running, whereas with the init-soar, it does not. If
I
put a breakpoint in EmbeddedConnectionAsynch::ReceiveMessages, it
constantly
gets hit either way. However, without the init-soar, it is being hit by
calls
from the receiver thread and the main thread, whereas with the init-soar,
all
the hits come from the main thread (the receiver thread is stuck on the
lock).
The callstack on the main thread looks like this:
RunAllAgentsForever
ExecuteCommandLine
SendAgentCommand
SendMessageGetResponse
GetResponse
EmbeddedConnectionAsynch::GetResponseForID
EmbeddedConnectionAsynch::ReceiveMessages
It looks like its waiting for the run call to return (which it never wil),
and
the loop that does that checks for new messages. What still isn't clear to
me
is, if the main thread is checking for messages, then why can't the
debugger
connect? It seems it can only connect if the receiver thread is also
checking
for messages. Perhaps the messages the main thread gets are not actually
processed, but are merely put in a queue?
------- Comment #23 From Douglas Pearson 2007-08-22 12:28:52 (-) [reply] --
-----
So first up, let me address the event thread. That's stuck waiting for the
duration of the run intentionally. The client either processes messages
from
its main thread or from the event thread -- but not both at once. When the
run
call blocks, it takes over the job of dispatching events to the appropriate
handlers while waiting for the run to complete. While it's doing this it
locks
out the event thread. If it didn't the client could see two events
occurring
at once--making things more complex at the client level.
So the event thread being blocked is fine and as expected.
The listener thread, as you described, just handles the incoming socket
connections and then add the new connection into a list for processing, so
it's
not a key player.
Alas, I don't remember now exactly what the call sequence should be when
the
debugger connects, but I'm sure the key is still going to be finding out
where
that receiverthread is blocked. I think it must be in an event handler
somewhere that is failing to return--presumably because it's stuck behind a
mutex that the main thread owns. But why that changes after an init-soar
is
anybody's guess.
I can try running the minimal example and trying to trace this myself but
that
won't happen for a few days.
------- Comment #24 From Bob Marinier 2007-08-22 14:39:20 (-) [reply] -----
--
To cut a long story short, the problem is in this gSKI function:
void InputWorkingMemory::Reinitialize()
{
if ( m_rootInputObject != 0 ) {
m_rootInputObject->Release();
m_rootInputObject = 0;
}
ReleaseAllWmes();
ReleaseAllWMObjects();
ReleaseAllSymbols();
}
The callstack that leads here looks like this:
CommandLineInterface::DoInitSoar
gSKI::Agent::Reinitialize
reinitialize_soar
clear_goal_stack
do_input_cycle
soar_invoke_callbacks (the one near the top)
gSKI::InputLink::InputPhaseCallback
gSKI::InputLink::FinalUpdate
gSKI::InputLink::Reinitialize
gSKI::InputWorkingMemory::Reinitialize
To make the problem go away, I have to comment out lines so the function
looks
like this:
void InputWorkingMemory::Reinitialize()
{
if ( m_rootInputObject != 0 ) {
//m_rootInputObject->Release();
//m_rootInputObject = 0;
}
ReleaseAllWmes();
//ReleaseAllWMObjects();
ReleaseAllSymbols();
}
If any of those lines are left in, either the problem persists or there's a
crash. Amazingly, leaving in the m_rootInputObject = 0; line causes the
problem to persist.
I suspect that m_rootInputObject is released twice (once in the function
above,
and once in ReleaseAllWMObjects). That in and of itself probably isn't an
error, but maybe something is waiting on this object, and if it isn't fully
released, something gets stuck?
The upshot may be that, once gSKI is out, maybe this problem will go away.
To
test this, I tried the simple test program on the gSKI removal branch.
Unfortunately, it has exactly the same problem, and exactly the same
changes
make it go away (so I guess gSKI is not entirely out of the loop yet).
Any ideas on what to do next?
PS I'm going out of town in a few hours for the rest of the week, but I
expect
to be able to check email while I'm gone.
------- Comment #25 From Douglas Pearson 2007-08-22 15:19:42 (-) [reply] --
-----
That's great going Bob to track it that far.
I think you're right that we should just wait for gSKI to actually be
pulled
and then see if the problem survives.
gSKI still exists in a parallel universe right now on the gSKI removal
branch,
with a lot of the startup and shutdown still happening--just none of the
real
work going through it any more.
Hopefully when it's finally extracted this problem will just go away.
------- Comment #26 From Dave Ray 2007-08-22 15:33:21 (-) [reply] -------
I'm content with waiting for the gSKI removal branch. This bug isn't a
showstopper for us at this point.
------- Comment #27 From Bob Marinier 2007-08-28 11:41:25 (-) [reply] -----
--
I noticed that, at least with Soar2D, nothing shows up on the input-link at
all
if I make the change described in comment 24. So it's possible that these
lines are not the root of the problem, but merely mask it by eliminating
the
input-link.
Original comment by voigtjr@gmail.com
on 23 Jul 2009 at 4:39
Original comment by voigtjr@gmail.com
on 23 Jul 2009 at 5:29
Original comment by voigtjr@gmail.com
on 23 Feb 2010 at 7:39
Original comment by voigtjr@gmail.com
on 3 Mar 2010 at 4:05
This got fixed at some point, just tested it and it works.
Original comment by voigtjr@gmail.com
on 15 Mar 2010 at 8:09
Original issue reported on code.google.com by
voigtjr@gmail.com
on 23 Jul 2009 at 4:39