sleyzerzon / soar

Automatically exported from code.google.com/p/soar
1 stars 0 forks source link

Debugger freezes when connecting to running kernel #23

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Description From Dave Ray 2007-08-15 11:44:36 (-) [reply]
Steps to reproduce:

* Run TankSoar and create a new tank with obscure-bot.soar and no debugger
* Hit run to start the TankSoar sim
* Run the java debugger and connect to the running kernel in TankSoar.
* The debugger will freeze until TankSoar is stopped from the TankSoar
interface.

This is a problem if you have Soar embedded in a free-running program 
without
start and stop controls. The debugger should be able to connect to a 
running
kernel and inspect the agents as they run without freezing.
------- Comment #1 From Douglas Pearson 2007-08-15 12:10:23 (-) [reply] ---
----
Back with the 8.6.0/8.6.1 releases we did a good deal of testing of these 
sorts
of connecting and disconnecting from running kernels, so this should be
possible.

I'd guess that there's been a change somewhere causing the debugger to get
stuck behind a mutex.  Either that or it could be just getting starved if 
your
CPU is pegged to 100% during the tank soar run, but that seems less likely.
------- Comment #2 From Bob Marinier 2007-08-15 14:01:28 (-) [reply] ------
-
I've gone back to earlier versions and tested this with the following 
results:

8.6.1
Eaters (Tcl version): works fine
There was no tanksoar for 8.6.1

8.6.2
Eaters: works fine
TankSoar: doesn't work

8.6.3 (Java Soar2D version)
Eaters: doesn't work
TankSoar: doesn't work

Trunk:
Eaters: doesn't work
TankSoar: doesn't work

It's pretty interesting that one works and one doesn't in 8.6.2.

As for cpu starvation, in all cases, the environment only uses around 60% 
of my
cpu.  In the trunk version (release build), the debugger uses around 20% 
(even
though it's not visibly doing anything).  In the older versions it uses 0%.
------- Comment #3 From Jonathan Voigt 2007-08-15 16:33:29 (-) [reply] ----

---
This call is hanging. I am investigating.

SoarJavaDebugger:Document.java:662: m_Kernel =
Kernel.CreateRemoteConnection(true, ipAddress, portNumber, ignoreOutput) ;
------- Comment #4 From Jonathan Voigt 2007-08-16 10:49:06 (-) [reply] ----

---
Things behave even worse if the kernel is created in current thread: 

kernel = Kernel.CreateKernelInCurrentThread("SoarKernelSML", true);

The debugger will not connect if that is how the kernel is created.
------- Comment #5 From Jonathan Voigt 2007-08-16 11:05:04 (-) [reply] ----

---
I think this might have something to do with it:

// While this thread is alive, keep checking for incoming messages
// and executing them.  These messages could be coming in from a remote 
socket
// or from an embedded connection.  If we're executing on the client thread
// for an embedded connection (the "sycnhronous" model) then this thread is
shut down
// and it's the client's responsibility to check for incoming messages by
calling
// "GetIncomingCommands" periodically.
------- Comment #6 From Jonathan Voigt 2007-08-16 11:07:07 (-) [reply] ----

---
The GetIncomingCommands method only needs to be called when
CreateKernelInNewThread is used to create the kernel.
------- Comment #7 From Dave Ray 2007-08-16 11:09:56 (-) [reply] -------
(In reply to comment #6)
> The GetIncomingCommands method only needs to be called when
> CreateKernelInNewThread is used to create the kernel.
> 

If you use CreateKernelInCurrentThread() then you have to manually poll the
network connection with CheckForIncomingCommands().
------- Comment #8 From Jonathan Voigt 2007-08-16 11:14:40 (-) [reply] ----

---
I said that backwards. Calling GetIncomingCommands is only necessary when 
using
CreateKernelInCurrentThread.
------- Comment #9 From Jonathan Voigt 2007-08-16 13:01:35 (-) [reply] ----

---
The new debugger connection is not getting added to the list of connections
checked by the receiver thread until after soar is stopped.
------- Comment #10 From Jonathan Voigt 2007-08-16 13:10:12 (-) [reply] ---
----
The previous comment was wrong.

ConnectionManager::ReceiveAllMessages() is not getting called during a run.
------- Comment #11 From Bob Marinier 2007-08-16 14:14:33 (-) [reply] -----
--
Jon and I have been working on this, and it turns out that you don't even 
need
the debugger to cause a problem, if you know where to look.

The issue is that, when Soar is running, the ReceiverThread stops checking 
for
messages.  Specifically, this is the callstack as far as we've traced it:

ReceiverThread::Run
ConnectionManager::ReceiveAllMessages
EmbeddedConnectionAsynch::ReceiveMessages
Connection::InvokeCallbacks
Callback::Invoke

At this point, a function pointer is called and doesn't come back until 
Soar
stops.  When there are no clients trying to connect, the fact that this is
stuck is invisible to the user.  This is especially difficult to debug 
because
EmbeddedConnectionAsynch::ReceiveMessages is repeatedly called by other 
threads
that are not blocked, so you can't just set a breakpoint.

What we need to determine is:
What function is being called?
Is this is a general problem for any callback or just the one being called
here?
Is this a Java/SWIG issue (e.g., if the callback is running code in Java)?

If this is a problem specific to the callback that is blocking, it would
explain why Eaters works and TankSoar doesn't in 8.6.2.
------- Comment #12 From Jonathan Voigt 2007-08-16 14:31:13 (-) [reply] ---
----
(In reply to comment #11)
> What function is being called?

It is ultimately, after many layers, the call for the command line "run -i 
o"
(run interleave by output)

The same thread that calls this is responsible for accepting messages from 
the
new debugger connection.

Since the call to run blocks, this thread does not get to process messages
until after the run returns (soar stops).
------- Comment #13 From Jonathan Voigt 2007-08-16 14:39:21 (-) [reply] ---
----
Successful workaround:

In the update callback (registered by Soar2D for event
smlEVENT_AFTER_ALL_GENERATED_OUTPUT or smlEVENT_AFTER_ALL_OUTPUT_PHASES), 
call 

kernel.CheckForIncomingCommands();

Here's the code in Soar2D that works:
Controller.java:272
public void updateEventHandler( ... ) {

...

if (stop) { // this is a member var true if the environment wants to stop

    ...

    kernel.StopAllAgents();
} else {
    // this is the new code
    // since we don't want to stop, check for commands
    kernel.CheckForIncomingCommands();
}

A general solution would be to have this call (CheckForIncomingCommands()) 
in
the run loop in the kernel or SML or wherever it is.
------- Comment #14 From Jonathan Voigt 2007-08-21 16:11:06 (-) [reply] ---
----
News bulletin: I think init-soar is causing the problem.

When Soar2D creates an agent, it does an init-soar during creation. 
Commenting
out this init-soar and sourcing an agent that doesn't do an init-soar on 
load
makes this problem go away.

obscure-bot.soar contains the following:
  rete-net --load "obscure-bot.soarx"
  init-soar

obscure-bot does not appear to work without that init-soar.

Changes in revision 7587:
soar2d.player.tanksoar.SoarTank.java:324:
        //agent.InitSoar();
soar2d.player.eaters.SoarEater.java:645:
        //agent.InitSoar();

I can't remember why I put these init-soar calls there. Removal doesn't 
seem to
affect things immediately. I'll evaluate this later. In the mean time, 
rebuild:
1) In trunk/SoarSuite working copy, update to revision 7587 or better

TankSoar:
2) Build Soar2D and run tanksoar
3) Create agent tutorial/simple-bot.soar (don't spawn debugger)
4) run
5) Start a debugger, connect remotely
6) Debugger connects as expected

Eaters:
7) Run eaters
8) Create agent tutorial/advanced-move.soar (don't spawn debugger)
9) run
10) Start a debugger, connect remotely
11) Debugger connects as expected

TankSoar (obscure-bot):
12) Run tanksoar
13) Create agent obscure-bot.soar (don't spawn debugger)
14) run
15) Start a debugger, connect remotely
16) Observe hang, stop soar

Obscure-bot:
17) Remove init-soar line from obscure-bot.soar, save
18) Run tanksoar
19) Create agent obscure-bot.soar (SPAWN debugger)
20) run
21) observe malfunctioning obscure-bot
------- Comment #15 From Douglas Pearson 2007-08-21 17:27:32 (-) [reply] --
-----
Needless to say that's pretty bizarre as presumably the init-soar has 
completed
long before the debugger comes along to connect.

This suggests to me that the environment (or embedded kernel in the
environment) is getting into some invalid state during the init-soar -- and
then when the debugger tries to connect we get tied into that problem and 
hang.

Does this same behavior happen for any agent with a trailing init-soar 
call, or
just for agent's using a rete-net load? 

What happens if you do two init-soars in a row.  Does the second call crash 
or
leak?  If so that would be a clue that the first init-soar hadn't really 
worked
correctly.

Original issue reported on code.google.com by voigtjr@gmail.com on 23 Jul 2009 at 4:39

GoogleCodeExporter commented 8 years ago
------- Comment #16 From Bob Marinier 2007-08-21 18:32:09 (-) [reply] -----
--
The problem is not tied to rete-net; an independent init-soar will cause 
the
problem.  (Also, in case there is still some doubt, we removed all 
callbacks
and the problem still persisted).

I found that JavaTOH appears to work fine, but it will also fail if you do 
an
init-soar before running (e.g., by clicking the reset button or by 
connecting a
debugger, doing an init-soar, and then disconnecting).

Here are some cases (these are much easier to do with a debug build; the
release build finishes too fast):

1) Start JavaTOH, run.  Connect debugger. (succeeds)
2) Start JavaTOH, reset, run.  Connect debugger. (fails)
3) Start JavaTOH, connect debugger, init-soar from debugger any number of
times.  Disconnect debugger, run.  Connect debugger. (fails)

For the last case, it doesn't actually matter if you disconnect the 
debugger. 
Leaving it connected and trying to connect a second debugger also fails.
------- Comment #17 From Douglas Pearson 2007-08-21 18:58:21 (-) [reply] --
-----
Wow, that's a pretty nasty bug and could have been there for a while I 
guess as
the normal test is to launch the environment, run the agent and connect the
debugger and it sounds like that still works.

Putting together all of these bits of evidence it sounds to me like there's 
a
problem in one of the internal event handlers (i.e. inside the C++) and 
after
an init-soar it's causing that event handler to hang during the next run.

I can't think of any easy way to figure out where it's stuck.  There's not 
a
ton of internal callbacks -- off the top of my head I can think of print 
and
XML handlers, run scheduler, input/output callbacks and init-soar 
listeners.

Removing all of the callbacks from the environment was a nice step as that
rules out a lot more possible cases.

A couple more questions:
-- Just to be clear, after those init-soar trials, there's no problem
connecting the debugger if the agent's not actively running, right?
-- Can you see if the receiver thread is already locked up after the init-
soar
but before the next run?  Also, what if you just do a "run 10" and then 
connect
the debugger after that.  Is it OK in those cases?  I'm assuming the answer
will be yes, which unfortunately makes this much harder to solve.
------- Comment #18 From Bob Marinier 2007-08-21 19:05:33 (-) [reply] -----
--
> A couple more questions:
> -- Just to be clear, after those init-soar trials, there's no problem
> connecting the debugger if the agent's not actively running, right?

Right.  In fact, as soon as the agent stops, a "hung" debugger will
successfully connect (even if Windows says it's not responding).

> -- Can you see if the receiver thread is already locked up after the 
init-soar
> but before the next run?  Also, what if you just do a "run 10" and then 
connect
> the debugger after that.  Is it OK in those cases?  I'm assuming the 
answer
> will be yes, which unfortunately makes this much harder to solve.

I'm not sure how to check if the receiver thread is already locked up. 
Unfortunately, I'm leaving for the rest of the evening right now, but I'll
check these things either later tonight or first thing in the morning.
------- Comment #19 From Douglas Pearson 2007-08-21 19:15:17 (-) [reply] --
-----
I think if you attach the debugger (Visual C++) and set a break point in 
the
ReceiverThread::run() method you should be able to see if it's still 
looping or
if that thread is stuck in a callback.

If it's stuck in a callback before the run starts that would be great as we
could figure out where it's stuck.  Much harder to do once the run begins 
and
there's lots of traffic.

Another test we could try is connecting something simpler through the 
remote
path--e.g. a simple C++ test program that just called 
CreateRemoteConnection().
 If that works it would focus us on the events that the debugger listens to 
as
it connects up.

Similarly, we could try connecting a debugger with just a single text 
window --
that reduces the number of events involved in the remote connection.
------- Comment #20 From Dave Ray 2007-08-21 20:10:43 (-) [reply] -------
I'm sorry I've caused so much trouble here :)
The funny thing is that this bug doesn't even seem to be affecting us.  
Someone
came to me reporting that the Java debugger was locking up when connecting 
to a
running agent.  I tried it with TankSoar and thought I had reproduced it, 
but I
guess I found a different bug. The original bug is intermittent and seems
related to threading or something.

Anyway, when the debugger is frozen you should be able to attach visual 
studio
to the java process and hit break to stop all the threads. Then you can 
poke
around the stack trace of all the threads and see where it's waiting. I 
used
this approach to track down a similar bug (SML call not returning) that 
turned
out to be a bizarre Tcl problem. 
------- Comment #21 From Bob Marinier 2007-08-22 09:15:02 (-) [reply] -----
--
It appears that, regardless of whether an init-soar has been done or not, 
the
receiver thread is not running during a run.  Before and after a run it is
running fine (regardless of whether an init-soar has been done).  As per 
Jon's
and my original analysis, it appears that the run command is executed from 
the
receiver thread, and since it is blocking, the receiver thread stops during 
the
run.

But then how does the debugger connect at all in some cases?  There must be
another thread involved somewhere.

Following Dave Ray's suggestion (I didn't even know there was a threads 
window
in Visual Studio), I see that the event thread is also not running (it is
stopped on the line soar_thread::Lock lock(&m_ClientMutex) in
EmbeddedConnectionAsynch::ReceiveMessages.  The function
EmbeddedConnectionAsynch::GetResponseForID is also stuck in the while 
(wait)
loop.

However, both of these appear to be stuck in the same way even when the
debugger can connect, so I don't think either of these is the problem.

I'm going to put together a minimal example for us to work with.
------- Comment #22 From Bob Marinier 2007-08-22 10:00:43 (-) [reply] -----
--
Ok, here's my minimal example:

#include "sml_Client.h"

using namespace sml;

void main()
{
    Kernel* pKernel = Kernel::CreateKernelInNewThread();
    Agent* pAgent = pKernel->CreateAgent("test");
    pAgent->ExecuteCommandLine("waitsnc -e");
    pAgent->InitSoar();  // remove this line to make things work
    pKernel->RunAllAgentsForever();
    pKernel->Shutdown();
    delete pKernel;
}

There are only 4 threads running (in JavaTOH, there were a lot more because
Java itself has a ton of threads):

1) main thread, which is stuck on the line RunAllAgentsForever in the above 
(as
expected).

2) ListenerThread, which spends most of it's time on the Sleep line in the
while loop (ListenerThread::Run).  This thread shouldn't matter since, even 
in
the cases where things hang, we get a new connection.

3) ReceiverThread, which is running the agent.  This is looping in the run
scheduler, so the thread itself is stopped and not processing any more
commands.

4) EventThread, which is stuck on the line soar_thread::Lock
lock(&m_ClientMutex) ; in EmbeddedConnectionAsynch::ReceiveMessages.  Note 
that
this function is also being called repeatedly by the main thread, which is 
not
stuck (the function returns successfully when called from the main thread).

So what difference does the init-soar make?  Without the init-soar, the
ReceiverThread keeps running, whereas with the init-soar, it does not.  If 
I
put a breakpoint in EmbeddedConnectionAsynch::ReceiveMessages, it 
constantly
gets hit either way.  However, without the init-soar, it is being hit by 
calls
from the receiver thread and the main thread, whereas with the init-soar, 
all
the hits come from the main thread (the receiver thread is stuck on the 
lock).

The callstack on the main thread looks like this:

RunAllAgentsForever
ExecuteCommandLine
SendAgentCommand
SendMessageGetResponse
GetResponse
EmbeddedConnectionAsynch::GetResponseForID
EmbeddedConnectionAsynch::ReceiveMessages

It looks like its waiting for the run call to return (which it never wil), 
and
the loop that does that checks for new messages.  What still isn't clear to 
me
is, if the main thread is checking for messages, then why can't the 
debugger
connect?  It seems it can only connect if the receiver thread is also 
checking
for messages.  Perhaps the messages the main thread gets are not actually
processed, but are merely put in a queue?
------- Comment #23 From Douglas Pearson 2007-08-22 12:28:52 (-) [reply] --
-----
So first up, let me address the event thread.  That's stuck waiting for the
duration of the run intentionally.  The client either processes messages 
from
its main thread or from the event thread -- but not both at once.  When the 
run
call blocks, it takes over the job of dispatching events to the appropriate
handlers while waiting for the run to complete.  While it's doing this it 
locks
out the event thread.  If it didn't the client could see two events 
occurring
at once--making things more complex at the client level.

So the event thread being blocked is fine and as expected.

The listener thread, as you described, just handles the incoming socket
connections and then add the new connection into a list for processing, so 
it's
not a key player.

Alas, I don't remember now exactly what the call sequence should be when 
the
debugger connects, but I'm sure the key is still going to be finding out 
where
that receiverthread is blocked.  I think it must be in an event handler
somewhere that is failing to return--presumably because it's stuck behind a
mutex that the main thread owns.  But why that changes after an init-soar 
is
anybody's guess.

I can try running the minimal example and trying to trace this myself but 
that
won't happen for a few days.
------- Comment #24 From Bob Marinier 2007-08-22 14:39:20 (-) [reply] -----
--
To cut a long story short, the problem is in this gSKI function:

void InputWorkingMemory::Reinitialize()
   {
      if ( m_rootInputObject != 0 ) {
         m_rootInputObject->Release();
         m_rootInputObject = 0;
      }
      ReleaseAllWmes();
      ReleaseAllWMObjects();
      ReleaseAllSymbols();
   }

The callstack that leads here looks like this:

CommandLineInterface::DoInitSoar
gSKI::Agent::Reinitialize
reinitialize_soar
clear_goal_stack
do_input_cycle
soar_invoke_callbacks (the one near the top)
gSKI::InputLink::InputPhaseCallback
gSKI::InputLink::FinalUpdate
gSKI::InputLink::Reinitialize
gSKI::InputWorkingMemory::Reinitialize

To make the problem go away, I have to comment out lines so the function 
looks
like this:

void InputWorkingMemory::Reinitialize()
   {
      if ( m_rootInputObject != 0 ) {
         //m_rootInputObject->Release();
         //m_rootInputObject = 0;
      }
      ReleaseAllWmes();
      //ReleaseAllWMObjects();
      ReleaseAllSymbols();
   }

If any of those lines are left in, either the problem persists or there's a
crash.  Amazingly, leaving in the m_rootInputObject = 0; line causes the
problem to persist.

I suspect that m_rootInputObject is released twice (once in the function 
above,
and once in ReleaseAllWMObjects).  That in and of itself probably isn't an
error, but maybe something is waiting on this object, and if it isn't fully
released, something gets stuck?

The upshot may be that, once gSKI is out, maybe this problem will go away.  
To
test this, I tried the simple test program on the gSKI removal branch. 
Unfortunately, it has exactly the same problem, and exactly the same 
changes
make it go away (so I guess gSKI is not entirely out of the loop yet).

Any ideas on what to do next?

PS I'm going out of town in a few hours for the rest of the week, but I 
expect
to be able to check email while I'm gone.
------- Comment #25 From Douglas Pearson 2007-08-22 15:19:42 (-) [reply] --
-----
That's great going Bob to track it that far.

I think you're right that we should just wait for gSKI to actually be 
pulled
and then see if the problem survives.

gSKI still exists in a parallel universe right now on the gSKI removal 
branch,
with a lot of the startup and shutdown still happening--just none of the 
real
work going through it any more.

Hopefully when it's finally extracted this problem will just go away.
------- Comment #26 From Dave Ray 2007-08-22 15:33:21 (-) [reply] -------
I'm content with waiting for the gSKI removal branch. This bug isn't a
showstopper for us at this point.
------- Comment #27 From Bob Marinier 2007-08-28 11:41:25 (-) [reply] -----
--
I noticed that, at least with Soar2D, nothing shows up on the input-link at 
all
if I make the change described in comment 24.  So it's possible that these
lines are not the root of the problem, but merely mask it by eliminating 
the
input-link.

Original comment by voigtjr@gmail.com on 23 Jul 2009 at 4:39

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 23 Jul 2009 at 5:29

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 23 Feb 2010 at 7:39

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 3 Mar 2010 at 4:05

GoogleCodeExporter commented 8 years ago
This got fixed at some point, just tested it and it works.

Original comment by voigtjr@gmail.com on 15 Mar 2010 at 8:09