sleyzerzon / soar

Automatically exported from code.google.com/p/soar

properly fix watch 5 with --fullwmes workaround #12

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Description From Bob Marinier 2006-03-29 16:32:15
1) Start TestCLI
2) source toh
3) w 5
4) w --fullwmes
5) run

Around dc 2 we segfault with the following error message:

Internal Soar Error:  symbol_to_string called on bad symbol
Soar cannot recover from this error.
You will have to restart Soar to run an agent.
Data is still available for inspection, but may be corrupt.
If a log was open, it has been closed for safety.Internal Soar Error:  symbol_to_string called on bad symbol
Soar cannot recover from this error.
You will have to restart Soar to run an agent.
Data is still available for inspection, but may be corrupt.
If a log was open, it has been closed for safety.

Note that this was originally encountered in Java TankSoar so it's not
specific to TestCLI.

This works fine in 8.5.2.
------- Comment #1 From Bob Marinier 2006-03-29 17:34:38 -------
Further testing reveals that this does not crash in 8.6.1, but it does crash
in 8.6.2-r0 (from 10/21/2005; I think Laird was the only one to see that
build, but I still have it if people want it).
------- Comment #2 From Jonathan Voigt 2006-03-31 14:54:04 -------
Increased severity; this is impeding JavaTankSoar development.
------- Comment #3 From Karen Coulter 2006-03-31 15:18:41 -------
This is crashing on a retraction of a proposal.  The ref counts of the wme
are already 0, and the ID is invalid.  The wme timetag still exists, which is
why it can be printed, and there is no crash if --fullwmes is not on.  Not
sure why the ref counts are already gone; someone is decrementing WMEs and
IDs when they shouldn't.

This bug is much bigger than the print problem.
------- Comment #4 From Douglas Pearson 2006-03-31 15:46:25 -------
Do we know what wme it's failing on...in particular if it's an input wme?
------- Comment #5 From Karen Coulter 2006-03-31 15:57:01 -------
Not an input wme.  This example happens in Soar Demos TOH, no I/O involved.

retracting (36: S1 ^last-disk-moved D2)
------- Comment #6 From Karen Coulter 2006-03-31 16:44:19 -------
The reference count goes from around 283 to all of a sudden being invalid, so
I'm beginning to wonder if there isn't a memory overwrite problem with this
combination of watch settings.  Don't ask me why I think that, I just do.
Anybody got any ideas?
------- Comment #7 From Karen Coulter 2006-04-07 15:55:40 -------
An interim workaround is to build SoarKernel with

#define DO_TOP_LEVEL_REF_COUNTS

in kernel.h

I haven't nailed down the exact solution yet.  The print is coming too late
during the retraction.  I think this happens only to wmes on the top state
because they don't have any ref cts from justifications that keep them from
being deallocated before the print call.  It all has to do with preferences
and clones and when they exist or don't.
------- Comment #8 From Karen Coulter 2006-05-04 13:23:51 -------
Soar was crashing when trying to print, with --fullwmes, the LHS conds of a
retraction after at least one of those wmes had already been removed from the
rete and possibly deallocated.

When DO_TOP_LEVEL_REF_CTS is defined, wmes at the top level are never
deallocated, therefore prefs aren't deallocated, and the chain hangs around
forever, so print --fullwmes would always work.  I've looked at adding and
removing a ref count on cond->bt.wme_, but the least risky fix for the
release is to not print full conds when retracting instantiations.  I've
checked in that change so that Soar won't crash.  Changing to non-blocking.
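
For illustration only, the ref-count alternative mentioned above (hold a
reference on the wme for as long as the retraction trace needs it) could look
roughly like the C sketch below.  This is not Soar kernel code: the wme
struct, wme_new, wme_add_ref, wme_remove_ref, retract_with_trace and
live_wmes are all hypothetical, simplified stand-ins, and the real kernel's
ownership rules for wmes, preferences and clones are more involved.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a working memory element. */
typedef struct wme {
    int ref_count;
    unsigned long timetag;
    char id[8];
    char attr[32];
    char value[8];
} wme;

/* Number of wmes currently allocated (for demonstration only). */
static int live_wmes = 0;

/* Allocate a wme that starts with one reference (working memory's). */
static wme *wme_new(unsigned long timetag, const char *id,
                    const char *attr, const char *value) {
    wme *w = malloc(sizeof *w);
    if (!w) abort();
    w->ref_count = 1;
    w->timetag = timetag;
    snprintf(w->id, sizeof w->id, "%s", id);
    snprintf(w->attr, sizeof w->attr, "%s", attr);
    snprintf(w->value, sizeof w->value, "%s", value);
    live_wmes++;
    return w;
}

static void wme_add_ref(wme *w) {
    w->ref_count++;
}

/* Drop one reference; free the wme when the last one goes away. */
static void wme_remove_ref(wme *w) {
    assert(w->ref_count > 0);
    if (--w->ref_count == 0) {
        live_wmes--;
        free(w);
    }
}

/* Retract a wme while tracing it in full.  The trace takes its own
 * reference first, so the wme stays valid after working memory lets
 * go of it.  Returns the number of characters written to buf. */
static int retract_with_trace(wme *w, char *buf, size_t n) {
    int len;
    wme_add_ref(w);     /* trace holds the wme */
    wme_remove_ref(w);  /* working memory drops its reference */
    len = snprintf(buf, n, "(%lu: %s ^%s %s)",
                   w->timetag, w->id, w->attr, w->value);
    wme_remove_ref(w);  /* trace is done; the wme is freed here */
    return len;
}
```

Calling retract_with_trace on a wme built with
wme_new(36, "S1", "last-disk-moved", "D2") formats the same full wme that
appears in comment #5.  Without the extra reference, the snprintf call would
read memory that had already been freed, which is the kind of failure
described in this report.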
------- Comment #9 From Jonathan Voigt 2008-06-18 12:25:06 -------
The symbol DO_TOP_LEVEL_REF_COUNTS does not exist in the trunk code. The
crash doesn't happen anymore. Are we going to fix this further?
------- Comment #10 From Bob Marinier 2008-06-18 13:34:27 -------
Actually, it should be DO_TOP_LEVEL_REF_CTS, which does appear to still be in
the kernel code (there's a #define for it in kernel.h, which is commented
out).

According to Karen's last comment, she worked around the problem by not
printing everything (hence it no longer crashes).  It's not clear whether we
should accept that as the final fix or not (it's not clear what alternatives
exist).

Original issue reported on code.google.com by voigtjr@gmail.com on 23 Jul 2009 at 4:19

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 23 Jul 2009 at 5:29

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 23 Feb 2010 at 7:39

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 23 Feb 2010 at 8:11

GoogleCodeExporter commented 8 years ago

Original comment by voigtjr@gmail.com on 3 Mar 2010 at 4:05