mozart / mozart2

Mozart Programming System v2
http://mozart.github.io/
BSD 2-Clause "Simplified" License
566 stars 96 forks source link

Incorrect error message on fullUnify failure #312

Closed layus closed 6 years ago

layus commented 6 years ago

Under heavy load on my machine, I get errors like

%***************************** failure **************************
%**
%** Tell: [1] = [1]

This occurs most reliably with vm tests when many instances are started in parallel.

Here is a gdb trace. The strange part is the unification happening between a reference and a cons. It probably hits https://github.com/layus/mozart2/blob/002924d2bdfc9778d6a61f738455a16351c7a527/vm/vm/main/unify.cc#L437 at some point. On runs where the bug does not occur, there is no unification happening on values with different types.

Is it normal to try to unify a reference ? Is that a bug or should I look elsewhere ?

(gdb) frame 1
#1  0x0000000000590a94 in mozart::fullUnify (vm=0x963c58, left=..., right=...) at /home/gmaudoux/projets/mozart-build/mozart2/vm/vm/main/unify.cc:88
88      fail(vm, info);
(gdb) p (*left._node)
$1818 = {{data = {type = {_info = 0x936b70 <mozart::RawType<mozart::Cons>::rawType>}, value = {it = 140737284825062, next = {it = 230, next = {
            it = 6.9533457521037724e-310, next = {it = {<No data fields>}, next = {it = {_impl = 0x7ffff3de5fe6}, next = {it = {
                    space = 0x7ffff3de5fe6}, next = {it = 0x7ffff3de5fe6}}}}}}}}, {grNext = 0x936b70 <mozart::RawType<mozart::Cons>::rawType>, 
      grFrom = 0x7ffff3de5fe6}}}
(gdb) p (*right._node)
$1819 = {{data = {type = {_info = 0x936350 <mozart::RawType<mozart::Reference>::rawType>}, value = {it = 140737284825094, next = {it = 6, next = {
            it = 6.9533457521053534e-310, next = {it = {<No data fields>}, next = {it = {_impl = 0x7ffff3de6006}, next = {it = {
                    space = 0x7ffff3de6006}, next = {it = 0x7ffff3de6006}}}}}}}}, {grNext = 0x936350 <mozart::RawType<mozart::Reference>::rawType>, 
      grFrom = 0x7ffff3de6006}}}
(gdb) 

/cc @eregon, @sjrd

sjrd commented 6 years ago

Is it normal to try to unify a reference ?

Absolutely not; it is always a bug. In fact having a RichNode that is observed to be a Reference is always a bug. The RichNode constructor automatically dereferences References, all the way down the chain if there is a chain. If the node is turned into a Reference later, the RichNode is invalid and cannot be used anymore, or one needs to call its update() method.

layus commented 6 years ago

Any idea where to look ? It should not happen, but happens nonetheless...

sjrd commented 6 years ago

printf's along the way from where those RichNodes are first built, up to whether they're observed to be References ... That's how I would start.

layus commented 6 years ago

The printf of the rich node will show it's type ?

Because the path is quite simple, it arises in {VM.list} = [1]. So a builtin call and a unification right after.

sjrd commented 6 years ago

The printf of the rich node will show it's type ?

Yes.

And I mean the path within C++ code of unify.cc. RichNodes never survive across two opcodes.

layus commented 6 years ago

More from this issue. The unification happens on [1 2] = [1]. Because unification rebinds left to right, the error records somehow ends up being eq([1] [1]). cleanupOnFailure(vm) does not work correctly, as the reference still exists after the undoBindings. This led me to further investigate how rebind works. Apparently, rebind undo is based on pointers and node copies. Seems hacky but not faulty. I am more worried about the code for rebinding itself.

void StructuralDualWalk::rebind(VM vm, RichNode left, RichNode right) {                                                                                                   
  rebindTrail.push_back(vm, left.makeBackup());                                                                                                                           
  left.reinit(vm, right);                                                                                                                                                 
}

so we backup left, then overwrite it with right. But reinit has some code paths that do modify right in that setup (right is not a const parameter)

Just two excerpts from https://github.com/mozart/mozart2/tree/master/vm/vm/main/store.hh:

void RichNode::reinit(VM vm, StableNode& from) {                                                                                                                          
  if (node() == &from) {                                                                                                                                                  
    // do nothing                                                                                                                                                         
  } else if (from.isCopyable()) {                                                                                                                                         
    node()->set(from);                                                                                                                                                    
  } else {                                                                                                                                                                
    node()->make<Reference>(vm, &from);                                                                                                                                   
  }                                                                                                                                                                       
}                                                                                                                                                                         

void StableNode::init(VM vm, UnstableNode& from) {                                                                                                                        
  set(from);                                                                                                                                                              
  if (!isCopyable())                                                                                                                                                      
    from.make<Reference>(vm, this);                                                                                                                                       
}                                                                                                                                                                         

In the first case, a pointer to from is taken for reference, which may or may not work with undo, and in the second case from is modified, which should never, ever happen as undo is based on resetting the left operand, and not the right one. Now I am unable to follow the various templates and overrides to check which ones are truly called in the end, but this rebinding gets in the way here. The real error lies behind, with the [1 2] = [1] but this was painful to debug.

layus commented 6 years ago
Dual walk (unify) on [1 9 18] and [1]                              <- OK
void mozart::RichNode::reinit(mozart::VM, mozart::RichNode)
void mozart::RichNode::reinit(mozart::VM, mozart::UnstableNode &)
void mozart::StableNode::init(mozart::VM, mozart::UnstableNode &)
After rebinding, we have [1] and <Reference>                       <- OK
Failed dual walk (unify) on [1 9 18] and <Reference>               <- NOOOOO ;)

So, the three functions are:

void RichNode::reinit(VM vm, RichNode from) {
  std::cout << BOOST_CURRENT_FUNCTION << "\n";
  if (from.isStable())
    reinit(vm, from.asStable());
  else
    reinit(vm, from.asUnstable());
}

void RichNode::reinit(VM vm, UnstableNode& from) {
  std::cout << BOOST_CURRENT_FUNCTION << "\n";
  if (node() == &from) {
    // do nothing
  } else if (isStable()) {
    asStable().init(vm, from);
  } else {
    asUnstable().init(vm, from);
  }
}

void StableNode::init(VM vm, UnstableNode& from) {
  std::cout << BOOST_CURRENT_FUNCTION << "\n";
  set(from);
  if (!isCopyable())
    from.make<Reference>(vm, this);
}

The last line is the culprit as it is incompatible with the undoBindings logic. Now we can either wonder at why a simple list from VM.list is non-copyable, or fix this forever. In both cases I need some further insights.

sjrd commented 6 years ago

wonder at why a simple list from VM.list is non-copyable

"copyable" in mozart2 means that the 2 words of a Node can be freely copied to another Node, without having to take care of any other storage behind (also the identity must not matter). In practice, very few data types are copyable (small ints, booleans, References, and one or two more). Even a | is not copyable because it needs extra storage for its head and tail. So it's normal that the list from VM.list is non-copyable.

layus commented 6 years ago

@sjrd Thanks for the clarification. So a.reinit(vm, b) modifies b in most cases when a is stable and b is unstable.

First attempt at solving:

 void StructuralDualWalk::rebind(VM vm, RichNode left, RichNode right) {
+  if (right.isStable()) {
+      std::cout << "copy right\n";
       rebindTrail.push_back(vm, left.makeBackup());
       left.reinit(vm, right);
+  } else if (left.isStable()) {
+      std::cout << "copy left\n";
+      rebindTrail.push_back(vm, right.makeBackup());
+      right.reinit(vm, left);
+  } else {
+    std::cout << "no rebind\n";
+  }
 }

It fixes my bug, but triggers failures in other test cases like mozart2/vm/vm/main/runnable.hh:139: virtual void mozart::Runnable::resume(bool): Assertion `!_dead && !_terminated' failed. in base/pickle.oz.

layus commented 6 years ago

What is this Stable vs Unstable distinction by the way ?

sjrd commented 6 years ago

Stable is a fixed thing, it's like an Oz variable. Once it's initialized, it can only evolve in ways that are compatible to how Oz variables evolve. For example, it can start its life as an OptVar, then later be bound to a |, at which point it's immutable. StableNodes are always allocated on the VM heap, either isolated (as when constructed with new (vm) StableNode(...)) as part of bigger data structures, where said structures are immutable from an Oz point of view. For example, the head and tail of a | are StableNodes.

Unstable is mutable. It's like the contents of an Oz Cell. You can wholesale replace the content of an UnstableNode with something completely unrelated. Most UnstableNodes live on the C++ stack, but some also live in the heap for things which are mutable from an Oz point of view. The contents of a Cell is an obvious example.

Since Unstable nodes can change, they cannot be pointed to by References (otherwise, changing the contents of the UnstableNode would destroy the Reference).

layus commented 6 years ago

Okay, the following code seems to work without breaking other tests. I have no idea why, but if no-one complains, this is what I intend to PR.

 void StructuralDualWalk::rebind(VM vm, RichNode left, RichNode right) {
+  if (right.isStable()) {
       rebindTrail.push_back(vm, left.makeBackup());
       left.reinit(vm, right);
+  } else {
+      rebindTrail.push_back(vm, right.makeBackup());
+      right.reinit(vm, left);
+  }
 }