lab-cosmo / librascal

A scalable and versatile library to generate representations for atomic-scale learning
https://lab-cosmo.github.io/librascal/
GNU Lesser General Public License v2.1
80 stars 20 forks source link

Memory error when dereferencing StructureManager iterators (unintuitive behaviour) #97

Open max-veit opened 5 years ago

max-veit commented 5 years ago

This is related to a recent segfault, just fixed by @agoscinski in ff03d03816e, but the behaviour of the underlying iterators that led to the segfault is unintuitive and dangerous enough that I think it deserves to be classified as a bug.

The problem arises when doing an iteration over pairs (i.e. over centers, then neighbours) by using the StructureManager iterators explicitly rather than by using nested for-loops. Apparently, the inner iterator created this way: https://github.com/cosmo-epfl/librascal/blob/5fc27332dc2175334eed6de3ea8bbcfeeedd756c/tests/test_representation_manager.hh#L550 doesn't know the type of the containing StructureManager anymore (???) so dereferencing it may produce a segfault. This is especially unintuitive to me, as I'd expect the same pattern to work just fine with other nested-iterable structures (e.g. vectors of vectors).

What I'm looking for is either 1) a convincing explanation of why this explicit iteration pattern is the Wrong Thing for C++ iterables in general, or 2) a fix that makes neighbourlist-adapted StructureManagers compatible with the general intuition for iterating over nested structures.

mastricker commented 5 years ago

Hi,

I'll have a look. I can not seem to find the file in the tree - it is not in master. Is it in SOAP- gradients? With your link you are referring to the not corrected version. ff03d03816e has the fix for that part, right?

So even with the fix it still may segfault?

I recall vaguely that there was a similar issue when writing the adaptor_increase_maxorder.

agoscinski commented 5 years ago

@mastricker yes it is in feat/soap_gradients, yes ff03d03816e fixed it. The usage of the function get_iterator_at as in adaptor_increase_maxorder fixed it. The usage as @max-veit linked it, does result in a broken cluster_indices_container, because the actual structure manager type is lost and only StructureManagerBase remains as type. Why it forgets the SM type is not clear yet.

mastricker commented 5 years ago

So, it is fixed -- can we include that fix?

I will still look at it.

agoscinski commented 5 years ago

I think the problem max mentioned was that: His code seems valid to write, but results in a segfault. So even this code should work, or there should be a reason why this code pattern is not allowed. Especially, because now two people invested a lot of time into this error, we should make somehow sure that this does not happen again for a person who is not familiar with this issue.

mastricker commented 5 years ago

I see. I'll try to figure it out.

mastricker commented 5 years ago

Ok, coming back to this. As far as I understand, the following statement should be invalid to get an object. (manager.begin())->begin() Because: manager.begin() returns an iterator. To dereference that one, you should do something like *{manager.begin()}. That is what for (auto a : mngr) {} does. What you get as a is a dereference of manager.begin(). The pair is the dereference of the iterator of the atom for (auto p : a){}. Now p in usage is a dereference of the iterator over a.

In any case, I think the dereference step is missing. .begin() only returns an iterator, which has to be dereferenced to become an object, which in turn can be used as an iterator (the dereference of the iterator over the atom is a pair).

As a vector and only one step:

std::vector<Object> vec;
std::vector<Object>::iterator iterator;
iterator = vec.begin();
Object o = *it;

Example in our case which worked in the test (exchanged the pair with pair2 in the calculator) including all steps:

    // doesn't work, dunno why, see nested for loops below for replacement hack
    // auto & pair = (manager.begin())->begin();
    auto && it_mngr{manager->begin()};
    auto && atom{*it_mngr};
    auto && it_atom{atom.begin()};
    auto && pair2{*it_atom};

or in one line

auto && pair2{*(*(manager->begin())).begin()};

See e.g. So what that first statement did is not defined. The 2nd begin() is only valid if the object on which it is called on has the member. But if manager->begin() is not dereferenced, it stays an iterator. Does that make sense?

max-veit commented 5 years ago

Ok, thanks. Sorry for the confusion, but I think this issue is not actually what you are referencing above ((manager.begin())->begin(), where I naïvely expected the -> operator to do the same thing as (*it). -- but no, apparently I was expecting too much of c++ semantics). I'll convert the radial gradient test to this more compact version, but I'm afraid this doesn't solve the issue with the original implementation of swap_pair_key() (exact version and line referenced above)

felixmusil commented 5 years ago

Your example Markus with pair2 is what we expected should work. As you say it compiles and run but we found out that the type of the underlying manager gets somehow lost to a more mundane StructureManagerBase in a random fashion. So the hand made looping using:

for (auto ii{0}; ii < some_value; ii++) {
  auto && pair2{*it_atom};
  ... some stuff ...
  it_atom++;
}

is actually buggy (seg fault) in an unpredictable way because pair2would lookup for atom_tags in a table filled with random values.

mastricker commented 5 years ago

I dug a little bit. And I am utterly confused. gcc Release: segfault gcc RelWithDebug: segfault gcc Debug: fine

I am now running a valgrind and see what that will turn out. My feeling is that references are missing. You are/were passing around auto types but largely not with a &/&&. I will be back. The segfault seems to come from an uninitialized value. But that should lead to a std::range_error.

max-veit commented 5 years ago

Oh nooooo, another Heisenbug... haven't we had enough of those already?

And you've tried this in the debugger, right? To see when/where the reference or type gets scribbled over?

mastricker commented 5 years ago

Ok, so the current status is the following: Everything seems to work as expected, dereference and everything, but this line makes the control flow go over it.

      while ((new_center != structure_manager->end()) &&
             ((*new_center).get_atom_tag() != pair_key.get_atom_tag())) {

I changed that to the following

while ((it != end)) {
        if ((*it).get_atom_tag() == pair_key.get_atom_tag()) {
          break;
        } else {
          ++it;
        }
      }

I am currently in the state where I get the programmer std::range_error, not a segmentation fault. And I can follow in the debugger now. So at least partially the error seems to lie in the control flow. I'll keep you updated.

mastricker commented 5 years ago

Ok, I am stopping here for now. My error is now in an access to block sparse property:

Program received signal SIGSEGV, Segmentation fault.
0x000055555602167e in std::_Rb_tree<std::vector<int, std::allocator<int> >, std::pair<std::vector<int, std::allocator<int> > const, std::tuple<int, int, int> >, std::_Select1st<std::pair<std::vector<int, std::allocator<int> > const, std::tuple<int, int, int> > >, std::less<std::vector<int, std::allocator<int> > >, std::allocator<std::pair<std::vector<int, std::allocator<int> > const, std::tuple<int, int, int> > > >::_M_begin (this=0x12216db5fc0af610) at /usr/include/c++/8/bits/stl_tree.h:753

It is this spot:

      //! access or insert specified element
      reference operator[](const SortedKey_t & skey) {
        auto & pos{this->map[skey.get_key()]};
        return reference(&this->data[std::get<0>(pos)], std::get<1>(pos),
                         std::get<2>(pos));

I can get the rest to work with using different logic with the reference and dereference of the atom and pairs. But I am not sure what this error is relating to.

felixmusil commented 5 years ago

the key you are trying to access the property with has not been registered at the resize step. Either you forgot to resize with the proper list of keys or the provided key is not valid.

mastricker commented 5 years ago

I did not change anything except the logic of the search for the (i, j) to (j, i) swap.

felixmusil commented 5 years ago

my guess is that you (i,j) and (j,i) don't have the same species present in the neighbourhood but the keys used to access are the same

mastricker commented 5 years ago

So that would be an actual bug, but unrelated to the dereference usage of the manager/atom iterators?

ceriottm commented 5 years ago

related to #87

mastricker commented 5 years ago

I am lost here - is the issue #87 closed because it is solved or closed because it is the same problem as here, i.e. issue merge?

felixmusil commented 5 years ago

The issues were merged.

Le 4 oct. 2019 11:28, mastricker notifications@github.com a écrit :

I am lost here - is the issue #87https://github.com/cosmo-epfl/librascal/issues/87 closed because it is solved or closed because it is the same problem as here, i.e. issue merge?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/cosmo-epfl/librascal/issues/97?email_source=notifications&email_token=AELQ545TM4UYATMJAY3QFA3QM4EDVA5CNFSM4H2553H2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEALCYPQ#issuecomment-538324030, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AELQ543KYDBEOCJJRLP7PJTQM4EDVANCNFSM4H2553HQ.