paesanilab / MBX

MBX is an energy and force calculator for data-driven many-body simulations.
Other

icpc fails to compile code when optimization is activated #18

Closed chemphys closed 7 years ago

chemphys commented 7 years ago

Hey @zonca ,

This repo compiles fine with clang, g++, and Intel without optimizations. However, I run into an error when I compile with optimization enabled (-O1, -O2, or -O3).

icpc: error #10106: Fatal error in /data/software/repo/intel/2017.0.098/compilers_and_libraries_2017.0.098/linux/bin/intel64/mcpcom, terminated by kill signal

A bit of googling suggests this is a memory error, but adding the flag -mcmodel=large doesn't help either. I also tried multi-file optimization (-ipo) and it still fails. Any ideas? @agoetz , @darcykimball , maybe you know how to fix this?

Thanks!

zonca commented 7 years ago

Hi, investigating this kind of weakly identified issue in a very large codebase takes a large amount of time. Sorry, I don't have the time available to do it.

darcykimball commented 7 years ago

Is it a segfault (SIGSEGV)? Can you get a stacktrace?

chemphys commented 7 years ago

It's in the compilation process. The error I showed in the previous message is everything that appears. Is there any option in icpc to produce a full compilation log? I can't find anything else. How can I get a stacktrace when using make?

darcykimball commented 7 years ago

From googling, it seems that you can at least make it emit a report of the optimizations made; check the man page or somewhere on https://software.intel.com/en-us/node/522791. If something went wrong during optimization, the report should give some indication; if the compiler simply ran out of memory, that should be noticeable too (hopefully).
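For reference, icpc can emit an optimization report showing what the vectorizer and loop optimizer did, which may hint at where it gave up. A hypothetical Makefile fragment, assuming classic icpc; flag spellings vary by compiler version, so verify them against `man icpc`:

```make
# Assumed flags for classic Intel icpc -- check your version's man page.
# -qopt-report=5 requests the most verbose report; -qopt-report-phase
# limits it to the vectorization and loop-optimization phases.
CXX      = icpc
CXXFLAGS = -O2 -qopt-report=5 -qopt-report-phase=vec,loop
```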

If the compiler didn't spit out a stacktrace, then you may just not be able to get one; this is proprietary software, after all. Clang, for example, spits out a full stacktrace plus scripts for submitting a bug report, etc., when a segfault happens. Have you tried checking with Intel customer support? This is the sort of thing where, if it's a compiler bug, you'll have to ask about it. For example, if the optimizer is taking way too much memory, it might just be crappy; there's no way for you to fix that.

It could be gnarly, in any case, so customer support forums would be the sanest option to take first.

chemphys commented 7 years ago

Thanks @darcykimball for the suggestion of splitting the function. Just a note: I just pushed the polynomials for the 2b calculation, as they will also be used for the three-body one. If you check poly-2b-v6x.cpp in src/potential/2b/ , you will see that the polynomials now accept a set of dimers instead of a single dimer. The idea is that, since the operations are the same for all dimers, we can let icpc optimize and vectorize this part of the code, which is the most time-consuming one in the 2b evaluation. If we have the function, as you suggested:

foo(double a[1000]) {
  t1 = ...
  t2 = ...
  t3 = ...
  t4 = ...
  ...
  t10000 = ...
  return ...
}

becomes...

foo1(double a[1000], double t[10000]) {
  t[0] = ...
  ...
  t[4999] = ...
}

foo2(double a[1000], double t[10000]) {
  t[5000] = ...
  ...
  t[9999] = ...
}

real_foo(double a[1000]) {
  double t[10000];
  foo1(a, t);
  foo2(a, t);
  return ...hopeless expression with a bunch of t's...;
}

Will the compiler still be able to optimize them in a loop like:

foo(double a[1000], size_t n) {
  for (size_t i = 0; i < n; i++) {
    t1 = ...
    t2 = ...
    t3 = ...
    t4 = ...
    ...
    t10000 = ...
  }
  return ...
}

where this becomes

foo1(double a[1000], double t[10000]) {
  t[0] = ...
  ...
  t[4999] = ...
}

foo2(double a[1000], double t[10000]) {
  t[5000] = ...
  ...
  t[9999] = ...
}

real_foo(double a[1000], size_t n) {
  for (size_t i = 0; i < n; i++) {
    double t[10000];
    foo1(a, t);
    foo2(a, t);
  }
  return ...hopeless expression with a bunch of t's...;
}

So, will that loop in foo be enough, or do we have to put the loop inside foo1 and foo2? You can check https://github.com/chemphys/clusters_ultimate/blob/master/src/potential/2b/poly-2b-v6x.cpp to see what I mean by the loops. Thanks!!

darcykimball commented 7 years ago

Couple things:

The way I split it up was arbitrary, since I just wanted to see if splitting would make it compilable; I chose to pass around the intermediate values (all those t# variables) as an array. This should change how things are optimized, e.g. unless you put restrict annotations on the array arguments, the compiler won't be able to convince itself that each temporary doesn't change once initialized. There are crude (global variables) and not-as-crude (a static one-time allocation, or a class member) alternative ways to do this, I think.

If you want it to be optimized more or less the same as before, I'd reckon you'd just have to make sure the compiler has the same amount of information at the point of optimization as if the function were in a single translation unit, e.g. that all temporaries are really one-time temporaries, that pointers don't alias, etc. I'm not familiar with icpc extensions and such, but the info on how to control these things should be out there.

Finally, if I understand what you're asking about where to put the loops: I think any decent compiler should be able to optimize what was written above (as is) by inlining foo1 and foo2 and vectorizing as it sees fit. That is, it shouldn't matter which way you write it; statically, the compiler should be able to tell that all this stuff is being done over the same range of indices. First though, just to ask: have you guys checked how much (if at all) the compiler vectorizes the operations in these polynomial functions? It could be counterproductive to write the loops that way if the compiler doesn't end up leveraging it.

darcykimball commented 7 years ago

I used something like:

:%s/\vt(\d+)/t[\1]/g

in vim, which in English is "replace all occurrences of t followed by at least one digit, by t, [, those digits, and then ]". Sed or perl can do the same thing with about as much typing.
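For example, the sed equivalent, shown on a made-up line of generated code:

```shell
# Same rewrite with sed's extended regexes: t<digits> -> t[<digits>]
echo "t12 = t3*t4 + x;" | sed -E 's/t([0-9]+)/t[\1]/g'
# prints: t[12] = t[3]*t[4] + x;
```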

On Mon, Sep 25, 2017 at 3:02 PM, Marc Riera Riambau <notifications@github.com> wrote:

By the way, @darcykimball , how do you replace all t??? by t[???] ?


chemphys commented 7 years ago

Hey Kevin,

I have not done any checks yet. I am working on it right now, but I first wanted to make the code compile with optimizations. I will check the optimization reports, look at the timings, and then we will see what is faster. I will post it here on GitHub once I have everything in place.

And thanks for the command!