ilpincy / argos3

A parallel, multi-engine simulator for heterogeneous swarm robotics
http://www.argos-sim.info/

Segfault from inside i965_dri.so when selecting box #40

Closed allsey87 closed 6 years ago

allsey87 commented 7 years ago

Hi,

Referring to gdbbacktrace.txt, it seems when I select a box in ARGoS I get a segfault from somewhere inside i965_dri.so (my intel graphics driver).

It is a bit strange, as one of the last calls before the trace disappears inside the graphics driver is to CQTOpenGLOperationDrawBoxNormal, which should always run on each frame. I would have expected the segfault to come after a CQTOpenGLOperationDrawBoxSelected.

I have tested this on a vanilla clone of this repo, checked out on the same date as this post. I am running Ubuntu Gnome 16.04 LTS, this is my uname -a output

Linux Fujitsu-Laptop 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

For details about the driver in use see: module-info-i915.txt

ilpincy commented 7 years ago

This is a known issue. Currently I have no idea why it happens, and why only on this specific graphics card. I suspect that this is due to a funny interaction in systems that have dual cards, and that disabling one in the BIOS might help, but I haven't had the time to verify my suspicion.

allsey87 commented 7 years ago

Did this issue rear its head prior to switching to Qt5? I have no such problem on an older version (forked end of 2013) of ARGoS 3 with Qt4.

beltrame commented 7 years ago

I've always had this issue, and I started using ARGoS in 2015. I only have an i915 (no dual card).

allsey87 commented 7 years ago

Also, as far as I'm aware, I don't have a secondary graphics card. If you suggest a couple of commits of interest where this issue may have come from, I will build and test them to confirm whether or not the issue is present.

ilpincy commented 7 years ago

The switch to Qt5 did not change the ARGoS code in any particular way. Most of that was just refactoring to adapt to the new OpenGL class. If your system does not have a double card, then I don't really know where to look. I agree that it must be something with Qt5 though... The problem is that I don't have a card I can test on - I'll ask a student if I can borrow his laptop for a few hours. For the commit, run a git log on the folder of the OpenGL visualization and you'll see the entire history.

allsey87 commented 7 years ago

I was thinking along the lines that perhaps this has to do with a new version of QOpenGLWidget...

allsey87 commented 7 years ago

@beltrame so you have had this issue before @ilpincy moved to Qt5 at the beginning of 2017?

ilpincy commented 7 years ago

Is it possible to install a debug version of the driver, so we see where it explodes? Maybe that would shed some light.

allsey87 commented 7 years ago

Never tried anything like that before, and since I don't have a back up computer at the moment I'm not particularly keen. I just hacked qtopengl_box.cpp by adding c_visualization.DrawBoundingBox(c_entity.GetEmbodiedEntity()); after c_visualization.DrawEntity(c_entity.GetEmbodiedEntity()); to draw the bounding box without selecting it... works fine? although, as soon as I select, out comes the smoke...

As a further test, I completely commented out this block of code from qtopengl_widget.cpp:

if(m_sSelectionInfo.IsSelected) {
    glPushMatrix();
    CallEntityOperation<CQTOpenGLOperationDrawSelected, CQTOpenGLWidget, void>(*this, *vecEntities[m_sSelectionInfo.Index]);
    glPopMatrix();
}

Exact same behavior, draws the bounding box fine, but as soon as I select, out comes the smoke...

ilpincy commented 7 years ago

It might be my way of managing selection - maybe getting in and out of selection mode I confuse the driver. However, the code is pretty straightforward and I followed an existing example almost verbatim...

ilpincy commented 7 years ago

Can you try playing with the code here: https://github.com/ilpincy/argos3/blob/master/src/plugins/simulator/visualizations/qt-opengl/qtopengl_widget.cpp#L364 ?

allsey87 commented 7 years ago

I'll have a look, just rebuilding with debugging symbols on... What is really interesting is I completely disabled both calls to the CQTOpenGLOperationDrawX entity operations. Same issue when I select, which suggests that the segfault occurring at that point was a coincidence and the fault actually came from a different thread? (I'm a bit (read completely) inexperienced with debugging multi-threaded programs in gdb).

Funny you should point out that code, that is exactly what I had my eye set on to play with next.

ilpincy commented 7 years ago

If you look at the backtrace, the segfault happens when ARGoS draws a robot with the normal model while in the method SelectInScene(). So the issue is really rendering in SELECT mode... is it possible that the driver does not support drawing something in that mode?
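
For anyone following along who has not used legacy selection mode: between glRenderMode(GL_SELECT) and the return to glRenderMode(GL_RENDER), drawing produces no pixels. Instead, hit records are written to the array passed to glSelectBuffer, and the second glRenderMode call returns the number of records (or a negative value if the buffer overflowed). Each record holds the number of names on the name stack, the minimum and maximum depth scaled to the range of an unsigned 32-bit integer, and then the names themselves. Below is a minimal sketch of decoding such a buffer. SHit and NearestHit are hypothetical names for illustration, not ARGoS code, and the sketch makes no OpenGL calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

/* Hypothetical helper type, not part of ARGoS: the closest hit in a GL_SELECT buffer. */
struct SHit {
   bool Found = false;
   std::uint32_t MinDepth = 0;
   std::vector<std::uint32_t> Names;
};

/* Walks n_hits hit records laid out as
 *    [name count, min depth, max depth, name 0, name 1, ...]
 * and returns the record with the smallest minimum depth.
 * A negative n_hits (buffer overflow) or a truncated buffer yields no hit. */
SHit NearestHit(const std::vector<std::uint32_t>& vec_buffer, int n_hits) {
   SHit sBest;
   std::size_t unPos = 0;
   for(int i = 0; i < n_hits; ++i) {
      /* Each record needs at least the three header words */
      if(unPos + 3 > vec_buffer.size()) break;
      std::uint32_t unNumNames = vec_buffer[unPos];
      std::uint32_t unMinDepth = vec_buffer[unPos + 1];
      /* vec_buffer[unPos + 2] is the maximum depth, unused here */
      std::size_t unNamesStart = unPos + 3;
      if(unNamesStart + unNumNames > vec_buffer.size()) break;
      if(!sBest.Found || unMinDepth < sBest.MinDepth) {
         sBest.Found = true;
         sBest.MinDepth = unMinDepth;
         sBest.Names.assign(vec_buffer.begin() + unNamesStart,
                            vec_buffer.begin() + unNamesStart + unNumNames);
      }
      unPos = unNamesStart + unNumNames;
   }
   return sBest;
}
```

With a buffer of {1, 500, 600, 7, 2, 100, 200, 3, 4} and two hits, the second record wins because its minimum depth (100) is smaller, so the returned names are 3 and 4.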

beltrame commented 7 years ago

@allsey87 Yes, I had the problem before Qt5, and on two different computers/distributions (Linux Mint and OpenSuSE, two flavours of Intel Graphics).

allsey87 commented 7 years ago

Ok, so by removing the calls to makeCurrent and doneCurrent in the SelectInScene method I am able to select and move objects around without segfaults. However, it is quite difficult to select things. It seems as if either (i) only some faces of the box in my example are selectable or (ii) there is a disagreement between what is drawn in the select buffer and what is drawn on screen.

changes.txt shows a diff from the latest master, as you can see I haven't changed much.

allsey87 commented 7 years ago

For the sake of keeping a record of the testing: as @ilpincy suggested, this segfault occurs while drawing in select mode with the calls to makeCurrent and doneCurrent enabled. However, it isn't tied to the drawing of a specific primitive; both GL_QUADS and GL_POINTS will fail at some point towards the end of a drawing function. The exact point at which the segfault occurs is difficult to locate, as it seems to change depending on what is in the drawing function.

ilpincy commented 7 years ago

Great! The problem is that removing makeCurrent() and doneCurrent() does not work on Mac (the window becomes corrupted) nor in my Ubuntu 16.04 VirtualBox VM (same issue, but maybe it's because I run on Mac). I can solve it with conditional compilation, but I'd like to understand the issue better before proceeding.

allsey87 commented 7 years ago

Testing this on a new laptop with both NVIDIA and Intel graphics. It seems that both graphics drivers are loaded (?); however, OpenGL is using the Intel graphics driver. This conclusion is based on the output of glxinfo | grep -i vendor, which returns:

server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
    Vendor: Intel Open Source Technology Center (0x8086)
OpenGL vendor string: Intel Open Source Technology Center

As such, this laptop is also using the i915 driver and segfaults upon attempting to select an object. Removing the makeCurrent() and doneCurrent() function calls stops the segfaults; however, the selection mechanism still feels quite broken. Selecting objects is difficult and can only be done from certain perspectives or by clicking on a subset of pixels that represent only part of the object. I observed similar behavior on my other laptop.

allsey87 commented 7 years ago

@ilpincy I can also now install a debug driver on one of my laptops (the one that only has Intel graphics) and get more information. I have found that the following package is available for my system, however, I can't find much in the way of documentation regarding how to load and use it.

X.Org X server -- Intel i8xx, i9xx display driver (debug symbols)

This driver provides support for the Intel i8xx and i9xx family of chipsets, including i810, i815, i830, i845, i855, i865, i915, and i945 series chips.

This package provides debugging symbols for this Xorg X driver.

Thoughts? Perhaps I need to create the file /usr/share/X11/xorg.conf.d/20-intel.conf as described here and set the driver field to something like intel-dbg. In the example, the string intel is used although the driver name as reported by lshw and lsmod is i915

ilpincy commented 7 years ago

@allsey87 Thanks for offering your help with this issue. I think that adding debugging symbols would help shed some light. We could send a nice bug report to the driver developers once we understand what goes wrong.

I tried removing makeCurrent() ... doneCurrent() from the code, but it corrupts the graphics window on every computer I tried.

allsey87 commented 7 years ago

Out of interest, what do you observe when the graphics window is corrupted?

ilpincy commented 7 years ago

Half of the screen is black or covered by a random pattern.

cjcormier commented 7 years ago

@allsey87 I have this issue on my laptop as well, but when I select the robots from the Buzz Debugger it does not crash.

I have a test repository that you can use to replicate the process yourself.

Crash has the relevant tests. It causes a buzz runtime error so that the robots are selectable from the buzz debugger.

allsey87 commented 7 years ago

@ilpincy like this?

[Image: screenshot from 2017-10-06 10-35-40]

This is with the makeCurrent() and doneCurrent() commented out. Somehow I only saw this for the first time today. The only major difference I can think of was that I was using the foot-bot and the dynamics2d engine for a simple demo...

allsey87 commented 7 years ago

I will make a note here that this black pattern only appears when I go to select something. Furthermore, I can select objects in the simulation, however, everything appears to be offset by a constant value. That is, I can find X,Y coordinates on the screen that correspond to where the object is drawn with respect to the selection buffer. It seems that the selection buffer and drawing buffer are just misaligned...

allsey87 commented 6 years ago

I'm going to allocate some time to solving this bug once and for all. I will document the steps that I have taken and would really appreciate any feedback or comments.

I am starting with a clean / up to date version ARGoS without any of my extensions. My first step was to get messages out of OpenGL to learn more about the nature of the segfault. My approach was to create and connect an instance of QOpenGLDebugLogger to the ARGoS Log. I did this by placing the following code at the bottom of void CQTOpenGLWidget::initializeGL()

if(m_pcOpenGLLogger == nullptr) {
   m_pcOpenGLLogger = new QOpenGLDebugLogger(this);
   if(!m_pcOpenGLLogger->initialize()) {
      LOGERR << "Could not initialize QOpenGLDebugLogger" << std::endl;
      delete m_pcOpenGLLogger;
      m_pcOpenGLLogger = nullptr;
   }
   else {
      LOGERR << "Initialized QOpenGLDebugLogger" << std::endl;
      connect(m_pcOpenGLLogger,
              &QOpenGLDebugLogger::messageLogged,
              [=](const QOpenGLDebugMessage& c_message) {
                  if(c_message.severity() == QOpenGLDebugMessage::HighSeverity) {
                     LOGERR << "[WARNING] " + c_message.message().toStdString() << std::endl;
                  }
                  else {
                     LOG << "[INFO] " + c_message.message().toStdString() << std::endl;
                  }
              }
      );
      m_pcOpenGLLogger->startLogging(QOpenGLDebugLogger::SynchronousLogging);
   }
}

This initialized correctly and output messages to the ARGoS log. Note that I disabled redirecting the logs to the GUI by commenting out the lines m_pcLogStream = new CQTOpenGLLogStream(LOG.GetStream(), m_pcDockLogBuffer); and m_pcLogErrStream = new CQTOpenGLLogStream(LOGERR.GetStream(), m_pcDockLogErrBuffer); in qtopengl_main_window.cpp. Unfortunately, the output is quite boring and doesn't seem to show anything of interest. When I select an entity, there is no output prior to the segfault, as shown in QOpenGLDebugLogger.txt.

My second attempt was using a debugging tool called RenderDoc, this tool seems very powerful and straightforward to use. I was able to start ARGoS, however, no valid API was detected. This was because RenderDoc only supports OpenGL 3.2+ core profile and it seems that although I have OpenGL 4.5 available on my machine, the current configuration of OpenGL, Qt, and ARGoS uses 3.0, or at least that is what is reported by the following code:

GLint major; GLint minor;
glGetIntegerv(GL_MAJOR_VERSION, &major);
glGetIntegerv(GL_MINOR_VERSION, &minor);
LOG << "OpenGL version " << major << "." << minor << std::endl;

Note that the output of glxinfo | grep OpenGL on my machine is as follows:

OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) Haswell Mobile
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.0.7
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 17.0.7
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.1 Mesa 17.0.7
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10
OpenGL ES profile extensions:
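
The glxinfo output lists two desktop contexts: a core profile context at 4.5 and a default context at 3.0, and the 3.0 figure matches what the code above reports. To make the comparison against RenderDoc's 3.2 requirement explicit, here is a small sketch that parses the leading "major.minor" from a version string and checks it against a minimum. ParseGLVersion and AtLeast are hypothetical helper names, not part of ARGoS, Qt, or OpenGL.

```cpp
#include <cstdio>
#include <string>

/* Hypothetical helper: reads the leading "major.minor" from a version string
 * such as "3.0 Mesa 17.0.7". Returns false if the string does not start with
 * two integers separated by a dot. */
bool ParseGLVersion(const std::string& str_version, int& n_major, int& n_minor) {
   return std::sscanf(str_version.c_str(), "%d.%d", &n_major, &n_minor) == 2;
}

/* Hypothetical helper: true if (n_major, n_minor) is at least (n_req_major, n_req_minor). */
bool AtLeast(int n_major, int n_minor, int n_req_major, int n_req_minor) {
   return n_major > n_req_major ||
          (n_major == n_req_major && n_minor >= n_req_minor);
}
```

Fed "3.0 Mesa 17.0.7", the check against 3.2 fails, while "4.5 (Core Profile) Mesa 17.0.7" passes. Note that the version number alone is not sufficient here, since the tool also needs a core profile.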

This aside, I was able to force OpenGL, Qt, and ARGoS to use version 4.5 as reported by the same code that originally reported 3.0 by adding the following lines of code before the line m_pcMainWindow = new CQTOpenGLMainWindow(m_tConfTree); in qtopengl_render.cpp

QSurfaceFormat format;
format.setDepthBufferSize(24);
format.setStencilBufferSize(8);
format.setVersion(3, 2);
format.setProfile(QSurfaceFormat::CoreProfile);
QSurfaceFormat::setDefaultFormat(format);

This recompiles fine and now connects properly to RenderDoc; however, the QtOpenGLWidget just displays the background color and nothing else. So at this point, I have a couple of questions:

  1. Which version and profile is the QtOpenGL plugin designed to use?
  2. Is the QtOpenGL plugin not compatible with OpenGL 3.2+, e.g. are we using some of the deprecated APIs?

Any thoughts @ilpincy?

ilpincy commented 6 years ago

Before giving up on the QOpenGLDebugLogger, I would try to use printf() rather than LOG. LOG is buffered per thread and ARGoS explicitly has to call LOG.Flush() to print anything, so it is normal that you get no output prior to the crash. If you use printf() instead and make sure to add a newline after each message, you'll see the messages for sure.

As for the second question, I don't know exactly what to say, since it's Qt code.

allsey87 commented 6 years ago

Thanks @ilpincy, I will have another go using fprintf(stderr,...) to output the errors, since I think printf also has buffering.
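
To illustrate why buffered log messages vanish when the process crashes: anything still sitting in a buffer at the time of the segfault never reaches the output. The sketch below uses a hypothetical CBufferedLog class as a stand-in for a logger that only writes on an explicit flush. It is not the ARGoS LOG implementation. In C, stderr is not fully buffered by default, whereas stdout may be, which is the reason for preferring fprintf(stderr, ...) here.

```cpp
#include <ostream>
#include <sstream>
#include <string>

/* Hypothetical stand-in for a buffered logger; this is NOT the ARGoS LOG class. */
class CBufferedLog {

public:

   explicit CBufferedLog(std::ostream& c_sink) : m_cSink(c_sink) {}

   /* Stores the message in memory only; nothing reaches the sink yet */
   void Write(const std::string& str_msg) {
      m_cBuffer << str_msg << '\n';
   }

   /* Pushes everything stored so far to the sink and empties the buffer */
   void Flush() {
      m_cSink << m_cBuffer.str();
      m_cSink.flush();
      m_cBuffer.str("");
   }

private:

   std::ostream& m_cSink;
   std::ostringstream m_cBuffer;
};
```

If the program dies between Write() and Flush(), the sink stays empty, which matches seeing no output prior to the segfault.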

allsey87 commented 6 years ago

Since RenderDoc wasn't working with the older version of OpenGL, and since the OpenGL widget didn't render when I requested Qt to use a later version, I attempted to use another OpenGL debugging tool called BuGLe. This tool hasn't been developed for a while, but I was able to get it working with a few tweaks to the source code.

There were two filter sets that I tried, the first created a log that traced all the OpenGL calls. I started ARGoS, attempted to select an e-puck, got a segfault, and a 172MB log file was produced! I had a look at this file using combinations of grep -C 10 -ni to search strings such as err or warn and found nothing of interest. Tail also didn't show anything of interest at the end of the log file. If anyone wants this file, let me know and I will put up on dropbox. The file is basically just a trace of all the OpenGL functions called e.g.

940-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 227) = "GL_SGIS_texture_border_clamp"
941-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 228) = "GL_SGIS_texture_edge_clamp"
942-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 229) = "GL_SGIS_texture_lod"
943-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 230) = "GL_SUN_multi_draw_arrays"
944-[INFO] trace.call: glGetBooleanv(GL_FRAMEBUFFER_SRGB_CAPABLE_EXT, 0x7ffc07ff4ca0 -> GL_FALSE)
945:[INFO] trace.call: glGetError() = GL_NO_ERROR
946-[INFO] trace.call: glXGetProcAddressARB("glGetStringi") = 0x7fd2e37e1bc0
947-[INFO] trace.call: glGetIntegerv(GL_NUM_EXTENSIONS, 0x7ffc07ff4b48 -> 231)
948-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 0) = "GL_3DFX_texture_compression_FXT1"
949-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 1) = "GL_AMD_conservative_depth"
950-[INFO] trace.call: glGetStringi(GL_EXTENSIONS, 2) = "GL_AMD_draw_buffers_blend"
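
One caveat with a case-insensitive search for err in a trace like this is that every glGetError() line matches, including those that returned GL_NO_ERROR. A narrower filter keeps only the glGetError() calls whose result is something else. The sketch below assumes the line layout shown in the excerpt above, and assumes that a failing call would be logged with the error's enum name in place of GL_NO_ERROR. FindGLErrors is a hypothetical helper, not part of BuGLe.

```cpp
#include <cstddef>
#include <string>
#include <vector>

/* Hypothetical helper: returns the indices of trace lines in which
 * glGetError() returned something other than GL_NO_ERROR.
 * Assumes lines of the form "... glGetError() = <RESULT>". */
std::vector<std::size_t> FindGLErrors(const std::vector<std::string>& vec_lines) {
   static const std::string strMarker = "glGetError() = ";
   static const std::string strOk = "GL_NO_ERROR";
   std::vector<std::size_t> vecResult;
   for(std::size_t i = 0; i < vec_lines.size(); ++i) {
      std::size_t unPos = vec_lines[i].find(strMarker);
      /* Skip lines that are not glGetError() calls */
      if(unPos == std::string::npos) continue;
      /* Compare the text right after the marker with GL_NO_ERROR */
      if(vec_lines[i].compare(unPos + strMarker.size(), strOk.size(), strOk) != 0) {
         vecResult.push_back(i);
      }
   }
   return vecResult;
}
```

This only surfaces errors that the application (or Qt) actually queried with glGetError(); an error state that is never queried would not show up this way.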

The other method involves using BuGLe with gdb. The idea is to recover backtraces from segmentation faults inside the driver, even if the driver is compiled without symbols. I ran this a couple times and got the following backtraces: 1, 2, 3, 4. Again, nothing too interesting here I think, other than the crash seems to point to either glCallList() from argos::CQTOpenGLEPuck::Draw(argos::CEPuckEntity&) or from glPointSize from argos::CQTOpenGLWidget::DrawRays(argos::CControllableEntity&). On a side note, it seems that drawing the rays from the controllable entities in the selection buffer may be a minor bug.

allsey87 commented 6 years ago

This is the updated code for using QOpenGLDebugLogger. I added this code directly beneath the call to initializeOpenGLFunctions(); in void argos::CQTOpenGLWidget::initializeGL(). Putting this code any earlier causes a Qt assertion to fail.

GLint major; GLint minor;
glGetIntegerv(GL_MAJOR_VERSION, &major);
glGetIntegerv(GL_MINOR_VERSION, &minor);
::fprintf(stderr,"OpenGL version %d.%d\n", major, minor);
if(m_pcOpenGLLogger == nullptr) {
   m_pcOpenGLLogger = new QOpenGLDebugLogger(this);
   if(!m_pcOpenGLLogger->initialize()) {
      ::fprintf(stderr,"Could not initialize QOpenGLDebugLogger\n");
      delete m_pcOpenGLLogger;
      m_pcOpenGLLogger = nullptr;
   }
   else {
      ::fprintf(stderr,"Initialized QOpenGLDebugLogger\n");
      connect(m_pcOpenGLLogger,
              &QOpenGLDebugLogger::messageLogged,
              [=](const QOpenGLDebugMessage& c_message) {
                 ::fprintf(stderr,"%s\n",c_message.message().toStdString().c_str());
              }
      );
      m_pcOpenGLLogger->startLogging(QOpenGLDebugLogger::SynchronousLogging);
   }
}

I then proceeded to recompile, run ARGoS and select an epuck resulting in the segfault. A couple of additional lines appeared after I attempted to select and are shown in output.txt. The four messages and the segfault message after the blank line occurred following the segfault.

allsey87 commented 6 years ago

Moving forwards, I think the next step is to install the debugging symbols for the graphics driver so that I can get a closer look at the source of the segfault.

Thread 1 "argos3" received signal SIGSEGV, Segmentation fault.
0x00007fffc6886c07 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so

allsey87@ThinkPad-T540p:~/Workspace/argos4$ ll /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
-rw-r--r-- 5 root root 7405120 Jun 8 09:54 /usr/lib/x86_64-linux-gnu/dri/i965_dri.so

dpkg shows the following:

allsey87@ThinkPad-T540p:~/Workspace/argos4$ dpkg -S /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
libgl1-mesa-dri:amd64: /usr/lib/x86_64-linux-gnu/dri/i965_dri.so

However, I am unable to continue at this point as I am not sure how to find the corresponding debug symbols package for libgl1-mesa-dri:amd64. Any suggestions @ilpincy?

ilpincy commented 6 years ago

Thanks for all the work! You basically got to the same point I got stuck, too. :(

I think Ubuntu should have a debug version of the mesa package, which installs the symbols you need to debug. Another possibility is to recompile the driver with debugging symbols on. Since it is a kernel driver, I haven't dared to venture beyond this point.

Having tried the code on several NVIDIA and Intel cards, and across platforms, I really think the Intel driver has a bug. If not, something bad would happen on other cards too. Not a crash, maybe, but at least some sort of error state. When I tried on other computers, though, I never found anything.

allsey87 commented 6 years ago

The main issue I have with your conclusion that the Intel driver has a bug is that the problem disappears with Qt4. Just moments ago I confirmed this by installing Qt4 on my laptop (which removed Qt5) and building ARGoS based on commit https://github.com/ilpincy/argos3/commit/627ce753ee74d85bf77d16cd4fbd049dde32bb5f, the parent of https://github.com/ilpincy/argos3/commit/cc658433552314678513efd9e77e5989b85a66a6 where you added Qt5 for the first time. In this version based on Qt4, I am able to select as many epucks as I want!

I am going to have a go at building the Mesa driver now using the instructions over at 01.org. Perhaps this will get us closer to the source. I suspect that the QOpenGLWidget in Qt5 is using some extension that is not supported by the Mesa implementation.

beltrame commented 6 years ago

Michael Allwright writes:

https://github.com/ilpincy/argos3/commit/cc658433552314678513efd9e77e5989b85a66a6 where you added Qt5 for the first time. In this version based on Qt4, I am able to select as many epucks as I want!

I made a test and I confirm it works without a hitch with this version. So, something happened with Qt5?

Regards,

Giovanni Beltrame, PhD, ing.
MIST Lab - mistlab.ca
Ecole Polytechnique de Montreal
Visiting Professor - University of Tübingen

allsey87 commented 6 years ago

Some further results from testing after reinstalling Qt5:

  1. https://github.com/ilpincy/argos3/commit/cc658433552314678513efd9e77e5989b85a66a6 segfaults at launch, however works inside GDB - although I am not able to select any epucks (note that for these tests in the last couple days I am always using the test_epuck_lua.argos experiment)
  2. https://github.com/ilpincy/argos3/commit/48285750db4fe3c8e4b98d928136f242b2833877 is similar to https://github.com/ilpincy/argos3/commit/cc658433552314678513efd9e77e5989b85a66a6 in that a standard launch segfaults, however, launching inside GDB works. Although this time when I try to select an epuck I get the infamous segfault from inside i965_dri.so.

allsey87 commented 6 years ago

Steps taken to get a copy of i965_dri.so with debug symbols:

  1. i965_dri.so is part of Mesa, the 3D drivers for the Intel Graphics Stack
  2. glxinfo | grep Mesa reports that I am using Mesa 17.0.7 on my system, so I am going to download the same version from ftp://ftp.freedesktop.org/pub/mesa/mesa-17.0.7.tar.gz
  3. I built the package using these commands. The configure script failed a couple of times on missing packages, which I eventually installed, after which it was ready to go.
    ./configure --prefix=/usr --enable-driglx-direct --enable-gles1 --enable-gles2 --enable-glx-tls --with-dri-driverdir=/usr/lib/dri --with-egl-platforms='drm x11' --with-dri-drivers=i965 --without-gallium-drivers --enable-debug
    make
  4. This successfully built a new, much bigger, i965_dri.so which I assume contains debugging symbols. I have backed up my existing i965_dri.so and swapped in the new version. Time for a reboot, possibly followed by booting linux from a recovery USB :D

allsey87 commented 6 years ago

And it worked! Not only did my laptop boot again, it also has given me a nice detailed backtrace from the segfault after selecting an e-puck in ARGoS. Here are the backtraces from a couple different runs:

backtrace.1.txt, backtrace.2.txt, backtrace.3.txt, backtrace.4.txt, backtrace.5.txt.

Looking closer at the code at intel_mipmap_tree.c:2425 we have:

if (src->stencil_mt) {
   brw_blorp_blit_miptrees(brw,
                           src->stencil_mt, 0 /* level */, 0 /* layer */,
                           src->stencil_mt->format, SWIZZLE_XYZW,
                           dst->stencil_mt, 0 /* level */, 0 /* layer */,
                           dst->stencil_mt->format,
                           0, 0,
                           src->logical_width0, src->logical_height0,
                           0, 0,
                           dst->logical_width0, dst->logical_height0,
                           GL_NEAREST, false, false /*mirror x, y*/,
                           false, false /* decode/encode srgb */);
}

The issue is that dst->stencil_mt is NULL for some reason, so trying to dereference it and access the format field is the source of our segfault. I inserted a breakpoint at intel_mipmap_tree.c:2425 and can confirm that this code path is only executed when trying to select something.

ilpincy commented 6 years ago

That is fantastic work Michael! Thank you so much!

allsey87 commented 6 years ago

No problem! Let's get this bug solved!

Focusing on what happens after entering SelectInScene: the main difference between the old version of ARGoS based on Qt4 (https://github.com/ilpincy/argos3/commit/627ce753ee74d85bf77d16cd4fbd049dde32bb5f) and the newer version of ARGoS based on Qt5, with respect to the driver, is when we select something we end up in the following block of code:

void
intel_miptree_updownsample(struct brw_context *brw,
                           struct intel_mipmap_tree *src,
                           struct intel_mipmap_tree *dst)
{
   brw_blorp_blit_miptrees(brw,
                           src, 0 /* level */, 0 /* layer */,
                           src->format, SWIZZLE_XYZW,
                           dst, 0 /* level */, 0 /* layer */, dst->format,
                           0, 0,
                           src->logical_width0, src->logical_height0,
                           0, 0,
                           dst->logical_width0, dst->logical_height0,
                           GL_NEAREST, false, false /*mirror x, y*/,
                           false, false);

   if (src->stencil_mt) {
      brw_blorp_blit_miptrees(brw,
                              src->stencil_mt, 0 /* level */, 0 /* layer */,
                              src->stencil_mt->format, SWIZZLE_XYZW,
                              dst->stencil_mt, 0 /* level */, 0 /* layer */,
                              dst->stencil_mt->format,
                              0, 0,
                              src->logical_width0, src->logical_height0,
                              0, 0,
                              dst->logical_width0, dst->logical_height0,
                              GL_NEAREST, false, false /*mirror x, y*/,
                              false, false /* decode/encode srgb */);
   }
}

The difference between the new and old versions of ARGoS / Qt is that for the older version both src->stencil_mt and dst->stencil_mt were NULL, so the second call to brw_blorp_blit_miptrees was never made due to the if statement and dst->stencil_mt was never dereferenced...
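
To spell out the three cases, here is a heavily simplified sketch of the control flow. SMipTree and CountBlits are hypothetical stand-ins that keep only the fields relevant to the crash. This is not Mesa code and not a proposed patch. It only shows that the driver's check looks at the source alone, so a source with a stencil tree paired with a destination without one reaches the dereference.

```cpp
/* Hypothetical, simplified stand-in for the Mesa structure; only the
 * fields relevant to the crash are kept. */
struct SMipTree {
   int Format = 0;
   SMipTree* StencilMT = nullptr;
};

/* Counts how many blits would be issued if the stencil blit were guarded
 * on BOTH trees instead of only on the source. */
int CountBlits(const SMipTree& s_src, const SMipTree& s_dst) {
   /* The first (color/depth) blit always happens */
   int nBlits = 1;
   /* Guarding on the destination as well avoids touching a null
    * s_dst.StencilMT, which is the dereference that crashes */
   if(s_src.StencilMT != nullptr && s_dst.StencilMT != nullptr) {
      ++nBlits;
   }
   return nBlits;
}
```

With both stencil trees null (the Qt4 situation) one blit is issued. With only the source stencil tree set (the Qt5 situation) the guard skips the second blit rather than dereferencing a null pointer. Whether skipping is the right behavior for the driver, or whether the destination should have had a stencil tree in the first place, is a separate question.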

allsey87 commented 6 years ago

@ilpincy, I think at this point it would be useful to have a discussion about how you have configured the QOpenGLWidget for ARGoS. In particular, I would like to know which version and profile of OpenGL are we targeting with ARGoS. When I try to configure ARGoS to use OpenGL 4.3 / core profile, I get numerous warnings/errors about unsupported extensions, deprecated API etc:

void CQTOpenGLMainWindow::CreateOpenGLWidget(TConfigurationNode& t_tree) {
   /* Create the surface format */
   QSurfaceFormat cFormat = QSurfaceFormat::defaultFormat();     
   cFormat.setDepthBufferSize(24);
   cFormat.setMajorVersion(4);
   cFormat.setMinorVersion(3);
   cFormat.setSamples(4);
   cFormat.setProfile(QSurfaceFormat::CoreProfile);
   /* Create the widget */
   QWidget* pcPlaceHolder = new QWidget(this);
   m_pcOpenGLWidget = new CQTOpenGLWidget(pcPlaceHolder, *this, *m_pcUserFunctions);
   m_pcOpenGLWidget->setFormat(cFormat);
   ...
   }
}

Mesa: 6670 similar GL_INVALID_OPERATION errors
Mesa: User error: GL_INVALID_ENUM in glDisable(GL_LIGHTING)
GL_INVALID_ENUM in glDisable(GL_LIGHTING)
Mesa: User error: GL_INVALID_OPERATION in unsupported function called (unsupported extension or deprecated function?)
GL_INVALID_OPERATION in unsupported function called (unsupported extension or deprecated function?)
GL_INVALID_OPERATION in unsupported function called (unsupported extension or deprecated function?)
GL_INVALID_OPERATION in unsupported function called (unsupported extension or deprecated function?)
GL_INVALID_OPERATION in unsupported function called (unsupported extension or deprecated function?)
GL_INVALID_OPERATION in unsupported function called (unsupported extension or deprecated function?)
Mesa: 4 similar GL_INVALID_OPERATION errors
Mesa: User error: GL_INVALID_ENUM in glEnable(GL_LIGHTING)

So at the moment, I suspect this issue is due to a difference in the OpenGL versions and profiles that are being used by different drivers, since in the code we are not explicitly specifying what we want to use¹. The fact that this works with NVIDIA drivers² or under OS X, could be that the default version / profile of OpenGL and the default flags and values in QSurfaceFormat as returned/set by the driver just happen to work and that this is not sufficient for the Intel drivers / Mesa implementation of OpenGL.

I think solving this bug could be as simple as explicitly specifying the correct profile / version of OpenGL in the code.

¹ This idea is based on this post: https://forum.qt.io/post/229442
² Out of interest, when you say this works with NVIDIA graphics cards, are you referring to using the open source Nouveau driver (which uses Mesa) or the official closed source driver from NVIDIA?

ilpincy commented 6 years ago

I think you might be on to something! When I say that it works, I mean that

I don't have a specific version of OpenGL in mind. Ideally, it would be the version that corresponds to the minimum code editing. :-)

As for meeting: I am currently travelling, and will be back on November 16th. Maybe we can setup a meeting on Thursday at 10am EST (4pm in Brussels)?

Thanks again for all the work you're doing. As I can't reproduce this bug on my own, what you're doing is truly valuable.

allsey87 commented 6 years ago

As for meeting: I am currently travelling, and will be back on November 16th. Maybe we can setup a meeting on Thursday at 10am EST (4pm in Brussels)?

This works for me.

I don't have a specific version of OpenGL in mind. Ideally, it would be the version that corresponds to the minimum code editing. :-)

I'll look into which API calls are resulting in the GL_INVALID_OPERATION messages. I think we should aim to support OpenGL 4.5 since it was released in 2014 and should have good support assuming reasonably up to date graphics drivers.

allsey87 commented 6 years ago

@ilpincy these are a couple articles I have been looking at. I think although the qtopengl visualisation is partially working, I have a hunch (unless I am missing something) that the way OpenGL is being initialized at the moment is flawed or at least only valid on OS X.

  1. Getting Started
  2. Load OpenGL Functions
  3. OpenGL in Qt 5.1

In fact, there is a strong recommendation in the qt docs that says we shouldn't do, what it seems like we are doing:

When making OpenGL function calls, it is strongly recommended to avoid calling the functions directly. Instead, prefer using QOpenGLFunctions (when making portable applications) or the versioned variants (for example, QOpenGLFunctions_3_2_Core and similar, when targeting modern, desktop-only OpenGL). This way the application will work correctly in all Qt build configurations, including the ones that perform dynamic OpenGL implementation loading which means applications are not directly linking to an GL implementation and thus direct function calls are not feasible.

allsey87 commented 6 years ago

Actually, searching on this page for functions like glPushMatrix, glMaterialfv, etc., it looks like we will definitely have to use one of the compatibility contexts of OpenGL if we want to avoid completely rewriting the visualization plugin.

EDIT: After further reading and considering the implementation and API constraints, we basically can only support OpenGL 2.1
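
As a rough rule of thumb for which contexts still offer the fixed-function calls the plugin relies on (glPushMatrix, glMaterialfv, GL_LIGHTING and so on): they are present up to OpenGL 3.0 (deprecated there but not removed), removed in 3.1 unless the GL_ARB_compatibility extension is exposed, and from 3.2 onwards available only in a compatibility profile. The sketch below encodes that rule. EProfile and HasFixedFunction are hypothetical names, the 3.1 case is conservatively treated as unavailable, and forward-compatible contexts are ignored.

```cpp
/* Hypothetical enum, named after the profiles that Qt's QSurfaceFormat offers */
enum class EProfile { NO_PROFILE, CORE, COMPATIBILITY };

/* Hypothetical helper: true if a context of the given version and profile
 * is expected to provide the legacy fixed-function pipeline. */
bool HasFixedFunction(int n_major, int n_minor, EProfile e_profile) {
   /* Up to 3.0 the legacy calls are still part of the API */
   if(n_major < 3 || (n_major == 3 && n_minor == 0)) return true;
   /* 3.1 removed them unless GL_ARB_compatibility is exposed;
    * assume it is not (conservative choice) */
   if(n_major == 3 && n_minor == 1) return false;
   /* 3.2 and later: only the compatibility profile keeps them */
   return e_profile == EProfile::COMPATIBILITY;
}
```

This is consistent with the GL_INVALID_ENUM errors for GL_LIGHTING reported above when a 4.3 core profile was requested.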

allsey87 commented 6 years ago

@ilpincy using a really nice program called apitrace, I was able to capture the exact OpenGL state at the call to glSelectBuffer, right before glRenderMode(GL_SELECT), on both the Qt4 and Qt5 versions of ARGoS. This is the output of diffing the two states with the command apitrace diff-state argos3-qt4-2568.json argos3-qt5-5176.json (given the argument order, the value left of each arrow should be the Qt4 state and the value on the right the Qt5 state):

{
  framebuffer: {
  },
  parameters: {
    GL_BACK: {
      GL_SHININESS: 100 -> 0,
    },
    GL_COLOR_CLEAR_VALUE: [
      0,
      0.5019608 -> 0.5,
      0.5019608 -> 0.5,
      1
    ],
    GL_CURRENT_COLOR: [
      0 -> 1,
      0,
      0,
      1
    ],
    GL_DOUBLEBUFFER: "GL_TRUE" -> "GL_FALSE",
    GL_DRAW_BUFFER: "GL_BACK" -> "GL_COLOR_ATTACHMENT0",
    GL_DRAW_BUFFER0: "GL_BACK" -> "GL_COLOR_ATTACHMENT0",
    GL_DRAW_FRAMEBUFFER: null -> {
      GL_COLOR_ATTACHMENT0: {
        GL_FRAMEBUFFER_ATTACHMENT_ALPHA_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_COLOR_ENCODING: "GL_LINEAR",
        GL_FRAMEBUFFER_ATTACHMENT_COMPONENT_TYPE: "GL_UNSIGNED_NORMALIZED",
        GL_FRAMEBUFFER_ATTACHMENT_DEPTH_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_NAME: 1,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_TYPE: "GL_RENDERBUFFER",
        GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_STENCIL_SIZE: 0
      },
      GL_DEPTH_ATTACHMENT: {
        GL_FRAMEBUFFER_ATTACHMENT_ALPHA_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_COLOR_ENCODING: "GL_LINEAR",
        GL_FRAMEBUFFER_ATTACHMENT_COMPONENT_TYPE: "GL_UNSIGNED_NORMALIZED",
        GL_FRAMEBUFFER_ATTACHMENT_DEPTH_SIZE: 24,
        GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_NAME: 2,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_TYPE: "GL_RENDERBUFFER",
        GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_STENCIL_SIZE: 8
      },
      GL_STENCIL_ATTACHMENT: {
        GL_FRAMEBUFFER_ATTACHMENT_ALPHA_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_COLOR_ENCODING: "GL_LINEAR",
        GL_FRAMEBUFFER_ATTACHMENT_COMPONENT_TYPE: "GL_UNSIGNED_NORMALIZED",
        GL_FRAMEBUFFER_ATTACHMENT_DEPTH_SIZE: 24,
        GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_NAME: 2,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_TYPE: "GL_RENDERBUFFER",
        GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_STENCIL_SIZE: 8
      }
    },
    GL_DRAW_FRAMEBUFFER_BINDING: 0 -> 1,
    GL_FRAMEBUFFER_SRGB_CAPABLE_EXT: "GL_TRUE" -> "GL_FALSE",
    GL_FRONT: {
      GL_SHININESS: 100 -> 0,
    },
    GL_GENERATE_MIPMAP_HINT: "GL_NICEST" -> "GL_DONT_CARE",
    GL_LIGHT0: {
      GL_AMBIENT: [
        0.1 -> 0.2,
        0.1 -> 0.2,
        0.1 -> 0.2,
        1
      ],
      GL_DIFFUSE: [
        0.6 -> 0.8,
        0.6 -> 0.8,
        0.6 -> 0.8,
        1
      ],
      GL_POSITION: [
        49.00039 -> 50,
        13.95015 -> 50,
        -49.57487 -> 2,
        1
      ],
    },
    GL_LIGHT1: {
      GL_AMBIENT: [
        0.1,
        0.1,
        0.1,
        1
      ],
      GL_CONSTANT_ATTENUATION: 1,
      GL_DIFFUSE: [
        0.6,
        0.6,
        0.6,
        1
      ],
      GL_LINEAR_ATTENUATION: 0,
      GL_POSITION: [
        -49.01999,
        -10.31277,
        49.43683,
        1
      ],
      GL_QUADRATIC_ATTENUATION: 0,
      GL_SPECULAR: [
        0,
        0,
        0,
        1
      ],
      GL_SPOT_CUTOFF: 180,
      GL_SPOT_DIRECTION: [
        0,
        0,
        -1
      ],
      GL_SPOT_EXPONENT: 0
    } -> null,
    GL_READ_BUFFER: "GL_BACK" -> "GL_COLOR_ATTACHMENT0",
    GL_READ_FRAMEBUFFER: null -> {
      GL_COLOR_ATTACHMENT0: {
        GL_FRAMEBUFFER_ATTACHMENT_ALPHA_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_COLOR_ENCODING: "GL_LINEAR",
        GL_FRAMEBUFFER_ATTACHMENT_COMPONENT_TYPE: "GL_UNSIGNED_NORMALIZED",
        GL_FRAMEBUFFER_ATTACHMENT_DEPTH_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_NAME: 1,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_TYPE: "GL_RENDERBUFFER",
        GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE: 8,
        GL_FRAMEBUFFER_ATTACHMENT_STENCIL_SIZE: 0
      },
      GL_DEPTH_ATTACHMENT: {
        GL_FRAMEBUFFER_ATTACHMENT_ALPHA_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_COLOR_ENCODING: "GL_LINEAR",
        GL_FRAMEBUFFER_ATTACHMENT_COMPONENT_TYPE: "GL_UNSIGNED_NORMALIZED",
        GL_FRAMEBUFFER_ATTACHMENT_DEPTH_SIZE: 24,
        GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_NAME: 2,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_TYPE: "GL_RENDERBUFFER",
        GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_STENCIL_SIZE: 8
      },
      GL_STENCIL_ATTACHMENT: {
        GL_FRAMEBUFFER_ATTACHMENT_ALPHA_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_COLOR_ENCODING: "GL_LINEAR",
        GL_FRAMEBUFFER_ATTACHMENT_COMPONENT_TYPE: "GL_UNSIGNED_NORMALIZED",
        GL_FRAMEBUFFER_ATTACHMENT_DEPTH_SIZE: 24,
        GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_NAME: 2,
        GL_FRAMEBUFFER_ATTACHMENT_OBJECT_TYPE: "GL_RENDERBUFFER",
        GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE: 0,
        GL_FRAMEBUFFER_ATTACHMENT_STENCIL_SIZE: 8
      }
    },
    GL_READ_FRAMEBUFFER_BINDING: 0 -> 1,
    GL_RENDERBUFFER_BINDING: 0 -> 2,
    GL_SAMPLES: 0 -> 4,
    GL_SAMPLE_BUFFERS: 0 -> 1,
    GL_SCISSOR_BOX: [
      0,
      0,
      320 -> 100,
      240 -> 100
    ],
    GL_SELECTION_BUFFER_POINTER: 94193257724264 -> 94176281598552,
    GL_TEXTURE0: {
      GL_TEXTURE_2D: {
        GL_GENERATE_MIPMAP: "GL_TRUE" -> "GL_FALSE",
        GL_TEXTURE_ALPHA_SIZE: 0 -> 8,
        GL_TEXTURE_ALPHA_TYPE: "GL_ZERO" -> "GL_UNSIGNED_NORMALIZED",
        GL_TEXTURE_IMMUTABLE_FORMAT: "GL_FALSE" -> "GL_TRUE",
        GL_TEXTURE_IMMUTABLE_LEVELS: 0 -> 10,
        GL_TEXTURE_INTERNAL_FORMAT: "GL_RGB" -> "GL_RGBA8",
        GL_TEXTURE_MAX_LEVEL: 1000 -> 9,
        GL_TEXTURE_VIEW_NUM_LAYERS: 0 -> 1,
        GL_TEXTURE_VIEW_NUM_LEVELS: 0 -> 10,
      },
      GL_TEXTURE_BINDING_2D: 145 -> 1,
    },
  },
}
allsey87 commented 6 years ago

So, based on my last post and after digging a bit deeper, I have reached the following conclusions:

  1. Qt5 sets up OpenGL in a way that is quite different from Qt4 and uses different combinations of buffers, attachments etc.
  2. The crash happens because these changes in the buffers are somehow triggering a bug in the Intel driver.
  3. The selection approach used in the QtOpenGL plugin, along with the entire fixed-function pipeline rendering approach, has been deprecated since OpenGL 3.0 and was removed in OpenGL 3.1.

To this end, I will not fix the bug but rather propose an alternative means of selecting entities based on color-picking. I have put a rough draft of this code over in the following repo: https://github.com/allsey87/argos3-selection-proposal - although it is incomplete and there are some minor glitches, I believe this is already working quite well (no segfaults!) and I invite you to test it (using src/testing/experiment/test_selection.argos) and to give me feedback.

In a nutshell, this works by rendering everything without GL_LIGHTING into a QOpenGLFramebufferObject, using a color derived from the entity's index in CSpace::GetRootEntityVector(). The following snapshot shows what the render into the selection framebuffer looks like: [image: out]. For the moment, I've added a new drawing method called CQTOpenGLOperationDrawSilhouette, although it may be possible to work around this using the existing draw method, assuming all entities only use glMaterialfv for normal rendering, since glMaterialfv has no effect when GL_LIGHTING is disabled.
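
The color-picking idea can be sketched in plain C++ as follows. This is a minimal illustration, not the code from the proposal repo: the 24-bit packing of the index into RGB, the reserved background value, and the names TColor, IndexToColor, ColorToIndex and SSelectionBuffer are all assumptions made for the example. SSelectionBuffer is a stand-in for the off-screen framebuffer; in the real plugin the pixel under the mouse would presumably be read back from the framebuffer object instead (for example with glReadPixels).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One RGB pixel with 8 bits per channel.
using TColor = std::array<std::uint8_t, 3>;

// Assumption: pure white is reserved to mean "no entity under the cursor".
constexpr std::uint32_t NO_ENTITY = 0xFFFFFFu;

// Pack an entity index (its position in the root entity vector) into RGB.
inline TColor IndexToColor(std::uint32_t un_index) {
   assert(un_index < NO_ENTITY);
   return TColor{{static_cast<std::uint8_t>((un_index >> 16) & 0xFFu),
                  static_cast<std::uint8_t>((un_index >> 8) & 0xFFu),
                  static_cast<std::uint8_t>(un_index & 0xFFu)}};
}

// Recover the entity index from a pixel read out of the selection buffer.
inline std::uint32_t ColorToIndex(const TColor& c_color) {
   return (static_cast<std::uint32_t>(c_color[0]) << 16) |
          (static_cast<std::uint32_t>(c_color[1]) << 8) |
          static_cast<std::uint32_t>(c_color[2]);
}

// Hypothetical CPU-side stand-in for the off-screen selection framebuffer.
struct SSelectionBuffer {
   std::size_t Width;
   std::size_t Height;
   std::vector<TColor> Pixels; // row-major, initialised to the background

   SSelectionBuffer(std::size_t un_w, std::size_t un_h)
      : Width(un_w), Height(un_h),
        Pixels(un_w * un_h, TColor{{0xFF, 0xFF, 0xFF}}) {}

   // "Render" a flat silhouette of one entity as a filled rectangle.
   void FillRect(std::size_t un_x0, std::size_t un_y0,
                 std::size_t un_w, std::size_t un_h,
                 std::uint32_t un_index) {
      const TColor cColor = IndexToColor(un_index);
      for (std::size_t y = un_y0; y < un_y0 + un_h && y < Height; ++y) {
         for (std::size_t x = un_x0; x < un_x0 + un_w && x < Width; ++x) {
            Pixels[y * Width + x] = cColor;
         }
      }
   }

   // Return the index of the entity at (x, y), or NO_ENTITY.
   std::uint32_t Pick(std::size_t un_x, std::size_t un_y) const {
      if (un_x >= Width || un_y >= Height) return NO_ENTITY;
      return ColorToIndex(Pixels[un_y * Width + un_x]);
   }
};
```

The round trip only works if the colors reach the buffer unmodified, which is why lighting is disabled for this pass; anything else that blends or filters pixels in the selection render would likely have to be switched off as well.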

Some final notes: our visualization plugin is built upon a lot of deprecated functionality and uses the outdated fixed-function pipeline for rendering (which was removed from core OpenGL in 3.1). I think issues like this may continue to appear as vendors test the old versions of OpenGL and the fixed-function pipeline less and less, and focus more on the programmable pipeline. Furthermore, I think that programming directly against OpenGL is too low-level, beyond the scope of ARGoS, and a waste of our time. At some point, I strongly feel we need to move to either a third-party renderer like Horde3D or the more modern approach of having ARGoS pipe everything to a WebGL-based renderer like Babylon.js. The latter would have interesting implications in a cluster-based setup, where the user could locally render and monitor various instances of ARGoS in their web browser.

ilpincy commented 6 years ago

Wow, thanks a lot for all this work! Really impressive! :-O

I do agree that we should move to a more modern visualization, especially one that allows for models to be imported in a simple way. I have wanted to do it since forever, but never found the time/help to make it happen.

I had a few cracks at Horde3D myself and it would be a good choice too - I even have some initial prototype code done.

I never thought about the option of using WebGL, and it's a really good idea. Love it!

So, how do we proceed? I'd like to have this done. I'll try to find a good student willing to help. If you have any suggestions, I'm all ears.

allsey87 commented 6 years ago

@ilpincy, @beltrame, and @cjcormier: when you get a chance, could you test out my solution / workaround in https://github.com/allsey87/argos3-selection-proposal?

If you could let me know whether this works, along with any additional info such as OS, graphics card / driver, OpenGL version etc., that would be very helpful. I can then finalise this workaround, submit it, and close off this bug.

beltrame commented 6 years ago

Michael Allwright writes:

@ilpincy, @beltrame, and @cjcormier: when you get a chance could test out my solution / work around in: https://github.com/allsey87/argos3-selection-proposal If you could let me know whether this works and any additional info like OS, graphics card / driver, OpenGL version etc that would be very helpful. I can then finalise this work around, submit it, and close off this bug.

For me there's no crash, but I can't select robots either. I'm using OpenSuSE Tumbleweed, Intel HD 620 with i915 driver, GLX 1.4, OpenGL core 4.5.

Regards,

Giovanni Beltrame, PhD, ing.
MIST Lab - mistlab.ca
Ecole Polytechnique de Montreal
Visiting Professor - University of Tübingen

allsey87 commented 6 years ago

@beltrame thanks for testing this. I have only implemented the selection for boxes at the moment. The test file is src/testing/experiment/test-selection.argos. Are you able to select and move the boxes?