mctools / dgcode

DGCode: the coding framework of the ESS Detector Group. For more information, refer to https://confluence.esss.lu.se/display/DGCODE/CodingFramework
Other
2 stars 1 forks source link

seg fault in ess_griffanatests_testextract #3

Closed tkittel closed 8 months ago

tkittel commented 1 year ago

Seg-fault, only on ubuntu22 + clang and only in release build:

https://github.com/mctools/dgcode/actions/runs/3327354892/jobs/5502075125

====>
427
====> Last 20 lines from ess_griffanatests_testextract/output.log:
428
====>
429
Extracted 1 events from input file into new file out.griff
430
GriffDataReader opened file 10evts_singleneutron_on_b10_reduced.griff
431
Copied event #0 to output file
432
Copied event #1 to output file
433
Copied event #2 to output file
434
Copied event #3 to output file
435
Copied event #4 to output file
436
Copied event #5 to output file
437
Copied event #6 to output file
438
Copied event #7 to output file
439
Copied event #8 to output file
440
Copied event #9 to output file
441
No more events requested from input file - ending.
442
Extracted 10 events from input file into new file out.griff
443
GriffDataReader opened file 10evts_singleneutron_on_b10_reduced.griff
444
Copied event #0 to output file
445
Copied event #7 to output file
446
No more events requested from input file - ending.
447
Extracted 2 events from input file into new file out.griff
448
/home/runner/work/dgcode/dgcode/projects/install/scripts/ess_griffanatests_testextract: line 11: 12994 Segmentation fault      (core dumped) ess_griffformat_dumpfile out.griff
449
====> (end of ess_griffanatests_testextract/output.log)
tkittel commented 1 year ago

Sigh, I am finding it impossible to reproduce this issue locally. Even using a conda env with clang14+py3.10 on my ubuntu 20 machine did not help.

MilanKlausz commented 1 year ago

I've created a test_debugging_with_tmate workflow that uses the Debugging with tmate GH action, which allows connecting to github runner through ssh or a web sell:

Note: press q in the blank web shell (the message that appears in the terminal through ssh doesn't appear in the web shell)

tkittel commented 1 year ago

Very cool @MilanKlausz !! So that already tells me that the seg fault happens in:

python -c 'import GriffFormat.dumpfile'

But I have to run now unfortunately, I don't see anything obviously wrong...

tkittel commented 1 year ago

Trying locally to see if a new dgboost version (#18) would help. With that, I get a segfault in a different test, in GriffDataRead/pycpp__init/materials.hh. It appears related to the py_elem_iter and py_isotope_iter, which gets destructed more often than constructed. Too be investigated.

tkittel commented 1 year ago

Commit a22fada fixes a segfault I observed when testing out dgboost v1.80, but I am not actually sure that it alone will fix the problem in the present issue, so reopening.

MilanKlausz commented 1 year ago

Raises new issues. (Sorry, the validations test setups for ubuntu were left commented out)

For ubuntu + clang (e.g., ubuntu-20.04.clang-12.python-3.9.dgcode-debug)

In file included from /home/runner/work/dgcode/dgcode/dgcode_framework/packages/Framework/Griff/GriffDataRead/pycpp__init/mod.cc:11:
/home/runner/work/dgcode/dgcode/dgcode_framework/packages/Framework/Griff/GriffDataRead/pycpp__init/materials.hh:13:12: error: local variable 'l' will be copied despite being returned by name [-Werror,-Wreturn-std-move]
    return l;
           ^
/home/runner/work/dgcode/dgcode/dgcode_framework/packages/Framework/Griff/GriffDataRead/pycpp__init/materials.hh:13:12: note: call 'std::move' explicitly to avoid copying
    return l;
           ^
           std::move(l)
/home/runner/work/dgcode/dgcode/dgcode_framework/packages/Framework/Griff/GriffDataRead/pycpp__init/materials.hh:23:12: error: local variable 'l' will be copied despite being returned by name [-Werror,-Wreturn-std-move]
    return l;
           ^
/home/runner/work/dgcode/dgcode/dgcode_framework/packages/Framework/Griff/GriffDataRead/pycpp__init/materials.hh:23:12: note: call 'std::move' explicitly to avoid copying
    return l;
           ^
           std::move(l)
2 errors generated.
make: *** [/home/runner/work/dgcode/dgcode/projects/.bld/makefiles/Makefile_GriffDataRead.make:72: /home/runner/work/dgcode/dgcode/projects/.bld/pc/GriffDataRead/objs/pycpp__init/mod.o] Error 1
make: *** Waiting for unfinished jobs....
ERROR: Build problems encountered

For macos (e.g., macos-11.clang.python-default.dgcode-debug)

Package NCrystalExtra done
  Creating shared library for package DGBoost
clang: error: no such file or directory: 'LINKER:-rpath,/Library/Frameworks'
make: *** [/Users/runner/work/dgcode/dgcode/projects/.bld/named_targets/DGBoost_libsrc] Error 1
make: *** Waiting for unfinished jobs....
ERROR: Build problems encountered
tkittel commented 1 year ago

Sigh... did you check @MilanKlausz if this is easy to reproduce locally on a mac? In any case, I probably won't have much time until next week.

MilanKlausz commented 1 year ago

Locally I had the 'local variable 'l' ...' clang issue - that was solved by the use of the suggested std::move() -, but don't get the 'no such file or directory: 'LINKER:-rpath,/Library/Frameworks' error. It might be because I have AppleClang 12 (with macOS 10.15), and the Github macOS-11 and 12 runners have AppleClang 13 and 14. (There is an ubuntu-22.04 + clang-14 job that passes.)

The 'debugging with tmate' tool is now available in the dgcode_actions/run_validation_tests workflow through an input option #15, and this workflow also uses a dynamic matrix #16, so it is easy to start a debugging session on a problematic macos runner. (The 'Do not install Geant4' option should be used, as we don't have working caches for the macos runners at the moment) Screenshot 2022-11-09 at 0 59 55

MilanKlausz commented 1 year ago

About the 'clang: error: no such file or directory: 'LINKER:-rpath,/Library/Frameworks'' issue: The LINKER:-rpath,/Library/Frameworks part is added to the "clang++ ..." command through the DG_GLOBAL_LINK_FLAGS in the .system/cmakedetect/ExtDep_Python.cmake file

if ( Python3_LINK_OPTIONS )
    set(DG_GLOBAL_LINK_FLAGS "${DG_GLOBAL_LINK_FLAGS} ${Python3_LINK_OPTIONS}")
endif()

I found that, the “LINKER:,” syntax is for handling multiple compilers: "(…) “LINKER:” is replaced by the appropriate driver option and “,” by the appropriate driver separator. The driver prefix and driver separator are given by the values of the CMAKE_<LANG>_LINKER_WRAPPER_FLAG and CMAKE_<LANG>_LINKER_WRAPPER_FLAG_SEP variables. (...) For example, “LINKER:-z,defs” becomes -Xlinker -z -Xlinker defs for Clang and -Wl,-z,defs for GNU GCC.”

I have no idea when this Python3_LINK_OPTIONS ("Some configurations require specific link options for a correct build and execution.“) is used, but is it possible that it has always been empty before, and adding it to the DG_GLOBAL_LINK_FLAGS never really happened before? I guess the transformation of the "LINKER:," syntax in cmake happens at a later point then where we end using it.

(Note: the cmake version is the same 3.24.2 for macos-11, macos-12 and ubuntu 22.04)

tkittel commented 1 year ago

Thanks for debugging Milan. If you look at the very recent (Nov 1) change in a3836b3 you will see that I now for the first time start using the find_package (Python3 COMPONENTS Interpreter Development) method of locating Python. Until a few years ago it did not work very well, but I have wanted to migrate to it for some time. Moving to boost 1.80 made me need to do this migration now, so it is indeed all related.

Now, the question is how to fix it. I guess we have to create a small function which will expand the LINKER: parts as you describe. Would be great if there was already such a function in cmake, seems a bit annoying to have to do this manually.

tkittel commented 1 year ago

I think I managed to make a function which can decode this LINKER: syntax, which we can use. For reference it and a few test calls are included here. I will add it as a utility function in dgcode and use it in the ExtDep_Python.cmake file. It might be we should use it elsewhere as well, but for now let us see if this resolves the immediate issue:

cmake_minimum_required(VERSION 3.12.4...3.24.2)

project(DUMMY LANGUAGES C)

function( decode_link_options link_options language resvar )
  #Decodes the LINKER: syntax
  #(cf. https://cmake.org/cmake/help/latest/command/add_link_options.html)

  if( "${CMAKE_VERSION}" VERSION_LESS "3.13.0")
    #LINKER: syntax not introduced yet:
    set( ${resvar} "${link_options}" PARENT_SCOPE )
  endif()

  if ( link_options MATCHES "SHELL:" )
    message(FATAL_ERROR "\"SHELL:\" syntax encountered in link options. This is currently not supported.")
  endif()

  if ( NOT link_options MATCHES "LINKER:" )
    set( ${resvar} "${link_options}" PARENT_SCOPE )
    return()
  endif()
  string(REPLACE " " ";" link_opts_list "${link_options}")
  list(JOIN CMAKE_${language}_LINKER_WRAPPER_FLAG " " wrapflag )
  if ( wrapflag MATCHES "  $")
    string(REGEX REPLACE "  $" " " wrapflag "${wrapflag}")
  endif()
  #set( wrapflag "${CMAKE_${language}_LINKER_WRAPPER_FLAG}" )
  if ( DEFINED CMAKE_${language}_LINKER_WRAPPER_FLAG_SEP )
    set( has_wrapflagsep ON )
    set( wrapflagsep "${CMAKE_${language}_LINKER_WRAPPER_FLAG_SEP}" )
  else()
    set( has_wrapflagsep OFF )
  endif()

  set(tmp "")
  foreach( elem ${link_opts_list} )
    if (  elem MATCHES "^LINKER:" )
      string(SUBSTRING "${elem}" 7 -1 elem )
      if ( NOT elem MATCHES "," )
        list(APPEND tmp "${wrapflag}${elem}")
      else()
        if ( has_wrapflagsep )
          string(REPLACE "," "${wrapflagsep}" elem "${elem}")
          list(APPEND tmp "${wrapflag}${elem}")
        else()
          string(REPLACE "," " ${wrapflag}" elem "${elem}")
          list(APPEND tmp "${wrapflag}${elem}")
        endif()
      endif()
    else()
      list(APPEND tmp "${elem}")
    endif()
  endforeach()
  list(JOIN tmp " " tmp)
  set( ${resvar} "${tmp}" PARENT_SCOPE )
endfunction()

function(test teststr)
  message("")
  message("IN : ${teststr}")
  decode_link_options( "${teststr}" C out )
  message("OUT: ${out}")
endfunction()

test("hejsa -a -b      -c")
test("hejsa -a -b -c")
test("hejsa -a -b   \n   -c")
test("LINKER:w bla")
test("--justaflag LINKER:-z,defs bla.so")

#gcc:
test("LINKER:-z,defs")

#clang:
message("------------------------")
message("Clang test (should give \"-Xlinker -z -Xlinker defs\")")
set (CMAKE_C_LINKER_WRAPPER_FLAG "-Xlinker" " ")
unset(CMAKE_C_LINKER_WRAPPER_FLAG_SEP)
test("LINKER:-z,defs")

#gcc:
message("------------------------")
message("GCC test (should give \"-Wl,-z,defs\")")
set (CMAKE_C_LINKER_WRAPPER_FLAG "-Wl,")
set(CMAKE_C_LINKER_WRAPPER_FLAG_SEP ",")
test("LINKER:-z,defs")

#sunpro:
message("------------------------")
message("sunpro test (should give \"-Qoption ld -z,defs\")")
set (CMAKE_C_LINKER_WRAPPER_FLAG "-Qoption" "ld" " ")
set (CMAKE_C_LINKER_WRAPPER_FLAG_SEP ",")
test("LINKER:-z,defs")
tkittel commented 1 year ago

Making the changes now. Instead of editing ExtDep_Python.cmake, I simply decode all link options in WriteResults.cmake just before extraction, which hopefully means that it will catch all future issues of the same kind.

Btw., our cmake project() call was missing LANGUAGES C CXX parameters, which baffles me a bit.

tkittel commented 1 year ago

Pushed now, let is see if I broke everything (well, my ubuntu where there are no LINKER: flags did not break).