root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

Missing dependency or clean up in TMVA test/tutorials #16553

Open pcanal opened 3 days ago

pcanal commented 3 days ago

Description

On a large node (127 cores, 128 GB), I ran:

  1. ctest -j 32
  2. ctest --rerun-failed
  3. ctest -j 32

After 1., many tests failed due to lack of resources (running out of threads, see #16552):

347:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression
349:PyMVA-Keras-Multiclass
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
985:tutorial-tmva-TMVA_SOFIE_Keras
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1238:tutorial-tmva-RBatchGenerator_PyTorch-py
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py
1246:tutorial-tmva-TMVA_SOFIE_Models-py
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py
1252:tutorial-tmva-keras-GenerateModel-py
1253:tutorial-tmva-keras-MulticlassKeras-py

However, in 2., several tests still failed (even though resources were no longer an issue):

350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py

The errors listed there included:

IncrementalExecutor::executeFunction: symbol 'saxpy_' unresolved while linking [cling interface function]!
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!
tutorials/tmva/TMVA_SOFIE_RDataFrame.C:29:10: fatal error: 'Higgs_trained_model.hxx' file not found
/tutorials/tmva/TMVA_SOFIE_GNN_Application.C:10:10: fatal error: 'encoder.hxx' file not found

From this I conclude that those tests (in particular TMVA_SOFIE_RDataFrame.C and tutorials/tmva/TMVA_SOFIE_GNN_Application.C) are missing dependencies on tests that failed in the first run.

Note that tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel and tutorial-tmva-TMVA_SOFIE_RDataFrame-py do indeed need TMVA_Higgs_Classification.C to run first (it says so in the output! :) ).

tutorial-tmva-TMVA_SOFIE_RSofieReader is asking for Higgs_trained_model.h5
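
A quick way to tell whether these failures come from missing generated files rather than from the tests themselves is to check for the files named in the errors before rerunning. A minimal sketch: the helper name `missing_inputs` is made up for illustration, and the directory in which the tutorials write their outputs is an assumption that has to be adapted to the build.

```shell
# Hypothetical helper: print every file from the list that is absent in DIR.
# Usage: missing_inputs DIR FILE...
missing_inputs() {
    dir=$1
    shift
    for f in "$@"; do
        [ -e "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# File names are taken from the error messages above; the directory
# (here the current one) is an assumption about where the tutorials run.
missing_inputs . Higgs_trained_model.hxx Higgs_trained_model.h5 encoder.hxx
```

Any name printed is a file that an earlier test or tutorial was expected to produce and did not.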

gtest-tmva-pymva-test-TestRModelParserKeras is missing the symbol sgemm_ (see below)

However, when rerunning everything in 3. (where this time there were somehow no resource-related failures), I still got several failures:

346:gtest-tmva-pymva-test-TestRModelParserPyTorch
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader

all due to:

IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

or both

IncrementalExecutor::executeFunction: symbol 'saxpy_' unresolved while linking [cling interface function]!
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

Could this be due either to a badly formed result left over from the failing run (1.) or to an external package that does not have the correct version number?

Reproducer

ctest -j 32 # and get lots of out of resource failures
ctest --rerun-failed
ctest -j 32
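
To see which failures persist from one step to the next, the list of failed tests can be saved after each run and compared. This is a sketch, assuming that ctest records the most recent failures in Testing/Temporary/LastTestsFailed.log inside the build directory as `number:name` lines (the same shape as the lists above). The function name `persistent_failures` and the saved file names are invented for illustration.

```shell
# Hypothetical helper: print the names of tests listed in both failure logs.
# Each log is expected to hold one "number:name" entry per line.
persistent_failures() {
    a=$(mktemp)
    b=$(mktemp)
    # Drop the leading test number and keep the unique, sorted names.
    cut -d: -f2- "$1" | sort -u > "$a"
    cut -d: -f2- "$2" | sort -u > "$b"
    # comm -12 keeps only the lines common to both sorted files.
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Intended use (commented out because it needs a ROOT build directory):
#   ctest -j 32;          cp Testing/Temporary/LastTestsFailed.log failed-1.log
#   ctest --rerun-failed; cp Testing/Temporary/LastTestsFailed.log failed-2.log
#   persistent_failures failed-1.log failed-2.log
```

Comparing by name rather than by number keeps the result meaningful even if the test numbering differs between runs.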

ROOT version

master

Installation method

hand build

Operating system

Alma9

Additional context

jupyter-pcanal-rootdevel:quick-devel pcanal$ bin/root-config --features
cxx17 asimage builtin_clang builtin_cling builtin_gtest builtin_llvm builtin_lz4 builtin_lzma builtin_nlohmannjson builtin_openui5 builtin_tbb builtin_vdt builtin_xxhash builtin_zlib builtin_zstd clad dataframe davix gdml http imt pyroot roofit root7 rpath runtime_cxxmodules shared sqlite ssl tmva tmva-pymva tpython spectrum vdt x11 xml xrootd

dpiparo commented 3 days ago

Hi @pcanal, thanks for this report. Hopefully the solution will also help with fewer threads. I am not sure, though, that the "unresolved while linking" errors are due to the high thread count. Can you confirm that you do not see these errors with 8-16 threads?

pcanal commented 1 day ago

> I am not sure though that the unresolved while linking is due to the high thread count.

I think you might be right. The best way forward is to track down where those missing symbols are supposed to come from.
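
`sgemm_` and `saxpy_` are the Fortran-style names of single-precision BLAS routines, so one starting point is to check which shared libraries on the machine actually export them. The following is a sketch using `nm` from binutils. The function name `find_symbol` is invented here, and the library paths in the usage comment are guesses for an Alma9-like layout, not taken from the report.

```shell
# Hypothetical helper: print each library from the list that defines SYMBOL
# in its dynamic symbol table.
# Usage: find_symbol SYMBOL LIB...
find_symbol() {
    sym=$1
    shift
    for lib in "$@"; do
        # The last field of each nm line is the symbol name; a possible
        # "@VERSION" suffix is stripped before comparing.
        if nm -D --defined-only "$lib" 2>/dev/null |
            awk -v s="$sym" '{ n = $NF; sub(/@.*/, "", n); if (n == s) found = 1 }
                             END { exit !found }'
        then
            printf '%s\n' "$lib"
        fi
    done
}

# Example (paths are assumptions; adapt them to the local installation):
#   find_symbol sgemm_ /usr/lib64/libopenblas*.so* /usr/lib64/libblas*.so*
```

If no library is printed, no candidate exports the symbol. If one is printed, the next question is why it is not loaded when the failing tests run.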

dpiparo commented 16 hours ago

Thanks for the comment. At this point this issue seems to conflate two things:

  1. The dependencies of Python tests. This should have been addressed by #16555
  2. The missing symbols.

If 1. is confirmed to be solved, I would say that at least this issue ought to be closed and one about missing symbols opened. However, even if an issue dedicated to the missing symbols is opened, it's not clear, at least to me, how the problem can be reproduced. So far we have no indication of it in our CI: can it be due to a somewhat imprecise formulation of the Python dependencies in the requirements.txt file that affects your platform?