root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern
Other
2.65k stars 1.26k forks source link

TMVA missing protection against mis-configuration. #15211

Open pcanal opened 5 months ago

pcanal commented 5 months ago

Check duplicate issues.

Description

As seen on https://github.com/root-project/roottest/pull/1106#issuecomment-2050216040 and https://github.com/root-project/roottest/pull/1106#issuecomment-2050220592 there are several tutorials failures in the TMVA tutorials.

We do not see those failures in the new CI thus it could be that either the new CI is not running them (or ignoring the failures) or there is a missing protection against a bad/not-supported configuration.

This is a blocker to completely retiring the jenkins build as they currently "appear" to show failure that the new CI does not detect.

Reproducer

A sample of the failure output in case the log are deleted can be found at https://gist.github.com/pcanal/d68be81d3ea618a83fb50a11d4ea1d1d

ROOT version

master

Installation method

Jenkins/CI

Operating system

Ubuntu 22.04

Additional context

Seen on ROOT-ubuntu2004/python3. Running on root-ubuntu-2004-1.cern.ch and ROOT-ubuntu2204/nortcxxmod.Running on root-ubuntu-2204-2.cern.ch

vepadulano commented 5 months ago

My two cents is we should disable the tests (and stop using Jenkins for PR builds)

pcanal commented 5 months ago

My two cents is we should disable the tests

Do you mean short term while we fix the issue or do you mean just ignore the issue?

In this case, unless the tutorial (!) are showing something that the user are very unlikely to do, I think we need to go further and make sure that this same crash/problem won't happen to the user.

(and stop using Jenkins for PR builds)

I agree ... except that they seems to be probing an area of the phase space we are not testing elsewhere.

Note: if the issue is as simple as "the installed version of some dependent product is too old so there is no point in fixing the problem", we still need to fix the CMake configuration to fail when asked to use those older version.

vepadulano commented 5 months ago

Do you mean short term while we fix the issue or do you mean just ignore the issue? In this case, unless the tutorial (!) are showing something that the user are very unlikely to do, I think we need to go further and make sure that this same crash/problem won't happen to the user.

If the feature does not work, and it is advertised as a public tutorial, then it should be either fixed soon (i.e. before next release imho) or we need to back down on the feature and inject some sort of warning mechanism that the feature is not really working. Same as we had to recently do for https://github.com/root-project/root/issues/15197 via https://github.com/root-project/root/pull/15198.

We cannot provide tutorials that just do not work, so I don't see a third way

lmoneta commented 5 months ago

This is a again the same problem which appeared before, see https://github.com/root-project/root/pull/14716 The conflict between std::regexp and pytorch jit. Either we can avoid using in pytorch or we somehow try to live with this issue hoping at some point it will be fixed in pytorch. We don't see in the new CI because pytorch is not installed in the ubuntu nodes. I will add it before retiring the old CI

pcanal commented 5 months ago

Where do we still use std::regexp ?

guitargeek commented 5 months ago

In cppyy, and we know for sure that that usage is problematic because we see the same problem with upstream cppyy: https://github.com/wlav/cppyy/issues/227 https://github.com/root-project/root/blob/master/bindings/pyroot/cppyy/CPyCppyy/src/Converters.cxx#L61

So it's not only pytorch, but also xgboost. For the xgboost case, I could fix the crash by importing xgboost before ROOT. Maybe for pytorch that works too?

pcanal commented 5 months ago

By curiosity, did we notice the clash during the CPPYY merge/update process? If not in which other step did we see it (and/or miss seeing it when we should). I.e. I am asking in the context of improving our processes.

guitargeek commented 5 months ago

It didn't hit me when working on the cppyy upgrade (#14507), all tests on Jenkins and the new CI were green there.

I only noticed it a few days ago when enabling the xgboost tests to test my new development here: https://github.com/root-project/root/pull/15173

Why these TMVA tutorial failures come only now several weeks after the cppyy upgrade? I don't know. I'm just catching up with the recent discussions on Python requirements

pcanal commented 5 months ago

Why these TMVA tutorial failures come only now several weeks after the cppyy upgrade?

This is a problem in itself :(. From my record (it was nice to received failure report by emails, it seems we don't get that from the new CI), that this is the first failure:

https://lcgapp-services.cern.ch/root-jenkins/job/root-pullrequests-handler/30970/console

I.e on the nightly of April 3rd, 2024 (9 days ago).

But rather than time it might also be -DCTEST_TEST_EXCLUDE_NONE=On that uncovered it.

pcanal commented 5 months ago

And indeed the Ubuntu node are build by default without the -DCTEST_TEST_EXCLUDE_NONE=On

pcanal commented 5 months ago

So my understanding is that we need to:

  1. Continue having the new CI run all tests :)
  2. Do something about either CPPYY's use of std::regexp or TMVA's use of PyTorch (This could be removing support for PyTorch or temporarily disabling the tutorials until the (newly created and/or this) Issue noting the conflict CPPYY/PyTorch is resolved (for example if the early load mentioned early works).
  3. Update one of the Ubuntu configuration to have PyTorch installed (if and only if we continue supporting PyTorch ; as we can see here if we don't test it we should assume it does not work :) ]

Am I missing something?

wlav commented 5 months ago

I tried reducing the number of symbols that libcppyy.so exports using -fvisibility=hidden, combined with explicit exports of the public API functions. That works fine (is, after all, already done this way for Windows), but doesn't resolve the issue: the regex symbols remain. I found that the standard C++ headers force __attribute__((__visibility__("default"))), so there's no way to control this.

wlav commented 5 months ago

If I replace lib = ctypes.cdll.LoadLibrary(lib_path) in xgboost/core.py with lib = ctypes.CDLL(lib_path, 8), then that solves the issue.