root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

Compatibility issue: File written with root 6.32/02 cannot be read back with root 6.10/06 #15964

Open wlampl opened 4 months ago

wlampl commented 4 months ago

Check duplicate issues.

Description

While trying to update to LCG_106_ATLAS_3 (root 6.32/02) we encountered a test failure. An intermediate file produced with this release could not be read back with an older release (6.10/06, 6.08.06): we encounter a segfault when the file is closed.

Background: ATLAS Trigger simulation of run 2 uses the release that was used for data-taking during run 2.

Reproducer

I copied the intermediate file and the reproducer script to /afs/cern.ch/work/w/wlampl/public/ATEAM-1001. The script is quite simple:

from ROOT import TFile
f=TFile.Open("tmp.RDO")
f.ls()
t=f.Get("CollectionTree")
n=t.GetEntries()
for i in range(n):
    s=t.GetEntry(i)
    print(s)
f.Close()

For ROOT versions back to about 6.16.00 it works as expected. Running with 6.08.06 and 6.10.06 (in a centos7 container), I encounter a segfault at the end. A log can be found in /afs/cern.ch/work/w/wlampl/public/ATEAM-1001/log.22.0.0
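Because the crash only appears when the process shuts down, an automated check has to look at the exit status of a child process rather than at the script's printed output. The following is a minimal sketch under stated assumptions: the helper name `run_and_check` is hypothetical, and the thread does not say how the old ROOT release reports the crash to the parent, so any non-zero status is treated as a failure.

```python
import subprocess
import sys


def run_and_check(cmd, timeout=600):
    """Run cmd in a child process and return (ok, returncode, stderr).

    ok is True only when the child exits with status 0. Any other status,
    including a negative one (which Python uses on POSIX when the child
    was killed by a signal), counts as a failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, result.returncode, result.stderr
```

A possible use would be `run_and_check(["python", "reproducer.py"])` inside the environment of the old release, where `reproducer.py` stands for the script above saved to a file (the file name is an assumption).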

ROOT version

Writing: 6.32/02 Reading: 6.10/06 or 6.08.06

Installation method

SFT/LCG

Operating system

CentOS7

Additional context

No response

Nowakus commented 4 months ago

Let me add a reproducer where you only need to open the file and try to exit:

% setupATLAS -c centos7 --pwd /afs/cern.ch/work/w/wlampl/public/ATEAM-1001
% asetup Athena,21.0,latest
% root -b tmp.RDO

| Welcome to ROOT 6.08/06 http://root.cern.ch |
Attaching file tmp.RDO as _file0...
Warning in : no dictionary for class ROOT::TIOFeatures is available
(TFile *) 0x29cf190
root [1] .q

Break segmentation violation
This is the entire stack trace of all threads:

0 0x00007f6cdd6c560c in waitpid () from /lib64/libc.so.6

1 0x00007f6cdd642f62 in do_system () from /lib64/libc.so.6

2 0x00007f6cdecce102 in TUnixSystem::StackTrace() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

3 0x00007f6cdecd061c in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

4

5 0x0000000001209080 in ?? ()

6 0x00007f6cdec52005 in TList::FindObject(TObject const*) const () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

7 0x00007f6cdec5237c in TList::Clear(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

8 0x00007f6cdec50a01 in THashTable::Clear(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

9 0x00007f6cdec504dd in THashList::Clear(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

10 0x00007f6cdec9d1a7 in TListOfDataMembers::Unload() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

11 0x00007f6cdec7f2d0 in TClass::SetUnloaded() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

12 0x00007f6cdec4a574 in ROOT::RemoveClass(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

13 0x00007f6cdec9926e in ROOT::TGenericClassInfo::~TGenericClassInfo() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so

14 0x00007f6cdd639ce9 in __run_exit_handlers () from /lib64/libc.so.6

jcatmore commented 4 months ago

Hi @martamaja10 ,

thanks for looking at this. We see you've assigned @dpiparo but we understand that he's away for a couple of weeks, and ideally we'd like this to be addressed sooner if possible. Is there someone else in the team who could look at this before then?

The problem is, this issue prevents us from using LCG106 and so it holds up several developments.

Thanks!

James

martamaja10 commented 4 months ago

Hi @jcatmore,

sure, I'll find another person in the team to take a look at this ASAP.

Cheers, Marta

pcanal commented 4 months ago

Most likely backporting this commit: https://github.com/root-project/root/commit/08b34d72a800bd48ea4655f17075de0ef3ca72cb will fix the problem.

pcanal commented 4 months ago

See https://github.com/root-project/root/pull/15968 and https://github.com/root-project/root/pull/15969

jblomer commented 4 months ago

This issue is most likely due to a change that inadvertently broke forward compatibility: https://github.com/root-project/root/issues/14793

You should have seen this already with 6.30 though. Is there an explanation why 6.30 did not trigger the error?

There are two ways to proceed (if the issue is what we think it is):

The second option would be useful to run at least once to confirm that we identified the right cause.

Nowakus commented 4 months ago

Is there any drawback in doing SetBit(TFile::k630forwardCompatibility) for every file we produce now?

pcanal commented 4 months ago

> Is there any drawback in doing SetBit(TFile::k630forwardCompatibility) for every file we produce now?

The main drawback is forgetting to eventually remove it :). The technical drawback is slightly worse and less stable compression (see for example https://github.com/root-project/root/issues/12438).

jcatmore commented 4 months ago

> You should have seen this already with 6.30 though. Is there an explanation why 6.30 did not trigger the error?

Just to comment on 6.30: we did not look at this release beyond a compilation test, so most likely the issue is there as well, as you expect.

dpiparo commented 4 months ago

Hi. I just wanted to understand whether the issue was investigated further on the ATLAS side.

jchapman-hep commented 4 months ago

We have added a call to SetBit(TFile::k630forwardCompatibility) when writing files that will need to be read by old release branches as part of our standard workflows for earlier LHC runs. This allowed the jobs using older releases to run successfully. It is necessary because the ability to simulate our Trigger is tied to the releases that were used for data-taking at that time. We would rather not have to do this, of course.
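For readers who want to see what such a call might look like from PyROOT, here is a minimal sketch. It is not the ATLAS implementation: the helper name `open_forward_compatible` is hypothetical, and setting the bit immediately after opening the file, before anything is written, is an assumption rather than something stated in this thread. It needs a ROOT release that defines `TFile::k630forwardCompatibility`.

```python
def open_forward_compatible(path, mode="RECREATE"):
    """Open a ROOT file for writing and set the forward-compatibility bit."""
    # Imported lazily so this module can be loaded where ROOT is not installed.
    import ROOT

    f = ROOT.TFile.Open(path, mode)
    if not f or f.IsZombie():
        raise OSError(f"could not open {path!r} in mode {mode!r}")
    # Assumption: the bit is set before any object is written to the file.
    f.SetBit(ROOT.TFile.k630forwardCompatibility)
    return f
```

As noted earlier in the thread, the bit comes with slightly worse and less stable compression, so a helper like this is best limited to files that old releases really have to read.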

dpiparo commented 4 months ago

I am sorry ROOT did not work out of the box in this case. We are really working hard to provide not only backward but also forward compatibility. In this particular situation, it was not possible.

jchapman-hep commented 3 months ago

Hi @dpiparo,

We understand why a fix on your side was not possible in this case, but can you confirm that the workaround of reading files in older releases (6.10/06, 6.08.06) will be part of your tests going forward please? ATLAS will need this feature to be supported for new ROOT versions until such time as we decide to change our support policy for legacy data. (This currently requires Trigger Simulation to be run in the data-taking release from the year in question.)

pcanal commented 3 months ago

On a side note, we back-ported the ability to read the files without the forward compatibility bit to the patch branch for v6.10 and v6.08.