vepadulano / PyRDF

Python Library for doing ROOT RDataFrame analysis
https://pyrdf.readthedocs.io/en/latest/

Snapshot fails when TTree file switchover is triggered #91

Closed alibenn closed 3 years ago

alibenn commented 4 years ago

When running on a large data set, at some point TTree decides the file is too large and switches to a new file. This happens in TTree::ChangeFile and usually works, but when called from PyRDF the process aborts. The output is:

Fill: Switching to new file: /scratch/test2_1.root
Fatal in : Output file of the TFile Merger (targeting /scratch/test2.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting
...a lot of stack traces...

When I look at the files the relevant files are:

ls -lh
276 Jan 8 23:17 test2_1.root
94G Jan 8 23:17 test2.root
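A quick unit check may help when reading this listing. The sketch below assumes that `ls -lh` (GNU coreutils) reports sizes in powers of 1024 and rounds up, and that the "100Gb" in the error message means 10^11 bytes. Both are assumptions, not something confirmed in this thread, and the helper name is made up for illustration.

```python
GIB = 1024 ** 3                        # one binary gigabyte, as assumed for `ls -lh`
ASSUMED_LIMIT_BYTES = 100 * 1000 ** 3  # assumption: "100Gb" means 10**11 bytes


def size_range_bytes(shown_gib):
    """Return (low, high) such that a file shown as '<shown_gib>G' has
    low < size <= high, assuming `ls -lh` rounds up to whole GiB."""
    return (shown_gib - 1) * GIB, shown_gib * GIB


low, high = size_range_bytes(94)
print(low, high)
# True if the assumed limit falls inside the possible size range
print(low < ASSUMED_LIMIT_BYTES <= high)
```

Under these assumptions a file listed as 94G could sit right at the 10^11-byte mark, which would be consistent with the switchover being triggered at that point.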

So the relevant file does not exceed 100 GB (though it is close to that), and the second file, which the output was supposed to switch over to, exists but was barely filled. As a workaround I will try running again with ROOT.TTree.SetMaxTreeSize(10000000000000) (10 TB), but this should happen internally when Snapshot encounters a data set larger than the limit, so that the switchover feature is effectively disabled and the bug is not triggered.
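A minimal sketch of the workaround described above, assuming a working PyROOT installation. It uses plain `ROOT.RDataFrame` rather than PyRDF, and the function name and the arguments in the usage comment are hypothetical, not taken from the report.

```python
TEN_TB_BYTES = 10 * 1000 ** 4  # 10000000000000 bytes, the value quoted in the report


def snapshot_without_switchover(tree_name, input_files, output_file,
                                max_bytes=TEN_TB_BYTES):
    """Raise the global TTree size limit, then write a Snapshot.

    Hypothetical helper: it only wraps the SetMaxTreeSize call mentioned
    in the report around an ordinary RDataFrame Snapshot.
    """
    import ROOT  # imported lazily; requires PyROOT

    # The limit is a static, process-wide setting, so set it before the
    # event loop that fills the output tree starts.
    ROOT.TTree.SetMaxTreeSize(max_bytes)

    df = ROOT.RDataFrame(tree_name, input_files)
    df.Snapshot(tree_name, output_file)  # triggers the event loop
    return ROOT.TTree.GetMaxTreeSize()


# Example call (tree name and input path are placeholders):
# snapshot_without_switchover("events", "/data/input_*.root", "/scratch/test2.root")
```

Because the limit is global, it stays raised for any other TTree written later in the same process.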

etejedor commented 4 years ago

Thank you @alibenn for reporting this.

I understand you see this error when you select the Spark backend. Would it be possible for you to test with the local backend and let us know if the error is also there? The fix might need to be added to RDataFrame itself and not PyRDF.

@vepadulano might be able to have a look.

alibenn commented 4 years ago

I wasn't using the Spark backend. There is no Spark cluster at the site where the data are. I used whatever the default is and enabled multithreading with ROOT.ROOT.EnableImplicitMT().

alibenn commented 4 years ago

In case it matters, the software stack used is LCG 96b on CentOS 7 with clang 8.

etejedor commented 4 years ago

@alibenn can you let us know whether your workaround (using TTree.SetMaxTreeSize) works? We are a bit low on manpower now to tackle PyRDF issues, but @vepadulano will be joining us beginning of February and he will be able to take care of it.

alibenn commented 4 years ago

It just finished and ran through, so the workaround worked.
real 118m45.256s
user 439m17.230s
sys 28m22.853s
It produced a 120 GB file.

etejedor commented 4 years ago

Ok thank you Albert, we will have a look at the issue asap.

Another thing you could try is to run without multi-threading, as the engine to generate the file is different in sequential and MT mode (TFileMerger is not used in sequential mode).
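The two modes mentioned above could be selected from a single script along these lines. This is only a sketch, assuming PyROOT is available; the function name and its parameters are hypothetical.

```python
def run_snapshot(tree_name, input_files, output_file, use_imt):
    """Run one Snapshot either sequentially or with implicit multi-threading."""
    import ROOT  # imported lazily; requires PyROOT

    # Implicit MT has to be switched on or off before the RDataFrame is built.
    if use_imt:
        ROOT.ROOT.EnableImplicitMT()
    else:
        ROOT.ROOT.DisableImplicitMT()

    df = ROOT.RDataFrame(tree_name, input_files)
    df.Snapshot(tree_name, output_file)  # triggers the event loop

    # Report which mode was actually active during the run.
    return ROOT.ROOT.IsImplicitMTEnabled()


# Example call for the sequential test (arguments are placeholders):
# run_snapshot("events", "/data/input_*.root", "/scratch/test.root", use_imt=False)
```

Running the same inputs once with `use_imt=True` and once with `use_imt=False` is one way to tell whether a failure is tied to the multi-threaded output path.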

alibenn commented 4 years ago

Sure, I will try that overnight. The normal run took 440 CPU minutes so that is what I expect for the runtime. So you want a test where the switchover actually happens (without workaround)?

etejedor commented 4 years ago

So you want a test where the switchover actually happens (without workaround)?

Yes that'd be great!

alibenn commented 4 years ago

The test is running. The only relevant changes are the removal of EnableImplicitMT and SetMaxTreeSize.

alibenn commented 4 years ago

I have done the test and it also fails at the same place, this time with a segmentation fault. Here is the stack trace. Would another run with the "deb" instead of the "opt" platform be useful?

...
Fill: Switching to new file: /scratch/test_1.root

*** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:

Thread 2 (Thread 0x7fa3779d1700 (LWP 118427)):

0 0x00007fa3a5748afb in do_futex_wait.constprop () from /lib64/libpthread.so.0

1 0x00007fa3a5748b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0

2 0x00007fa3a5748c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0

3 0x00007fa3a5a88418 in PyThread_acquire_lock (lock=0x1c32db0, waitflag=1) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/thread_pthread.h:356

4 0x00007fa3a5a48a22 in PyEval_RestoreThread (tstate=0x440c9a0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:359

5 0x00007fa38da56ea1 in floatsleep (secs=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Modules/timemodule.c:1057

6 time_sleep (self=, args=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Modules/timemodule.c:206

7 0x00007fa3a5a4fb40 in call_function (pp_stack=, oparg=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4376

8 PyEval_EvalFrameEx (f=, throwflag=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:3013

9 0x00007fa3a5a49546 in PyEval_EvalCodeEx (co=, globals=, locals=, args=, argcount=1, kws=0x7fa3a5f1b068, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:3608

10 0x00007fa3a59cf6d5 in function_call (func=0x7fa38e0d0b18, arg=, kw=0x7fa38084c168) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Objects/funcobject.c:523

11 0x00007fa3a59a3b2d in PyObject_Call (func=0x7fa38e0d0b18, arg=0x7fa383109250, kw=0x7fa38084c168) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Objects/abstract.c:2544

12 0x00007fa3a5a5068b in ext_do_call (func=, pp_stack=, flags=, na=, nk=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4690

13 PyEval_EvalFrameEx (f=, throwflag=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:3052

14 0x00007fa3a5a54fb4 in fast_function (func=, pp_stack=0x7fa3779d08d8, n=1, na=, nk=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4461

15 0x00007fa3a5a4f911 in call_function (pp_stack=, oparg=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4396

16 PyEval_EvalFrameEx (f=, throwflag=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:3013

17 0x00007fa3a5a54fb4 in fast_function (func=, pp_stack=0x7fa3779d0a38, n=1, na=, nk=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4461

18 0x00007fa3a5a4f911 in call_function (pp_stack=, oparg=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4396

19 PyEval_EvalFrameEx (f=, throwflag=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:3013

20 0x00007fa3a5a49546 in PyEval_EvalCodeEx (co=, globals=, locals=, args=, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:3608

21 0x00007fa3a59cf6d5 in function_call (func=0x7fa38dc90578, arg=, kw=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Objects/funcobject.c:523

22 0x00007fa3a59a3b2d in PyObject_Call (func=0x7fa38dc90578, arg=0x7fa3831091d0, kw=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Objects/abstract.c:2544

23 0x00007fa3a59b56cf in instancemethod_call (func=0x7fa38dc90578, arg=0x7fa3831091d0, kw=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Objects/classobject.c:2600

24 0x00007fa3a59a3b2d in PyObject_Call (func=0x7fa3a5ec0c80, arg=0x7fa3a5f1b050, kw=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Objects/abstract.c:2544

25 0x00007fa3a5a54921 in PyEval_CallObjectWithKeywords (func=0x7fa3a5ec0c80, arg=0x7fa3a5f1b050, kw=0x0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/ceval.c:4245

26 0x00007fa3a5a8db73 in t_bootstrap (boot_raw=0x4406010) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Modules/threadmodule.c:620

27 0x00007fa3a5a88316 in pythread_wrapper (arg=) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/clang800binutils/LABEL/centos7/build/externals/Python-2.7.16/src/Python/2.7.16/Python/thread_pthread.h:178

28 0x00007fa3a5742e25 in start_thread () from /lib64/libpthread.so.0

29 0x00007fa3a4d63bad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa3a5f5b740 (LWP 118392)):

0 0x00007fa3a4d2a1c9 in waitpid () from /lib64/libc.so.6

1 0x00007fa3a4ca7e52 in do_system () from /lib64/libc.so.6

2 0x00007fa3a4ca8201 in system () from /lib64/libc.so.6

3 0x00007fa39c3f47a9 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

4 0x00007fa39c3f7e5b in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

5 <signal handler called>

6 0x00007fa3885972c7 in ?? ()

7 0x00007fff1967d900 in ?? ()

8 0x00007fa39c2ef0b0 in ?? () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

9 0x00007fa3885c41d0 in ?? ()

10 0x0000000009b95a88 in ?? ()

11 0x00007fa3885e0b90 in ?? ()

12 0x00000000885c2f2c in ?? ()

13 0x00007fa3885e0b90 in ?? ()

14 0x00007fff1967d9b8 in ?? ()

15 0x00007fa39c301a00 in TObject::InheritsFrom(TClass const*) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

16 0x00007fff1967d9b8 in ?? ()

17 0x00007fff1967d950 in ?? ()

18 0x00007fa388597194 in ?? ()

19 0x00007fa39c301a00 in TObject::InheritsFrom(TClass const*) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

20 0x00007fff1967d9b8 in ?? ()

21 0x00007fff1967d9e0 in ?? ()

22 0x00007fa3885ae76b in ?? ()

23 0x0000000000000000 in ?? ()

===========================================================

The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a report at http://root.cern.ch/bugs
Please post the ENTIRE stack trace from above as an attachment in addition to anything else that might help us fixing this issue.

6 0x00007fa3885972c7 in ?? ()

7 0x00007fff1967d900 in ?? ()

8 0x00007fa39c2ef0b0 in ?? () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

9 0x00007fa3885c41d0 in ?? ()

10 0x0000000009b95a88 in ?? ()

11 0x00007fa3885e0b90 in ?? ()

12 0x00000000885c2f2c in ?? ()

13 0x00007fa3885e0b90 in ?? ()

14 0x00007fff1967d9b8 in ?? ()

15 0x00007fa39c301a00 in TObject::InheritsFrom(TClass const*) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

16 0x00007fff1967d9b8 in ?? ()

17 0x00007fff1967d950 in ?? ()

18 0x00007fa388597194 in ?? ()

19 0x00007fa39c301a00 in TObject::InheritsFrom(TClass const*) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so

20 0x00007fff1967d9b8 in ?? ()

21 0x00007fff1967d9e0 in ?? ()

22 0x00007fa3885ae76b in ?? ()

23 0x0000000000000000 in ?? ()

===========================================================

real 516m51.622s
user 376m7.708s
sys 4m45.716s

etejedor commented 4 years ago

Thank you for running the test, we will report back here when we have news. In the meantime, please keep using your workaround.

vepadulano commented 4 years ago

Hi @alibenn,
Thank you for your report and sorry for the late reply. I'm investigating the issue. I created a TTree spanning multiple files (although for practical reasons it's < 100 GB). I couldn't reproduce the issue you have with sequential Snapshot in PyRDF, but I did reproduce it with multithreaded Snapshot, both in PyRDF and in ROOT RDataFrame alone. I'll keep you updated.
Cheers,
Vincenzo

vepadulano commented 4 years ago

Hi @alibenn, Indeed there was an issue with upstream ROOT when using Snapshot with IMT enabled. It should be fixed by https://github.com/root-project/root/pull/6570

I still couldn't reproduce the issue you reported for the sequential case. The TTree should change file by default when it's greater than roughly 100 GB. If I understand correctly, at first you were using SetMaxTreeSize, so switching files before the 100 GB limit, but the last time you tried without that option. So if there's a switchover, does it mean you are storing more than 100 GB in memory?

Let me know if I can still help you.

vepadulano commented 3 years ago

The IMT side of this issue has been solved in the linked PR. In the absence of a reproducer, I'll close this issue; if it comes up again, a new one can be opened.