Thank you @alibenn for reporting this.
I understand you see this error when you select the Spark backend. Would it be possible for you to test with the local backend and let us know if the error is also there? The fix might need to be added to RDataFrame itself and not PyRDF.
@vepadulano might be able to have a look.
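For reference, switching backends in PyRDF should just be a matter of the usual PyRDF.use call (a minimal sketch, assuming the standard backend names):

```python
import PyRDF

# Select the local backend instead of Spark for the test run.
PyRDF.use("local")
```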
I wasn't using the Spark backend; there is no Spark cluster at the site where the data are. I used whatever the default is and enabled multithreading with ROOT.ROOT.EnableImplicitMT().
In case that matters the software stack used is LCG 96b on centos7 with clang8.
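For context, the setup is essentially the following (a sketch with placeholder file, tree and output names, not the actual analysis script):

```python
import ROOT
import PyRDF

# Default (local) PyRDF backend, with implicit multithreading enabled.
ROOT.ROOT.EnableImplicitMT()

# Placeholder tree and file names; the real job reads a large data set
# and writes an output file that grows past the 100 GB per-file limit.
rdf = PyRDF.RDataFrame("events", "/scratch/input_*.root")
rdf.Snapshot("events", "/scratch/test2.root")
```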
@alibenn can you let us know whether your workaround (using TTree.SetMaxTreeSize) works? We are a bit low on manpower now to tackle PyRDF issues, but @vepadulano will be joining us at the beginning of February and he will be able to take care of it.
It just finished and ran through, so the workaround worked.

real 118m45.256s
user 439m17.230s
sys 28m22.853s

It produced a 120 GB file.
Ok thank you Albert, we will have a look at the issue asap.
Another thing you could try is to run without multi-threading, as the engine to generate the file is different in sequential and MT mode (TFileMerger is not used in sequential mode).
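Concretely, something like this (same placeholder names as in the sketch above; the point is just to leave implicit MT off):

```python
import ROOT
import PyRDF

# No ROOT.ROOT.EnableImplicitMT() call: Snapshot then runs sequentially,
# so the output is written directly by TTree and TFileMerger is not involved.
rdf = PyRDF.RDataFrame("events", "/scratch/input_*.root")
rdf.Snapshot("events", "/scratch/test_sequential.root")
```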
Sure, I will try that overnight. The normal run took 440 CPU minutes so that is what I expect for the runtime. So you want a test where the switchover actually happens (without workaround)?
> So you want a test where the switchover actually happens (without workaround)?
Yes that'd be great!
The test is running. The only relevant changes are the removal of EnableImplicitMT and SetMaxTreeSize.
I have done the test and it also fails at the same place, this time with a segmentation fault. Here is the stack trace. Would another run with the "deb" platform instead of the "opt" one be useful?

...
Fill: Switching to new file: /scratch/test_1.root
 *** Break *** segmentation violation
Thread 2 (Thread 0x7fa3779d1700 (LWP 118427)):
GLIBC_2.2.5 () from /lib64/libpthread.so.0
Thread 1 (Thread 0x7fa3a5f5b740 (LWP 118392)):
===========================================================
real 516m51.622s
user 376m7.708s
sys 4m45.716s
Thank you for running the test, we will report back here when we have news. In the meantime, please keep using your workaround.
Hi @alibenn,

Thank you for your report and sorry for the late reply. I'm investigating the issue. I created a TTree spanning over multiple files (although for practical reasons it's < 100 GB). I couldn't reproduce the issue you have with a sequential Snapshot in PyRDF, but I did reproduce it with a multithreaded Snapshot, both in PyRDF and in ROOT RDataFrame alone. I'll keep you updated.

Cheers,
Vincenzo
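For the record, one way to trigger the file switchover without actually writing ~100 GB is to lower the per-file limit before running the Snapshot. A sketch along these lines (illustrative names and sizes, not the exact reproducer; whether the crash appears depends on the ROOT version):

```python
import ROOT

ROOT.ROOT.EnableImplicitMT()

# Lower the per-file size limit so TTree::ChangeFile triggers early
# instead of at the default ~100 GB (value is illustrative).
ROOT.TTree.SetMaxTreeSize(50 * 1000 * 1000)  # ~50 MB

# Generate enough entries that the output file exceeds the limit; with IMT
# enabled, the switchover is where the reported crash shows up (before the fix).
df = ROOT.RDataFrame(50000000).Define("x", "(double) rdfentry_")
df.Snapshot("events", "/tmp/snapshot_switchover.root")
```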
Hi @alibenn, Indeed there was an issue with upstream ROOT when using Snapshot with IMT enabled. It should be fixed by https://github.com/root-project/root/pull/6570
I still couldn't reproduce the issue you reported for the sequential case. The TTree should change file by default when it grows beyond roughly 100 GB. If I understand correctly, at first you were using SetMaxTreeSize, so switching files before the 100 GB limit, but the last time you tried without that option, so if there's a switchover it means you are storing more than 100 GB in memory?
Let me know if I can still help you.
The IMT side of this issue has been solved in the linked PR. In the absence of a reproducer, I'll close this issue; if it comes up again, a new one can be opened.
When running on a large data set, at some point TTree decides the file is too large and switches to a new file. This happens in TTree::ChangeFile and usually works, but when called from PyRDF the process aborts. The output is:
Fill: Switching to new file: /scratch/test2_1.root
Fatal in: Output file of the TFile Merger (targeting /scratch/test2.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting
...a lot of stack traces...
When I look at the files, the relevant ones are:
So the relevant file is not in excess of 100 GB (though close to it), and the second file, which the output was supposed to switch over to, exists but was not filled significantly. As a workaround I will try running again with ROOT.TTree.SetMaxTreeSize(10000000000000) (10 TB), which effectively disables the file switchover so the bug is not triggered; ideally, though, Snapshot should handle this internally when it encounters a data set larger than the limit.
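For completeness, the workaround amounts to this one global call before building the data frame (a sketch; the limit value is the one mentioned above):

```python
import ROOT

# Raise the maximum single-file size to 10 TB so TTree::ChangeFile
# is never triggered while Snapshot writes the output.
ROOT.TTree.SetMaxTreeSize(10000000000000)
```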