root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

`hadd` segfaults when the output file is too large #10102

Open eguiraud opened 2 years ago

eguiraud commented 2 years ago

To reproduce:

xrdcp root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root .
hadd -ff Run2012B_SingleMu10x.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root
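The command above repeats the same input file ten times. As a convenience, the argument list could be generated rather than typed out. This is a minimal sketch: `build_hadd_command` is a hypothetical helper, not part of ROOT, and it passes the `-ff` flag through exactly as in the command above.

```python
import shlex


def build_hadd_command(target, source, copies, flags=("-ff",)):
    """Return the hadd argument list merging `copies` copies of `source` into `target`."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Layout: hadd <flags> <target> <source> <source> ...
    return ["hadd", *flags, target, *([source] * copies)]


cmd = build_hadd_command("Run2012B_SingleMu10x.root", "Run2012B_SingleMu.root", 10)
# Print a shell-quoted string; the list itself can also be given to subprocess.run.
print(shlex.join(cmd))
```

The resulting list has 13 elements, which matches the `argc=13` reported for `main` in the backtrace further down.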

On my laptop, with current master, this crashes after a few minutes with:

  ~/S/w/coffea-benchmarks (master *=) hadd -ff Run2012B_SingleMu10x.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root Run2012B_SingleMu.root
hadd Target file: Run2012B_SingleMu10x.root
hadd compression setting for all output: 1
hadd Source file 1: Run2012B_SingleMu.root
hadd Source file 2: Run2012B_SingleMu.root
hadd Source file 3: Run2012B_SingleMu.root
hadd Source file 4: Run2012B_SingleMu.root
hadd Source file 5: Run2012B_SingleMu.root
hadd Source file 6: Run2012B_SingleMu.root
hadd Source file 7: Run2012B_SingleMu.root
hadd Source file 8: Run2012B_SingleMu.root
hadd Source file 9: Run2012B_SingleMu.root
hadd Source file 10: Run2012B_SingleMu.root
hadd Target path: Run2012B_SingleMu10x.root:/
Fill: Switching to new file: Run2012B_SingleMu10x_1.root
Fatal in <TFileMerger::RecursiveRemove>: Output file of the TFile Merger (targeting Run2012B_SingleMu10x.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting
#0  0x00007fea7e19b48a in wait4 () from /usr/lib/libc.so.6
#1  0x00007fea7e10d09b in do_system () from /usr/lib/libc.so.6
#2  0x00007fea7ea7fdac in TUnixSystem::Exec (this=0x5593666eb200, shellcmd=0x559368e19160 "/home/blue/ROOT/master/cmake-build-foo/etc/gdb-backtrace.sh 538334 1>&2") at ../core/unix/src/TUnixSystem.cxx:2108
#3  0x00007fea7ea8069e in TUnixSystem::StackTrace (this=0x5593666eb200) at ../core/unix/src/TUnixSystem.cxx:2399
#4  0x00007fea7e911bc1 in DefaultErrorHandler (level=6000, abort_bool=true, location=0x7fea7d7ab1b5 "TFileMerger::RecursiveRemove", msg=0x55936842c8a0 "Output file of the TFile Merger (targeting Run2012B_SingleMu10x.root) has been deleted (likely due to a TTree larger than 100Gb)") at ../core/base/src/TErrorDefaultHandler.cxx:174
#5  0x00007fea7e9ee212 in ErrorHandler(Int_t, const char *, const char *, typedef __va_list_tag __va_list_tag *) (level=6000, location=0x7fea7d7ab1b5 "TFileMerger::RecursiveRemove", fmt=0x7fea7f1cb4c8 "Output file of the TFile Merger (targeting %s) has been deleted (likely due to a TTree larger than 100Gb)", ap=0x7ffcad78dbb0) at ../core/foundation/src/TError.cxx:152
#6  0x00007fea7e92a7de in TObject::DoError (this=0x7ffcad78f2b0, level=6000, location=0x7fea7f1cb532 "RecursiveRemove", fmt=0x7fea7f1cb4c8 "Output file of the TFile Merger (targeting %s) has been deleted (likely due to a TTree larger than 100Gb)", va=0x7ffcad78dbb0) at ../core/base/src/TObject.cxx:860
#7  0x00007fea7e92acd1 in TObject::Fatal (this=0x7ffcad78f2b0, location=0x7fea7f1cb532 "RecursiveRemove", fmt=0x7fea7f1cb4c8 "Output file of the TFile Merger (targeting %s) has been deleted (likely due to a TTree larger than 100Gb)") at ../core/base/src/TObject.cxx:925
#8  0x00007fea7ef90e56 in TFileMerger::RecursiveRemove (this=0x7ffcad78f2b0, obj=0x559367a40820) at ../io/io/src/TFileMerger.cxx:1081
#9  0x00007fea7e9ad0bf in THashList::RecursiveRemove (this=0x5593666f1840, obj=0x559367a40820) at ../core/cont/src/THashList.cxx:354
#10 0x00007fea7e8d4e14 in TROOT::RecursiveRemove (this=0x7fea7ec46740 <ROOT::Internal::GetROOT1()::alloc>, obj=0x559367a40820) at ../core/base/src/TROOT.cxx:2455
#11 0x00007fea80417f82 in ROOT::CallRecursiveRemoveIfNeeded (obj=(TObject) = {...}) at ../core/base/inc/TROOT.h:398
#12 0x00007fea7e927b18 in TNamed::~TNamed (this=0x559367a40820, __in_chrg=<optimized out>) at ../core/base/src/TNamed.cxx:45
#13 0x00007fea7e9081d5 in TDirectory::~TDirectory (this=0x559367a40820, __in_chrg=<optimized out>) at ../core/base/src/TDirectory.cxx:117
#14 0x00007fea7ef7c856 in TDirectoryFile::~TDirectoryFile (this=0x559367a40820, __in_chrg=<optimized out>) at ../io/io/src/TDirectoryFile.cxx:202
#15 0x00007fea7ef9673f in TFile::~TFile (this=0x559367a40820, __in_chrg=<optimized out>) at ../io/io/src/TFile.cxx:566
#16 0x00007fea7ef96776 in TFile::~TFile (this=0x559367a40820, __in_chrg=<optimized out>) at ../io/io/src/TFile.cxx:566
#17 0x00007fea7e9288d9 in TObject::Delete (this=0x559367a40820) at ../core/base/src/TObject.cxx:178
#18 0x00007fea802bbf82 in TTree::ChangeFile (this=0x559368ca6c20, file=0x559367a40820) at ../tree/tree/src/TTree.cxx:2813
#19 0x00007fea802bf66e in TTree::CopyEntries (this=0x559368ca6c20, tree=0x559368ca7a30, nentries=53446198, option=0x7ffcad78e7a1 " fast", needCopyAddresses=true) at ../tree/tree/src/TTree.cxx:3567
#20 0x00007fea802c825a in TTree::Merge (this=0x559368ca6c20, li=0x7ffcad78e560, info=0x7ffcad78e780) at ../tree/tree/src/TTree.cxx:6940
#21 0x00007fea8020066b in ROOT::merge_TTree (obj=0x559368ca6c20, coll=0x7ffcad78e560, info=0x7ffcad78e780) at tree/tree/G__Tree.cxx:4209
#22 0x00007fea7ef8e60d in TFileMerger::MergeOne (this=0x7ffcad78f2b0, target=0x559367a40820, sourcelist=0x7ffcad78f308, type=12, info=..., oldkeyname="", allNames=..., status=0x7ffcad78e6ec: true, onlyListed=0x7ffcad78e6ed: false, path="", current_sourcedir=0x559367c95120, current_file=0x559367c95120, key=0x55936842d580, obj=0x559368ca6c20, nextkey=...) at ../io/io/src/TFileMerger.cxx:660
#23 0x00007fea7ef8f9ae in TFileMerger::MergeRecursive (this=0x7ffcad78f2b0, target=0x559367a40820, sourcelist=0x7ffcad78f308, type=12) at ../io/io/src/TFileMerger.cxx:878
#24 0x00007fea7ef902d4 in TFileMerger::PartialMerge (this=0x7ffcad78f2b0, in_type=12) at ../io/io/src/TFileMerger.cxx:968
#25 0x00007fea7ef8ce3f in TFileMerger::Merge (this=0x7ffcad78f2b0) at ../io/io/src/TFileMerger.cxx:372
#26 0x000055936623997a in operator() (__closure=0x7ffcad78eee0, merger=...) at ../main/src/hadd.cxx:473
#27 0x0000559366239d6e in operator() (__closure=0x7ffcad78ee90, merger=..., start=3, nFiles=10) at ../main/src/hadd.cxx:501
#28 0x000055936623c2a0 in main (argc=13, argv=0x7ffcad78f618) at ../main/src/hadd.cxx:543
fish: Job 1, 'hadd -ff Run2012B_SingleMu10x.r…' terminated by signal SIGABRT (Abort)
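The Fatal message attributes the failure to a TTree larger than 100Gb. A rough way to predict whether a merge will cross that threshold is to sum the input sizes first. The sketch below assumes the limit is 100 decimal gigabytes and that on-disk file size is a usable proxy for tree size. Neither is confirmed here: a file may hold other objects, and the output compression may differ. `exceeds_tree_limit` is a hypothetical helper.

```python
# Assumption: "100Gb" in the Fatal message means 100 * 10**9 bytes.
DEFAULT_MAX_TREE_SIZE = 100_000_000_000


def exceeds_tree_limit(input_sizes, limit=DEFAULT_MAX_TREE_SIZE):
    """Return True if the summed input sizes (in bytes) are above `limit`."""
    return sum(input_sizes) > limit


# Hypothetical size for illustration only; not measured from the benchmark file.
example_size = 15_000_000_000
print(exceeds_tree_limit([example_size] * 10))
```

For real inputs, the sizes could come from `os.path.getsize(path)` for each source file.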

eguiraud commented 2 years ago

Two output files are produced before the crash. `Run2012B_SingleMu10x.root` seems well-formed, while trying to open `Run2012B_SingleMu10x_1.root` results in:

Error in <TFile::ReadBuffer>: error reading all requested bytes from file Run2012B_SingleMu10x_1.root, got 272 of 300
Error in <TFile::Init>: Run2012B_SingleMu10x_1.root failed to read the file type data.
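The "got 272 of 300" message suggests the second file is shorter than the 300 bytes that were requested when opening it. A quick check without ROOT could look like the sketch below. `fails_header_check` is a hypothetical helper. The 300-byte figure is taken from the error above, and the check assumes ROOT files begin with the four bytes `root`.

```python
import os
import tempfile

HEADER_BYTES = 300  # bytes requested by TFile::ReadBuffer in the error above


def fails_header_check(path, header_bytes=HEADER_BYTES):
    """Return True if `path` is shorter than `header_bytes` or lacks the "root" magic."""
    if os.path.getsize(path) < header_bytes:
        return True
    with open(path, "rb") as f:
        # Assumption: a valid ROOT file starts with the ASCII bytes "root".
        return f.read(4) != b"root"


# Demo on a synthetic 272-byte file standing in for the broken output file.
with tempfile.TemporaryDirectory() as tmp:
    stub = os.path.join(tmp, "stub.root")
    with open(stub, "wb") as f:
        f.write(b"root" + b"\0" * 268)
    print(fails_header_check(stub))
```

Passing this check does not mean a file is valid; it only catches the kind of truncation reported above.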

ferdymercury commented 3 months ago

Not sure if this helps, but there was a memory leak in this function: https://github.com/root-project/root/pull/15059

Related:
https://root-forum.cern.ch/t/fatal-in-tfilemerger-recursiveremove-output-file-hadd-100gb-ttree/31846
https://root-forum.cern.ch/t/root-6-04-14-hadd-100gb-and-rootlogon/24581
https://root-forum.cern.ch/t/hadd-100-gb-ttree/38737