root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

[RDataFrame] Unable to cacheread remote file #15028

Open AlkaidCheng opened 8 months ago

AlkaidCheng commented 8 months ago


Description

When the input files to RDataFrame are remote, forcing cache reads of remote files has no effect: the files are downloaded again on every run.

Reproducer

import os
import ROOT

user = os.environ['USER']
outdir = f"/eos/user/{user[0]}/{user}"
filename = os.path.join(outdir, "test.root")
# create dummy root file
ROOT.RDataFrame(100).Define("x", "1").Snapshot("test", filename)

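# use /tmp as the cache directory (operateDisconnected=True, forceCacheread=True)
ROOT.TFile.SetCacheFileDir("/tmp", True, True)
# expected: the remote file is read via the local cache; observed: the cache is bypassed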
ROOT.RDataFrame("test", f"root://eosuser.cern.ch/{filename}").Sum("x").GetValue()

This happens because RDataFrame internally creates its TChain via ROOT.Internal.TreeUtils.MakeChainForMT(treename), which constructs the TChain in ROOT.TChain.kWithoutGlobalRegistration mode. That mode in turn forces the TFile open option to be "READ_WITHOUT_GLOBALREGISTRATION". As a result, the TFile is opened without caching, because TFile::Open only checks the fgCacheFileForce flag when the option is exactly "READ".
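
The mismatch can be illustrated with a small pure-Python model of the check described above. This is a simplified sketch, not ROOT's actual implementation: should_read_from_cache is a hypothetical stand-in for the decision made when the file is opened, and cache_file_force stands in for fgCacheFileForce.

```python
def should_read_from_cache(option: str, cache_file_force: bool) -> bool:
    """Hypothetical model: forced caching applies only if the option is exactly "READ"."""
    return option.upper() == "READ" and cache_file_force


# Plain open with "READ": forced caching is honoured.
plain = should_read_from_cache("READ", cache_file_force=True)

# Option produced by a TChain in kWithoutGlobalRegistration mode:
# the exact comparison fails, so the cache is bypassed.
chain = should_read_from_cache("READ_WITHOUT_GLOBALREGISTRATION", cache_file_force=True)

print(plain, chain)
```

In this model the force flag is set in both calls, yet only the plain "READ" option reaches the cached path, which matches the behaviour seen in the reproducer.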

ROOT version

6.30/04 (LCG105a)

Installation method

LCG (Swan)

Operating system

Linux

Additional context

No response

AlkaidCheng commented 8 months ago

I think one possible solution would be to manually edit the options (like here) inside TFile::Open (i.e. somewhere here) so that the _WITHOUT_GLOBALREGISTRATION suffix does not interfere with the remote caching decision.
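
A rough sketch of that idea, written in Python for illustration only. The helper names strip_registration_suffix and should_read_from_cache_fixed are hypothetical, and the real change would have to be made in the C++ code of TFile::Open. The assumption here is that removing the suffix before the comparison is enough to restore the cached path.

```python
SUFFIX = "_WITHOUT_GLOBALREGISTRATION"


def strip_registration_suffix(option: str) -> str:
    """Return the open option without the global-registration suffix (hypothetical helper)."""
    opt = option.upper()
    return opt[: -len(SUFFIX)] if opt.endswith(SUFFIX) else opt


def should_read_from_cache_fixed(option: str, cache_file_force: bool) -> bool:
    """Make the caching decision on the base option only (model of the proposed fix)."""
    return strip_registration_suffix(option) == "READ" and cache_file_force


# With the suffix removed first, the TChain-style option also reaches the cached path.
fixed = should_read_from_cache_fixed("READ_WITHOUT_GLOBALREGISTRATION", True)
print(fixed)
```

Only the caching decision looks at the stripped option in this sketch. The original option string would still be needed afterwards so that the file is opened without global registration.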