scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Writing remote files via XRootD #738

Closed dcervenkov closed 1 year ago

dcervenkov commented 1 year ago

Uproot can open and read remote files, but AFAIK it cannot write them: recreate() accepts only a local path.

Do you plan to support this? Or is there a reason why this can't work/is problematic?

Moelf commented 1 year ago

Just write to a local file and copy it over with xrdcp.
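A minimal sketch of the upload half of that workaround, assuming the `xrdcp` command-line client is installed and on `PATH`. The helper names `xrdcp_command` and `upload` are made up for illustration, and the URL in the comment is a placeholder. The ROOT file itself would be written beforehand with `uproot.recreate()` on a local path.

```python
import shutil
import subprocess


def xrdcp_command(local_path, remote_url, overwrite=False):
    """Build the xrdcp argument list (hypothetical helper, not part of Uproot)."""
    command = ["xrdcp"]
    if overwrite:
        command.append("--force")  # replace the destination if it already exists
    command += [local_path, remote_url]
    return command


def upload(local_path, remote_url, overwrite=False):
    """Copy a finished local file to a remote XRootD URL with xrdcp."""
    if shutil.which("xrdcp") is None:
        raise RuntimeError("xrdcp was not found on PATH")
    # check=True raises CalledProcessError if the copy fails
    subprocess.run(xrdcp_command(local_path, remote_url, overwrite), check=True)


# Placeholder usage, after writing out.root locally:
# upload("out.root", "root://host.example//store/user/out.root")
```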

dcervenkov commented 1 year ago

Yes, that's the workaround I'm currently using, but I'm wondering whether one could use iterate and recreate to process, chunk by chunk, large files that don't fit into local storage.
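Something close to this is possible with the local-file workaround if the output may be split into several part files instead of one. The sketch below is only an illustration under that assumption: `process_in_parts`, `write_root_part`, the tree name `"events"`, and the part-file naming are all invented here, and `upload` is any callable that ships a local file to remote storage (for example, by running xrdcp). It also assumes each batch is a dict of flat NumPy arrays, as `uproot.iterate(..., library="np")` yields for simple branches.

```python
import os
import tempfile


def write_root_part(path, batch):
    """Write one batch (dict of NumPy arrays) as a TTree named "events"."""
    import uproot  # imported lazily; only needed when real ROOT output is written

    with uproot.recreate(path) as out:
        out["events"] = batch


def process_in_parts(batches, upload, write_part=write_root_part, workdir=None):
    """Write each batch to a temporary local file, hand it to upload, delete it."""
    uploaded = []
    with tempfile.TemporaryDirectory(dir=workdir) as tmp:
        for index, batch in enumerate(batches):
            name = f"part_{index:04d}.root"
            local_path = os.path.join(tmp, name)
            write_part(local_path, batch)
            upload(local_path, name)  # e.g. copy to a remote directory
            os.remove(local_path)  # only one part occupies local disk at a time
            uploaded.append(name)
    return uploaded


# Placeholder usage (URL, tree name and step size are made up):
# batches = uproot.iterate("root://host.example//path/in.root:events",
#                          step_size="100 MB", library="np")
# process_in_parts(batches, upload=my_upload)
```

Local disk use is then bounded by the size of one part, at the cost of ending up with many remote files that have to be read back together.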

jpivarski commented 1 year ago

Yeah, we're not planning on supporting it, though there's a placeholder for it: the local file-based Sink is implemented as an abstraction that could (in principle) be replaced by a remote file Sink.

If it is implemented, I don't see a way to get reasonable performance for any format except RNTuple. TTrees, for instance, involve a lot of seeking forward and back to keep the file state valid. For a remote file, that means lots of round-trip communications, which is bad for high-latency environments. Maybe we could let the file state become invalid, but we do have to physically write some things because we can't keep it all in memory and dump it at the end. (Or if you can, make a memory file and send it all at once!)

For RNTuple, it's plausible, since all information that isn't known until a time $T$ in writing is written after the data written at times $t < T$. That is, all of the "number of entries," "where to find chunks," etc. are in a footer that gets written last or repeatedly rewritten. So the RNTuple part could plausibly be written well over a high-latency network, but it's embedded within traditional ROOT I/O that will still require some seeking back and forth.
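To make the difference concrete, here is a toy model of the two write patterns. It is not the real TTree or RNTuple layout, and `CountingSink` is not Uproot's Sink class; everything in it is invented for illustration. It counts how often a writer has to go back and overwrite earlier bytes, which over a network would each cost at least one extra round trip.

```python
import io
import struct


class CountingSink:
    """In-memory stand-in for a file sink that counts non-sequential writes."""

    def __init__(self):
        self._buffer = io.BytesIO()
        self.end = 0  # current length of the "file"
        self.rewrites = 0  # writes that start before the current end

    def write_at(self, offset, data):
        if offset < self.end:
            self.rewrites += 1
        self._buffer.seek(offset)
        self._buffer.write(data)
        self.end = max(self.end, offset + len(data))

    def append(self, data):
        offset = self.end
        self.write_at(offset, data)
        return offset

    def getvalue(self):
        return self._buffer.getvalue()


def write_header_first(sink, chunks):
    """Entry count lives at offset 0 and is patched after every chunk."""
    sink.append(struct.pack("<Q", 0))  # placeholder count
    for count, chunk in enumerate(chunks, start=1):
        sink.append(chunk)
        sink.write_at(0, struct.pack("<Q", count))  # seek back to keep it valid


def write_footer_last(sink, chunks):
    """Count and chunk offsets go into a footer appended after the data."""
    offsets = [sink.append(chunk) for chunk in chunks]
    footer = struct.pack("<Q", len(offsets))
    footer += b"".join(struct.pack("<Q", offset) for offset in offsets)
    sink.append(footer)


chunks = [b"x" * 10 for _ in range(5)]
header_sink, footer_sink = CountingSink(), CountingSink()
write_header_first(header_sink, chunks)
write_footer_last(footer_sink, chunks)
print(header_sink.rewrites, footer_sink.rewrites)  # prints "5 0"
```

In this model the header-first writer goes back once per chunk, while the footer-last writer only ever appends, which is the property that would make a remote Sink plausible.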

If we do implement remote Sinks, so that RNTuple can take advantage of them (and we let TTree be inefficient), we'd probably want to do it through https://github.com/CoffeaTeam/fsspec-xrootd to simplify the interface.

Bottom line: not planning on it, but we could change our plans, depending on how much RNTuple improves the situation.

(I'm going to make this a Discussion, rather than an Issue.)