uproot4 unable to open ROOT file with colons in the name that uproot3 can

matthewfeickert commented 4 years ago

Hi. I've come across an issue where if a ROOT file has a colon in the name of it uproot3 can open the file but uproot4 fails. I can't give you the file, but I can show you a minimal failing example and then a reproducible example with public files.

Minimal Failing Example

$ tree .
.
├── data-tree
│   └──  data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root
├── issue.py
├── requirements.txt

1 directory, 3 files
$ cat requirements.txt 
uproot
uproot4
$ docker run --rm -it -v $PWD:/data -w /data python:3.8 /bin/bash
root@510598a7f4e8:/data# python -m pip install --upgrade pip setuptools wheel
root@510598a7f4e8:/data# python -m pip install -r requirements.txt
root@510598a7f4e8:/data# python --version
Python 3.8.5
root@510598a7f4e8:/data# python -m pip list
Package        Version
-------------- -------
awkward        0.13.0
cachetools     4.1.1
numpy          1.19.1
pip            20.2.2
setuptools     49.6.0
uproot         3.12.0
uproot-methods 0.7.4
uproot4        0.0.18
wheel          0.35.1
root@510598a7f4e8:/data# cp data-tree/data16_13TeV\:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root data-tree/renamed.root
root@510598a7f4e8:/data# python
Python 3.8.5 (default, Aug  5 2020, 08:22:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot as uproot3
>>> import uproot4
>>> from pathlib import Path
>>> 
>>> uproot3_file = uproot3.open(
...     "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
>>> print(f"uproot3 opens file as {uproot3_file}")
uproot3 opens file as <ROOTDirectory b'/home/feickert/workarea/submitDir/data-tree//data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root' at 0x7fce6da5b5e0>
>>> 
>>> # uproot4 fails with the ':' in the filename
>>> uproot4.open(
...     "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
    self._source = Source(file_path, **self._options)
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
    self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # even if that is inside a pathlib object
>>> uproot4.open(
...     Path(
...         "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
...     )
... )
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
    self._source = Source(file_path, **self._options)
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
    self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # but the file itself is fine
>>> uproot4.open("data-tree/renamed.root")
<ReadOnlyDirectory '/' at 0x7fce725e7670>

Failing Reproducible Example

# issue.py
import uproot as uproot3
import uproot4
from pathlib import Path

def main():
    # curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
    uproot3.open("uproot-HZZ-lz4.root")
    uproot3.open("uproot:HZZ-lz4.root")
    uproot3.open(Path("uproot:HZZ-lz4.root"))

    uproot4.open("uproot-HZZ-lz4.root")
    uproot4.open("uproot:HZZ-lz4.root")

if __name__ == "__main__":
    main()

root@510598a7f4e8:/data# curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
root@510598a7f4e8:/data# cp uproot-HZZ-lz4.root uproot:HZZ-lz4.root 
root@510598a7f4e8:/data# python issue.py 
Traceback (most recent call last):
  File "issue.py", line 17, in <module>
    main()
  File "issue.py", line 9, in main
    uproot3.open("uproot:HZZ-lz4.root")
  File "/usr/local/lib/python3.8/site-packages/uproot/rootio.py", line 63, in open
    raise ValueError("URI scheme not recognized: {0}".format(path))
ValueError: URI scheme not recognized: uproot:HZZ-lz4.root

Comments

I realize that this is probably because uproot4's open's path is

https://github.com/scikit-hep/uproot4/blob/b8828069b9ae52c742fb704f07e1a4e179fe7b30/uproot4/reading.py#L43-L46

and it isn't a good idea to have a file with a colon in it in general. However, there are ATLAS files that do, and it would be nice if uproot3 behavior toward filenames could still be supported. If support for this is firmly out of scope it would be great if there could be some huge warning about this in the docs.

tamasgal commented 4 years ago

Btw. a minimal working example is also just

>>> import uproot4
>>> uproot4.open("foo:bar.root")

The problem is this line https://github.com/scikit-hep/uproot4/blob/b8828069b9ae52c742fb704f07e1a4e179fe7b30/uproot4/_util.py#L157

This simply assumes that whenever there is a colon, the first part is the path and the second part is the object path, which of course fails when there is a colon in the filename.

EDIT: I played around with the test suite and found for example the pattern file:/path/to/file : object path as seen here https://github.com/scikit-hep/uproot4/blob/b8828069b9ae52c742fb704f07e1a4e179fe7b30/tests/test_0016-interpretations.py#L84

This would be another case to cover inside file_object_path_split(). I am not sure if this pattern is used in the wild, but I think restricting it to file:// would probably be the better choice. But that's a minor issue.

Anyways, the file_object_path_split() needs to be tweaked to cover all the cases when accepting : in filenames and respect the optional object path.

If the object path is always specified with leading and trailing spaces (" : "), then I think it's fairly easy.

tamasgal commented 4 years ago

Funnily this tiny change let's quite a few tests pass but it obviously fails when the object path is appended without leading and trailing spaces around the :

    try:
        index = path.rindex(" : ")
    except ValueError:
        return path, None

Unfortunately there is an ambiguity here, unless we assume that every root file ends with .root.

tamasgal commented 4 years ago

I created a pull request (https://github.com/scikit-hep/uproot4/pull/80) which however comes with a grain of salt, so let's discuss further.

jpivarski commented 4 years ago

I was going to say that there's an option that simply turns off the colon-splitting, but I look now and I can't find it. I'm sure it was there. Anyway, that was the intention: that a boolean parameter, by default true, would specify whether the filename gets split into filesystem/URL path and in-ROOT party or not.

Also, the splitting character used to be a vertical bar | because this is a very implausible character for a filename, given the amount of trouble it would cause in the shell. The reason I changed it was because it's not a natural separator—nothing about it suggests that this should be the dividing line between paths. By contrast, a colon is used this way (in shell variable lists, such as PATH).

I switched to a colon because it's an illegal character for filenames in MacOS and (if I remember right) URLs. Unix filenames can have almost any character in them, but considering the use of colons in file lists like PATH, that seemed implausible. It also seemed unlikely in Windows due to drive letter confusion.

So the colon had the right guessable meaning and didn't seem like it could be common in real file names. But ATLAS file names have colons in them? That's an important case. Maybe we should switch back to vertical bars?

At the very least, there should be a boolean parameter that just turns off the splitting. The uproot4.open function is already documented, so in principle it's discoverable, and anyone who doesn't discover it can be told pretty quickly, "set split_path_with_bar=False." I'm surprised it's not there because I coulda sworn I added that. It's a two line change (adding the parameter default and the if statement).

On assuming ROOT files always end with ".root", that's a stricter retirement than not including colons. I'd rather not make that assumption.

tamasgal commented 4 years ago

Ah OK, with that additional option it's a bit better but still may cause confusion. I think switching to | is better since sticking to : and the current implementation means that at least in case of ATLAS files one could not load an object path via open(). Or do you consider adding an additional object_path= parameter?

Btw. : in filenames work on macOS (it was an illegal character on class Mac OS and on the old carbon layer of Mac OS).

jpivarski commented 4 years ago

Another possibility is to require paths-in-file to start with a slash (because they are absolute, after all), and then the real separator would be ":/", which works for ATLAS.

Or to only split with a colon (a convenience) if the filename ends with ".root", while not requiring filenames to always end with ".root". Sorry if I misunderstood that that was your actual suggestion.

It drop it entirely. This is an elaboration on the idea of deep getitem (in which you can write a single path that selects through the TDirectory and TBranch hierarchy), but maybe it's an overextension? Maybe it's not bad to require those extra quotes and braces. They were intended for quick navigation.

jpivarski commented 4 years ago

I've been thinking about it all morning. An object_path parameter would defeat the purpose: it's easier to type

uproot4.open("filename.root")["pathname"]

than it is to type

uproot4.open("filename.root", object_path="pathname")

While the file_object["dir1/dir2/branch1/branch2"] feature was requested, the uproot4.open("filename.root:dir1/dir2/branch1/branch2") feature was not requested—it was an extrapolation of that request.

Since there isn't a compelling reason to do otherwise, we should do the simplest thing of not having any colon/vertical bar syntax at all, since that would be the most guessable and wouldn't require reference to the documentation before getting started. Uproot 4 is still in a fluid enough state (no documentation yet) that I can remove something like that.

tamasgal commented 4 years ago

I fully agree ;) I first thought that the column-syntax has also some special shortcuts to get to the data source even more efficiently but looking at the code it seems to me that it's just an alternative to open()[]... but correct me if I am wrong!

jpivarski commented 4 years ago

That's right: it's only syntactic sugar. The same is true of file_object["dir1/dir2/branch1/branch2"], but that syntactic sugar was helpful because people were stumbling over file_object["dir1"]["dir2"]["branch1"]["branch2"]. I guess it's okay if the number of square brackets is reduce to one, rather than being reduced to zero.

jpivarski commented 4 years ago

Now that I'm going in and deleting it, I find that it really is important: interfaces that take TTrees from multiple files need a way to express both succinctly.

lazyarray = uproot4.lazy(["batch1/*.root : treename1", "batch2/*.root : treename2"])

for arrays in uproot4.iterate(["batch1/*.root : treename1", "batch2/*.root : treename2"]):
    ...

Without this, the object path could be a separate parameter (as it was in Uproot3, which was a stumbling block because uproot.iterate("batch1/*.root", "treename1") led people to forget the tree name) but then it can't be different for different files, as above. Allowing it to be different for different files would require 2-tuples or something, if there isn't a syntactic delimiter.

So it is more important than quick and easy typing in the uproot4.open case... I have to think more about it... I'm open to suggestions.

jpivarski commented 4 years ago

Maybe it's staring me in the face: the colon could be outside of quotation marks if the file name specifier is allowed to be a dict:

lazyarray = uproot4.lazy([{"batch1/*.root": "treename1"}, {"batch2/*.root": "treename2"])

but for the common case of all files having the same name, requiring people to know this weird syntax should be unnecessary:

lazyarray = uproot4.lazy(["batch1/*.root", "batch2/*.root"], "treename")

or maybe pick the only TTree if the file has only one TTree (not counting cycle number).

jpivarski commented 4 years ago

Also allowing

lazyarray = uproot4.lazy([{"batch1/*.root": "treename1", "batch2/*.root": "treename2"])

The fact that uproot4.open would accept dicts only if they have length 1 would be because uproot4.open is a special case that only opens one file. I'm trying it out and will post a few unit tests here as a suggestion.

tamasgal commented 4 years ago

I really like the dict approach, I think that's quite transparent.

However, providing more logic is probably too much (like picking the only TTree if there is only one), I think that Python provides enough to make such things easily expressible...?

With a dict approach one could also write a generator style

lazyarray = uproot4.lazy({fname: "treename"} for fname in ["batch1/*.root", "batch2/*.root])

or even simplifying it to a single dict:

lazyarray = uproot4.lazy({fname: "treename" for fname in ["batch1/*.root", "batch2/*.root]})

but notice that this is just one dict in contrast to your list of dicts example, so ordering is obviously not guaranteed for Python 3.5 and less, which might confuse one or the other. Anyways, that's just to throw in another point, the general gain is very low and the number of symbols to type is almost the same as above.

nsmith- commented 4 years ago

I don't think using the filename as a key in a dict is so great. What about instead a list of dicts with two specific keys: [{"file_path": fname, "object_path": "tree;1"}, ...]

jpivarski commented 4 years ago

I don't think using the filename as a key in a dict is so great. What about instead a list of dicts with two specific keys: [{"file_path": fname, "object_path": "tree;1"}, ...]

But then you'd have to remember the spellings of those keys: "file_path" and "object_path". They become part of the API that has to be remembered or looked up. If I'm in a LISPy mood, I'd be inclined to expect hyphens, rather than underscores, and maybe the whole second word is unnecessary. What you suggest would be great for a REST API, but it's pretty verbose for Python interaction.

Also, it prevents multiple files in one dict, since there can only be one "file_path" key.

jpivarski commented 4 years ago

These are tests of the dict method: https://github.com/scikit-hep/uproot4/blob/96e99bc09ac45b1c306b54855e1b36060f23e096/tests/test_0081-dont-parse-colons.py

Examples look like this:

one_file = uproot4.open("filename.root")
one_tree = uproot4.open("filename.root")["treename"]
one_tree = uproot4.open({"filename.root": "treename"})

many_trees = uproot4.lazy("filenames*.root")                 # if all files have exactly one TTree
many_trees = uproot4.lazy({"filenames*.root": "treename"})   # if they don't

many_trees = uproot4.lazy(["dir1/*.root", "dir2/*.root"])    # if all files have exactly one TTree
many_trees = uproot4.lazy({"dir1/*.root: "treename", "dir2/*.root": treename})       # if they don't
many_trees = uproot4.lazy([{"dir1/*.root: "treename"}, {"dir2/*.root": treename}])   # careful about order

many_trees = uproot4.lazy([already_open_tree1, already_open_tree2, ...])   # objects like one_tree above

I don't like magical things like guessing which TTree you mean, but it has been requested a user. (Don't ask me to find the reference: it's hard to search for these things.) However, it only guesses which TTree when there is exactly one in the file (in all nested TDirectories), and if two kinds of files are accidentally mixed, it would only be successful if the requested TBranch names also overlap, which would be an even bigger coincidence than having different TTrees with the same name. ("Events" is a very popular name.)

This reminds me that I was also asked (in Uproot 3, in the mists of time) to make iteration skip over any files that are missing the requested TTree. That's a bit more dangerous, but the motivation is to not have to weed out badly written files. (Not what I'd call a careful analysis, but maybe a first look at the data.) Rather than porting that feature into Uproot 4, I've made it optional: allow_missing=False is the default, but if anyone has that problem again, I can point them to this option so they can opt into a more relaxed view.

jpivarski commented 4 years ago

Since there have been a few differing suggestions about what to do here, I'm not going to merge PR #81 immediately. I'll wait a day or two to let others voice their opinions, to see if there is a majority view. (@henryiii, do you have an idea?)

We don't have a formal voting system, but I'll take your suggestions into account (as I did with a lot of the names in Awkward1, where I personally preferred fewer underscores).

henryiii commented 4 years ago

On a quick overview, I like the dict method that's just been merged. If one wanted to provide a colon based CLI interface, it would be easy to write:

# arg might be filename or filename:treename or filename:treename:
if len(x := arg.split(":",1)) > 1:
    arg = {x[0]: x[1]}

(which obviously wouldn't work on filenames with colons, but that would not be inside uproot, either)

matthewfeickert commented 4 years ago

But ATLAS file names have colons in them? That's an important case.

Well, I guess I should be more careful with my words so that I don't get anyone blamed for anything. It isn't that ATLAS software explicitly has colons in filepaths. However, one of the most widely used analysis frameworks that people in ATLAS have developed on their own is xAODAnaHelpers, and the following xAH_run.py example command

xAH_run.py \
    --config config/config.py \
    --files data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061 \
    --inputRucio \
    --nevents 10000 \
    --force \
    direct

results in an output file directory structure that includes (given the options in the config/config.py) a file at the path of

sumbitDir/data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root

So it can/does happen (@kratsg please jump in and correct anything I've misrepresented).

I'll wait a day or two to let others voice their opinions, to see if there is a majority view.

Thanks. I don't have very strong API views at the moment (but I'll try to actually think on it now). cc the rest of pyhf for this (@lukasheinrich, @kratsg) and also @alexander-held and @masonproffitt.

kratsg commented 4 years ago

fwiw, ROOTs TFile::Open seems to be able to handle colon in the filename and in the paths after the fact pretty well. You could split this up into trying to be smarter about things automatically:

check if file is valid path before splitting
check if colon split from the right is a valid path

then loudly complain you can't find the file. At least you handle both cases this way... but I'm still not really sure how many people use the :/path/to/obj functionality (it does exist in ROOT, but it doesn't seem like a common feature...)

jpivarski commented 4 years ago

I'm still not really sure how many people use the :/path/to/obj functionality (it does exist in ROOT, but it doesn't seem like a common feature...)

Really? I had no idea! I guess that strengthens the case that colon is a natural character to use—more easily guessable—if it was independently chosen. (Do they require the leading slash? That effectively makes the delimiter ":/", which is less likely to conflict in real paths—you'd need a directory to end with a colon.)

About checking both with and without object path to see if the file exists, the remote protocols (HTTP and XRootD) would require an extra round trip for that. Also, what if (rare case, admittedly) both exist?

Earlier today, I wanted to remove the colon thing altogether because simple is better than clever. However, there are those functions that take many files, or more specifically, many file/tree combinations. The driving consideration is finding a syntax that simplifies the specification of many files/trees, and the single file/tree should be a special case of that. The dict thing works, but maybe it's strange looking or not guessable? If there are better ways to express many files/trees, let me know.

alexander-held commented 4 years ago

I find the existing colon functionality very convenient. One use case I have in mind are frameworks for histogram production. Users need to specify where all the inputs to their histograms are found. The existing syntax allows to do that in just one string, users could e.g. write

mc_samples/{SAMPLE_TYPE}.root:{SYSTEMATIC_VARIATION}

and specify a list of sample types and systematic variations to fill in the placeholders. If the paths inside and outside the ROOT file need to be specified separately, two such strings would be needed. Having it all in one string also reinforces the idea that for the structure of the data it does not matter much whether something is in a directory or a TDirectory.

If one takes the above example but puts everything into a big root file, the string would look like

mc_samples.root:{SAMPLE_TYPE}/{SYSTEMATIC_VARIATION}

and it is quickly evident that the structure of content is the same. This is harder to parse when split up.

Since it came up above: I've had to deal with file structures before where folder names ended with .root (and those folders contained ROOT files), but due to the name the folders mistakenly were interpreted as a ROOT file (this was with ROOT, not uproot). Relying on this to parse might not always work.

kratsg commented 4 years ago

Ok, it seems I've mispoken and what I'm remembering is a feature that hadd supports but that TFile and TFile::Open does not:

hadd target_path.root:/pathtomerge in1.root in2.root

will only hadd-merge everything under that path. I tried googling to find an actual example, but alas, cannot... In any case, the TFile documentation indicates that they still strongly recommend suffixing with .root at the end of the file path... not sure why, but they rely on TUrl to parse all file paths given.

Anyway, if it were me, I would want pathlib.Path(...) to be treated as a file path without parsing for the colon to get around this, rather than treating it as a string (which the current behavior seems to be).

jpivarski commented 4 years ago

There have been quite a few comments about the colon syntax being useful; one on Slack. Maybe it needs to be reinstated. (The syntax issues with URLs and Windows drive letters have been resolved; what remains is @matthewfeickert's issue about it being an unpleasant surprise when you don't know about it.)

What do people think about allowing the colon syntax but requiring the object path to start with a /, and hence the delimiter is effectively :/, rather than just :?
What about allowing both strings/Paths, which parse colons, and the dict syntax shown above, which do not? The dict syntax could then be the way to escape colon parsing, instead of parse_colons=False parameter.

I can also add a detailed explanation of the syntax in all "file not found" error messages (local files, HTTP, etc.). If the colon parsing fails, the very likely error will be file not found, so that's when and how users can be informed. (Again, we don't assume everybody's read the documentation!)

henryiii commented 4 years ago

By the way, unrelated, but I finally understand why vi my_broken_file.py:123 doesn't automatically try to open the file and go to line 123 (+123 does, FYI); it's a valid filename on some systems.

I like option 2, I think.

jpivarski commented 4 years ago

I like option 2, I think.

Which one is option 2?

henryiii commented 4 years ago

Colon parsing by default, but not in the dict syntax. Assumes that colons in filenames are somewhat rare, which I expect they are?

kratsg commented 4 years ago

Colon parsing by default, but not in the dict syntax. Assumes that colons in filenames are somewhat rare, which I expect they are?

rucio relies on scopes for filenames, and scopes are separated with colons: mc15_13TeV:{filename} and when downloading, will include the scope as well. Dataset discovery relies on these scopes.

kpedro88 commented 4 years ago

Repeating comment from Slack: I do use :/path/to/tree fairly often (when the CMSSW TFileService is used to create a TTree, that TTree resides in a TDirectory).

jpivarski commented 4 years ago

Assumes that colons in filenames are somewhat rare, which I expect they are?

I made the same assumption, based on some web-research. I saw somewhere that colons aren't allowed at all in MacOS filenames, but @kratsg said that this restriction only applied to MacOS 9 and below, and possibly OSX in a compatibility mode.

(I remember MacOS 2‒9 used : instead of / as a delimiter between directories, so the character was not allowed for the same reason that / is not allowed in filenames of modern filesystems. MacOS 1 didn't have directories.)

Well, I guess I should be more careful with my words so that I don't get anyone blamed for anything. It isn't that ATLAS software explicitly has colons in filepaths. However, one of the most widely used analysis frameworks that people in ATLAS have developed on their own is xAODAnaHelpers...

...which leads naturally to colons in filenames, and that's how this all came up. @kratsg just mentioned above that Rucio can put colons in filenames, so we'll probably start seeing it in CMS, too.

I'm leaning toward this "option 2" as well. I think the main open question for me is whether to require a / after the :, such that :/ is the true delimiter.

Probably the most important thing is that the precise rules are printed to the screen in every "file not found" error message, so that people can adjust after a first attempt.

masonproffitt commented 4 years ago

Colon parsing by default, but not in the dict syntax. Assumes that colons in filenames are somewhat rare, which I expect they are?

rucio relies on scopes for filenames, and scopes are separated with colons: mc15_13TeV:{filename} and when downloading, will include the scope as well. Dataset discovery relies on these scopes.

Right, this is exactly what I'm familiar with in dealing with my own analysis files. I think it's important for a filename like foo:bar.root to work in the standard string syntax. This is a very common case (for ATLAS at least). I like dropping the colon parsing entirely myself, but I think requiring :/ is probably still good enough.

jpivarski commented 4 years ago

Since the colon parsing is a convenience and gaps in many-file functions can be filled in with the dict syntax, how about if it only works in limited circumstances, such as after .root? (@tamasgal originally made this suggestion.) So,

/path/to/filename.root:path/to/ttree picks out an object at path/to/ttree (requiring .root makes requiring a leading / in the object path superfluous, since it's already quite unambiguous, and it's a potential "gotcha")
/path/to/file:name.root would be interpreted as a file, not a "name.root" object path within a file, since "file" doesn't end in ".root"
/path/to/file:name.root:path/to/ttree also picks out an object at path/to/ttree from a file named file:name.root; earlier colons are part of the filename
use the dict syntax if you really need to ensure that colons are never parsed; {"/path/to/file.root:haha-this-is-the-filename.root": None} won't parse the colon and won't pick out an object within the file because of the None value
use the dict syntax if you need to pick out objects and your files do not end with .root, or you're using wildcards that don't end in .root (a likely example: /path/to/directory/*). This is for functions that need to find many files, such as uproot4.iterate, uproot4.concatenate, and uproot4.lazy.

The colon parsing will be much simpler, based on the regex \.root\s*:\s*, because I'll never need to worry about Windows drive letters or URL schemas with the \.root disambiguator.

Thumbs-up this comment if you're in favor, write another comment below if you're not; I'm going to start coding up this solution and if we're in general agreement, I'll merge it.

masonproffitt commented 4 years ago

There's still a problem with requiring .root:. Another thing that I often see (again from rucio) is filenames ending with a number like .root.1. For a random public example, see https://twiki.cern.ch/twiki/bin/view/Sandbox/GridNotes.

kratsg commented 4 years ago

@jpivarski , I think you shouldn't change the parsing too much to try and support so much. What I think should happen is the following:

str or bytes passed in: be "smart" and parse out colons and figure out what the user wants
pathlib.Path -- assume this is openable via path.open() instead of trying to convert to string and parsing
file pointer (an already opened file) -- just load that up, no need to call open etc...

Is this not at all possible to do? Then if users really start having more complicated files that, for whatever reason, is borking the magic of uproot, then can be more specific step by step, going from str to Path to a filepointer object. The nice thing with that last option, I hope, is that one can use BytesIO as input as well...

jpivarski commented 4 years ago

For files ending in .root.1, etc., I would have thought you're out of the easy case and have to start using dicts. The error message would say that. Would that be a big imposition?

I hadn't considered having different behavior for str/bytes than for pathlib.Path. Would that be more expected?

Also, on accepting an already open Python file-handle (loosely interpreted as anything with read and seek, like io.BytesIO), Uproot has never had this capability, but it would be possible. It would exclude parallel reading of baskets, but you get what you pay for. That could become a new Source subclass, but it's beyond the scope of this issue.

Since there are no thumbs up on the previous proposal and two objections thereafter, how about the following:

No consideration for whether a .root substring is found.
If a str or bytes is passed, do the colon parsing as before (many people have said that they want it, though it can be a stumbling block, as in @matthewfeickert's case that started this thread).
If a pathlib.Path is passed, do not do colon parsing, which is a good way of enforcing what you want in uproot4.open, but would be limiting in functions that need to open many file-tree combinations, such as uproot4.iterate, uproot4.concatenate, and uproot4.lazy.
Support the dict syntax, which also does not do colon parsing, since the key-value separation distinguishes between file path and object path. This is the most general way to express a file path and object path, particularly useful for expressing many flie-tree combinations, and its limit for uproot4.open is a length-1 dict.
All "file not found" errors explain the rules clearly with examples.

What does everybody think of this? Thumbs up or comment: thanks!

masonproffitt commented 4 years ago

Is there something wrong with using :/ as the delimiter? That seems like the easiest way to (I think) satisfy everyone.

In any case, I'm confused as to what the dict syntax is when you just want to open a file with a colon in the filename. Is it:

uproot.open({'foo:bar.root': None})?
uproot.open({'foo:bar.root': ''})?
uproot.open({'foo:bar.root': '/'})?

or some/all of the above? Yikes. Actually, I guess it can't even be '', since that's a valid key. My problem here is that this suddenly makes something as simple as opening a file more confusing and results in inconsistent behavior between different filenames. It's very not beginner-friendly, as you now need to know about dictionaries and their syntax and a bit about TDirectory before you can even look at anything in the ROOT file.

jpivarski commented 4 years ago

(I've actually been distracted by other work, so just getting back to this now.)

One difficulty with optimizing usability is that "what is natural" or "what a novice might try" is rather speculative. I would expect novices (including me, approaching a system I don't know) to get tripped up by the requirement that the object path has to be expressed as absolute, i.e. that it must start with :/blah and not just :blah, even though it is always going to be evaluated relative to /. (This was confusing in the file URI scheme, for example.) I'm not of a strong opinion about this point, though, because I think the error message is where the wide array of users' guesses as to how it works will get narrowed down to how it actually does work—assuming users read error messages. (Sometimes a problem, but not a deep one if they copy-paste the error message into the bug report for me to read.)

The other difficulty with optimizing usability is that expectations among advanced users become necessities. I hadn't known (I guessed right) that the colon syntax was an expected feature, and so losing it would be a stumbling block for users who expect it from ROOT.

But the thing that's driving the dict syntax is neither of the above. It's driven by the functions that access many file-object (actually, file-TTree) pairs. They need a way—not a convenient way but any way—to express combined filesystem + ROOT object paths for hundreds or thousands of files. If we could have used the colon syntax for that, then that would have been great. But some filesystem paths have colons in them. In principle, a directory name could end with a colon, which would get incorrectly identified as :/. So the many file-object functions need to have a backup syntax for when the string syntax isn't expressive enough, and that's what the dicts are. Only one other suggestion has been made that would be as unambiguous, but it's pretty verbose and introduces more things to remember.

So novice users do not need to know about the dict syntax; it only comes up if they can't express a group of filenames + object paths because the filenames have colons in them, and then it comes up as an error message because when they tried to do the naive thing, it couldn't find the file.

I'm going to implement the last proposal I made. I considered not porting the dict syntax from the functions that need it (uproot4.iterate, uproot4.concatenate, uproot4.lazy) to the function that doesn't really need it (uproot4.open, for which only a length-1 dict would be allowed) because the "file not found" error messages don't know whether they're from an attempt to open a single file or many files, and tracking that provenance through everything so that they can give context-dependent error messages doesn't seem worthwhile. Making the same syntax apply to all functions equally, even if it's not really needed for one of them, is simpler.

In any case, I'm confused as to what the dict syntax is when you just want to open a file with a colon in the filename. Is it:

uproot4.open({'foo:bar.root': None})?

uproot4.open({'foo:bar.root': ''})?

uproot4.open({'foo:bar.root': '/'})?

or some/all of the above?

I think the first and third would work; the second wouldn't for the reason you mentioned. But the preferred way to do it, following @kratsg's advice about pathlib.Path, is

uproot4.open(pathlib.Path("foo:bar.root"))

No colon parsing would be performed on a pathlib.Path, and since this is uproot4.open, there's no reason to bring in any fancy dicts because you don't have to specify a TTree path. For uproot4.iterate, uproot4.concatenate, uproot4.lazy, you do have to specify TTrees because you're iterating over, concatenating, or building lazy arrays from TTrees, not files. There wouldn't be a good reason to make the value in a dict None, though it would work, for consistency's sake.

The error message should not only say what's possible, but should say what's preferred. Now I'm finally going to go and implement that.

jpivarski commented 4 years ago

In the end, the error messages look like this:

FileNotFoundError: file not found

    '/really/long/path/that:in:the:end/is/not-a-file.root'

Files may be specified as:
   * str/bytes: relative or absolute filesystem path or URL, without any colons
         other than Windows drive letter or URL schema.
         Examples: "rel/file.root", "C:\abs\file.root", "http://where/what.root"
   * str/bytes: same with an object-within-ROOT path, separated by a colon.
         Example: "rel/file.root:tdirectory/ttree"
   * pathlib.Path: always interpreted as a filesystem path or URL only (no
         object-within-ROOT path), regardless of whether there are any colons.
         Examples: Path("rel:/file.root"), Path("/abs/path:stuff.root")

Functions that accept many files (uproot4.iterate, etc.) also allow:
   * glob syntax in str/bytes and pathlib.Path.
         Examples: Path("rel/*.root"), "/abs/*.root:tdirectory/ttree"
   * dict: keys are filesystem paths, values are objects-within-ROOT paths.
         Example: {"/data_v1/*.root": "ttree_v1", "/data_v2/*.root": "ttree_v2"}
   * already-open TTree objects.
   * iterables of the above.

scikit-hep / uproot5