hsf-training / hsf-training-scikit-hep-webpage

https://hsf-training.github.io/hsf-training-scikit-hep-webpage/
Other
9 stars 14 forks source link

Basic file I/O with Uproot -> Writing TTrees -> different branch types #46

Open JacekHoleczek opened 11 months ago

JacekHoleczek commented 11 months ago

In the "Basic file I/O with Uproot" chapter, in the "Writing TTrees" section, you show two ways to create/write trees. However, their "x" branches will be different.

This creates an "x/L" branch:

output1["tree1"] = {"x": np.random.randint(0, 10, 1000000), "y": np.random.normal(0, 1, 1000000)}

This creates an "x/I" branch:

output1.mktree("tree2", {"x": np.int32, "y": np.float64})

Maybe it would be a good idea to use:

output1.mktree("tree2", {"x": np.int64, "y": np.float64})

BTW. You do not tell people to execute output1.close() in the end.

welcome[bot] commented 11 months ago

Thanks for opening your first issue here đź’–! If you have any questions, feel free to mention one of the conveners, previous contributors, or attend our weekly meeting (see https://hepsoftwarefoundation.org/workinggroups/training.html). Also, sometimes issues go unnoticed, so don't hesitate to @mention some of us, if we do not come back to you within a few days.

jpivarski commented 11 months ago

Actually, randint probably creates int32 on Windows, but I get your point.

If the branch type (int32) doesn't match the array type, Uproot will convert it. This is stealthily demonstrating that. It could be pointed out explicitly, though it might be too much information.

About output1.close(), it's good practice to always put open files in a with statement in scripts, but that's more cumbersome when working interactively. (This notebook demonstrates an interactive conversation.) Uproot's file-writing has been implemented in such a way that every statement that writes (assignment into a TDirectory, mktree or mkdir, or extend on a TTree) also flushes, so explicitly closing is a formality. (The OS might not have put the data physically on disk, but it presents a view to other processes as though it has.) Thus, it's safe to skip output1.close() for Uproot, though it may be encouraging bad practice for other libraries.

Again, explaining all of that in an introductory tutorial would probably be too much information. I don't have an idea of what is the best pedagogy for these two things, array-type conversion and file closing, whether it's better to explicitly convert array types and explicitly close files, encouraging the assumption that it is necessary to do these things, or add a comment about the fact that they're not necessary, potentially burying the presentation in too many details, or what.

JacekHoleczek commented 11 months ago

Actually, randint probably creates int32 on Windows, but I get your point.

That's a real problem because it creates "incompatible" trees. One should add there an explicit paragraph about it. It should warn people that branches created with "automatically deduced" types may result in different ROOT/C++ types when the same Python macro is used on different machines (which is not the case if one explicitly uses numpy types).

What concerns the output1.close() ... in the" Opening the file" section, you write:

To open a file for reading, pass the name of the file to uproot.open. In scripts, it is good practice to use Python's with statement to close the file when you're done, but if you're working interactively, you can use a direct assignment.

But then you only "open" it, and you do not show how to "close" it in the end (when working interactively).

Of course, if you are sure that uproot always keeps the ROOT file "fully updated", it is unnecessary. My remaining concern is ... what happens if the Python macro is killed (e.g., due to a Ctrl-C or a segmentation violation) ... will the (open) ROOT file be left "intact"?

jpivarski commented 11 months ago

Uproot does convert the data to what was requested in the uproot.WritableDirectory.mktree declaration:

>>> import numpy as np
>>> import uproot
>>> outfile = uproot.recreate("/tmp/whatever.root")
>>> tree2 = outfile.mktree("tree2", {"x": np.int32, "y": np.float32})
>>> tree2.extend({"x": np.random.randint(0, 10, 1000000), "y": np.random.normal(0, 1, 1000000)})
>>> tree2.show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
x                    | int32_t                  | AsDtype('>i4')
y                    | float                    | AsDtype('>f4')

So the mktree-extend method (which requires more boilerplate to set up) provides more control over the output. For instance, it's a better protection against this platform dependency of Windows's NumPy using int32 as its default integer type.

That could be worth a mention, but an introductory tutorial can't get into all of the details without overwhelming first-timers. Perhaps a more efficient way to give new users a good working knowledge is to refer to the

>>> outfile["tree1"] = {"x": np.random.randint(0, 10, 1000000), "y": np.random.normal(0, 1, 1000000)}
>>> outfile["tree1"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
x                    | int64_t                  | AsDtype('>i8')
y                    | double                   | AsDtype('>f8')

method as "quick and dirty" and the mktree-extend method as the slow, careful way.

This can also apply to opening and closing files using the with syntax. Explicitly closing files with the close method is an antipattern in Python because it usually allows for code paths in which the file doesn't actually get closed—for instance, if there's an exception before reaching the close.

For some file I/O libraries, this can leave output files in an invalid state, but one of Uproot's deliberate features is that all file-state changing happens in the writing functions (e.g. outfile.mktree or outfile["tree1"] = ... above) and not in the close function. If the ROOT file format required some footer that could only be written once (like Parquet), then the close function would be vital for putting the file in a valid state, but the ROOT format is not like that.

Therefore, for Uproot, the only danger from not closing files is the possibility of running out of open file handles. That applies equally to file handles for reading and file handles for writing.

My remaining concern is ... what happens if the Python macro is killed (e.g., due to a Ctrl-C or a segmentation violation) ... will the (open) ROOT file be left "intact"?

If the Python process ends due to Ctrl-C, a Python exception, or a segfault, then the file that is being written to will be in a valid state if and only if that Ctrl-C, Python exception, or segfault happened between file-writing statements. If it happened in the middle of a file-writing statement, then there's no guarantee, though we do try to make updates in a way that will keep it valid for as long as possible. For instance, if we're adding a new histogram or TTree to the file, we'll write the data for the histogram or TTree first, then declare that block valid in ROOT's TFreeSegments so that ROOT recognizes it as good data, and then add it to the TDirectory. If writing is interrupted partway through this process, you won't see the histogram or TTree in the TDirectory, try to read it, and get garbage; you'll either not see it in the TDirectory or it will be fully formed. But while this is as cautious as possible, some of these operations, like updating a TDirectory, are not atomic and it's still possible to be interrupted in the middle and get an invalid file. For instance, to add a TKey to a TDirectory, hundreds of bytes must be written, and the file-writing could be interrupted in the middle of this step.

So the only categorical statement I can make is that the file is in a valid state if the process is interrupted between file-writing statements. If, for instance, outfile["tree1"] = ... finishes, then Python goes on to do some stuff that is unrelated to updating that file and gets killed somehow, the ROOT file would be in a valid state. Every file-writing statement in Uproot ends by calling file.flush, which gives the operating system responsibility for all the bytes we passed to file.write, and the operating system has to honor that by eventually getting those bytes to disk even if the process is killed.

(The argument against frequent calls to file.flush is that it's a performance cost, but we do it at the end of the user-written Python statement, which hides that latency because Python is slow. I think it's likely that ROOT itself does not flush at the end of each of its file-updating statements, but that would be because those are C++ functions that might be used in a loop that needs to run fast. We lose speed for being in Python; we can trade speed for safety.)

JacekHoleczek commented 11 months ago

I wasn't clear again, sorry . Yes, I meant the situations when "the process is interrupted between file-writing statements". ROOT itself has a dedicated TFile:Recover method (so it should be able to recover a file if not correctly closed).

JacekHoleczek commented 11 months ago

BTW. Of course, in the tutorial, both methods of creating branches should be shown. However, I really think one also needs there a dedicated paragraph (in the tutorial) that explicitly warns people that branches created with "automatically deduced" types may result in different ROOT/C++ types when the same Python macro is used on different machines (which is not the case if one explicitly uses numpy types).

stale[bot] commented 9 months ago

This issue or pull request has been automatically marked as stale because it has not had recent activity. Please manually close it, if it is no longer relevant, or ask for help or support to help getting it unstuck. Let me bring this to the attention of @klieret @wdconinc @michmx for now.