klieret closed this issue 2 years ago.
This HTTP server requires some special care: it's sending unrecognized responses. You don't see that when loading all 50 fields in one go, because the failures are hidden behind attempts to fall back on different methods and retry, before the whole thing times out.
>>> import uproot
>>> f = uproot.open("https://rebrand.ly/00vvyzg")
>>> f.keys()
['b0phiKs;2', 'b0phiKs;1']
>>> t = f["b0phiKs"]
>>> f.file.source
<HTTPSource '...ly/00vvyzg' at 0x7f83f02755b0>
No problem so far because opening a file and navigating through it only ever asks for one chunk of the file at a time. (That is, a directory listing is in bytes X through Y, so Uproot asks the HTTP server for those bytes, waits until it gets them, then uses what it gets—all synchronously.)
When reading an array, the first attempt gives the HTTP server a list of possibly discontiguous byte ranges, the locations of all the TBaskets, and asks for them all to be returned in one response. This is a "multi-part GET" in HTTP, and not all servers support it. (Not attempting this was a big inefficiency in Uproot 3.)
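As an illustration of what a multi-part GET looks like on the wire, here is a minimal sketch (not Uproot's actual code) of packing several discontiguous byte ranges into a single HTTP Range header; the ranges are made up:

```python
# Sketch (not Uproot's code): build an HTTP "Range" header for several
# discontiguous byte ranges, as a multi-part GET does.

def range_header(ranges):
    """ranges: list of (start, stop) with stop exclusive, as in Python slices."""
    return {"Range": "bytes=" + ", ".join(
        f"{start}-{stop - 1}" for start, stop in ranges  # HTTP ranges are inclusive
    )}

print(range_header([(0, 100), (5000, 5200), (90000, 90064)]))
# {'Range': 'bytes=0-99, 5000-5199, 90000-90063'}
```

A server that supports this replies with Content-Type multipart/byteranges, one part (with its own headers) per requested range; a server that doesn't typically returns a single range or the whole file, which is what produces the "found 1 of 27 expected headers" error below.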
Let's start with just one TBranch:
>>> t["B0_M"].array()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "/home/jpivarski/irishep/uproot4/src/uproot/source/http.py", line 360, in task
resource.handle_multipart(source, futures, results, response)
File "/home/jpivarski/irishep/uproot4/src/uproot/source/http.py", line 423, in handle_multipart
raise OSError(
OSError: found 1 of 27 expected headers in HTTP multipart
for URL https://rebrand.ly/00vvyzg
This server does not support multi-part GETs, but I don't know why Uproot hasn't switched over to using its fallback "one HTTP volley per TBasket" mode yet:
>>> f.file.source
<HTTPSource '...ly/00vvyzg' at 0x7f83f02755b0>
(doesn't say "with fallback").
I tried reading a small number of entries:
>>> t["B0_M"].array(entry_stop=5)
<Array [5.02, 5.11, 5.12, 5.36, 5.3] type='5 * float32'>
Maybe that worked because these are all in the first TBasket, so only one byte range is being requested from the HTTP server? How many baskets are there and how big is the first one?
>>> t["B0_M"].num_baskets
28
>>> t["B0_M"].basket_entry_start_stop(0)
(0, 7981)
So the first 7981 entries are all in the first TBasket. Reading up to that point should be no problem.
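The entry-range-to-TBasket arithmetic can be sketched with a binary search over basket start entries. The boundaries below are hypothetical stand-ins for what basket_entry_start_stop would report (only the first, 0 to 7981, comes from above), and baskets_needed is an illustrative helper, not an Uproot API:

```python
import bisect

# Hypothetical basket start entries: the first basket holds entries [0, 7981);
# the later boundaries are made up for illustration.
basket_starts = [0, 7981, 15962, 23943]

def baskets_needed(entry_start, entry_stop):
    """Indices of the baskets overlapping the half-open entry range."""
    first = bisect.bisect_right(basket_starts, entry_start) - 1
    last = bisect.bisect_left(basket_starts, entry_stop)  # exclusive
    return list(range(first, last))

print(baskets_needed(0, 7981))  # [0]    -> one byte range, one-part GET
print(baskets_needed(0, 7982))  # [0, 1] -> two ranges, multi-part GET
```

One basket means one byte range and hence a one-part request; crossing a basket boundary is what forces the multi-part GET.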
>>> t["B0_M"].array(entry_stop=7981)
<Array [5.02, 5.11, 5.12, ... 5.18, 5.45, 5.15] type='7981 * float32'>
Presumably, we should start to see the problem when reading past that point, which asks for two TBaskets (discontiguous byte ranges in the file). But it doesn't:
>>> t["B0_M"].array(entry_stop=7982)
<Array [5.02, 5.11, 5.12, ... 5.45, 5.15, 5.14] type='7982 * float32'>
>>> t["B0_M"].array(entry_stop=14000)
<Array [5.02, 5.11, 5.12, ... 5.1, 5.07, 5.3] type='14000 * float32'>
>>> t["B0_M"].array(entry_stop=28000)
<Array [5.02, 5.11, 5.12, ... 5.13, 5.37, 5.15] type='28000 * float32'>
Even if we read all 28 TBaskets in this TBranch:
>>> t["B0_M"].array()
<Array [5.02, 5.11, 5.12, ... 5.25, 5.25, 5.26] type='329135 * float32'>
But that's because Uproot's HTTP client is already in fallback mode:
>>> f.file.source
<HTTPSource '...ly/00vvyzg' with fallback at 0x7f83f02755b0>
I shut down Python and tried again, and it seems that Uproot correctly goes into fallback mode if entry_stop is given but not if it isn't. That's an Uproot bug.
But the bigger issue for you is that you can't use multi-part GET with this HTTP server. Knowing that, you can go directly into fallback mode: uproot.HTTPSource attempts the multi-part GET with a fallback, and uproot.MultithreadedHTTPSource is that fallback: it launches a suite of worker threads and has each one open a single HTTP connection. Too few workers and you'll be latency-dominated, waiting for the server to respond individually for each TBasket of data; too many workers and you'll clog the server, which will get all the requests at once and need to manage that. Both of these considerations depend on the quality of the network between you and the server and on the quality of the server itself, but you can tune num_workers. (See uproot.open. If you're lucky and most of the directory data and TTree metadata is at the beginning of the file, you can increase begin_chunk_size to preemptively read more and, with luck, not have to make several volleys to get directory data.)
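A rough sketch of the "one HTTP volley per TBasket" idea, using a thread pool with a stub in place of the real HTTP request (fetch_range is invented for illustration; it is not Uproot's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for one single-range HTTP request (one "volley" per TBasket).
def fetch_range(byte_range):
    start, stop = byte_range
    return b"x" * (stop - start)  # pretend these are the requested bytes

ranges = [(0, 100), (5000, 5200), (90000, 90064)]

# num_workers trades latency (too few) against server load (too many).
with ThreadPoolExecutor(max_workers=10) as pool:
    chunks = list(pool.map(fetch_range, ranges))

print([len(c) for c in chunks])  # [100, 200, 64]
```

Each worker holds its own connection, so with N workers up to N single-range requests are in flight at once instead of one multi-part request.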
>>> import uproot
>>> f = uproot.open("https://rebrand.ly/00vvyzg", http_handler=uproot.MultithreadedHTTPSource, num_workers=10)
>>> f.file.source
<MultithreadedHTTPSource '...ly/00vvyzg' (10 workers) at 0x7ff5438545b0>
>>> t = f["b0phiKs"]
>>> t["B0_M"].array()
<Array [5.02, 5.11, 5.12, ... 5.25, 5.25, 5.26] type='329135 * float32'>
>>> import time
>>> starttime = time.time(); df = t.arrays(library="pd"); print(time.time() - starttime)
160.00439739227295
>>> df
exp_no run_no evt_no B0_M ... B0_hoo3 B0_hoo4 nCands iCand
0 0 0 23045033 5.024452 ... 0.004035 -0.003992 1 0
1 0 0 23046788 5.107935 ... -0.001334 0.000621 2 0
2 0 0 23046788 5.119214 ... -0.001160 -0.000464 2 1
3 0 0 23046989 5.361363 ... -0.002333 0.019902 1 0
4 0 0 23049068 5.301050 ... 0.002522 0.001435 1 0
... ... ... ... ... ... ... ... ... ...
329130 0 0 71575996 5.093135 ... -0.007649 0.031542 1 0
329131 0 0 71646148 5.096963 ... -0.000318 0.004670 1 0
329132 0 0 71647975 5.250312 ... 0.000000 0.057552 1 0
329133 0 0 71649052 5.250185 ... 0.017649 0.021859 1 0
329134 0 0 71649130 5.262949 ... 0.003130 0.004816 1 0
[329135 rows x 50 columns]
Two and a half minutes for me. I watched this with iftop and saw all the traffic coming from desycloud.desy.de. Oh, hi Kilian!
Hi @jpivarski ! Wow, thanks a lot for taking me through the whole journey of getting to the bottom of this! This has been a very interesting read :)
Also good to know of the limitations of this server! It's not really used for data files usually (it's only used in this tutorial to make it more portable for users who don't have their official accounts yet), and now we know it really shouldn't be!
> it seems that Uproot correctly goes into fallback mode if entry_stop is given but not if it isn't. That's an Uproot bug.
If entry_stop is given and is small enough, then the request is only one TBasket and therefore one part of a multi-part GET. This HTTP server responds correctly to that case. So the above-mentioned bug can be triggered by entry_stop=None or an entry_stop larger than the first TBasket, which is 7981 for "B0_M".
Instead of raising an error during the handling of the HTTP response, now we just switch over to the fallback.
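The shape of that fix can be sketched as a try/except that flips a fallback flag instead of propagating the error. All functions here are stand-ins for illustration, not the code from the actual patch:

```python
def get_multipart(ranges):
    # Stand-in server behavior: can't serve more than one range per request.
    if len(ranges) > 1:
        raise OSError("found 1 of %d expected headers in HTTP multipart" % len(ranges))
    return [b"data"] * len(ranges)

def get_one_at_a_time(ranges):
    # Fallback: one single-range request per TBasket.
    return [b"data" for _ in ranges]

def get_chunks(source_state, ranges):
    if not source_state.get("fallback"):
        try:
            return get_multipart(ranges)
        except OSError:
            source_state["fallback"] = True  # remember; don't raise to the user
    return get_one_at_a_time(ranges)

state = {}
print(get_chunks(state, [(0, 10), (20, 30)]))  # served by the fallback
print(state)  # {'fallback': True}: later reads skip multi-part entirely
```

Once the flag is set, every subsequent read goes straight to the fallback, which is why the source repr gains "with fallback".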
I was able to read your data, but it's hard to put this into a testing suite, since it relies on an external site. (And it relies on the external HTTP server not being improved in the future.)
Fixed in #594.
We use this file in our python training (~50 MB, ROOT file).

With uproot3: returns a dataframe in ~10s.
With uproot4: times out unsuccessfully after about a minute.

Loading the file from disk after wget-ing it works fine in both cases. You can also use the link https://syncandshare.desy.de/index.php/s/TkCSBQq3QHprFxR/download (that's what the shortened URL points to) instead to get the same results.

uproot3 version: 3.14.4
uproot4 version: 4.0.7
OS: Ubuntu 18.04 bionic

Unfortunately I don't have time to investigate this further on my own right now, but it would already be interesting if this can be reproduced by others (I tested it with two independent installations).