scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

HTTPSource doesn't always switch over to fallback mode. #507

Closed klieret closed 2 years ago

klieret commented 2 years ago

We use this file in our python training (~50 MB, ROOT file).

With uproot3:

import uproot3
uproot3.open("https://rebrand.ly/00vvyzg")["b0phiKs"].pandas.df()

returns a dataframe in ~10s.

With uproot4:

import uproot
uproot.open("https://rebrand.ly/00vvyzg")["b0phiKs"].arrays(library="pd")

times out after about a minute.

Loading the file from disk after downloading it with wget works fine in both cases.

You can also use the link https://syncandshare.desy.de/index.php/s/TkCSBQq3QHprFxR/download (that's what the shortened URL points to) instead to get the same results.

uproot3 version: 3.14.4
uproot4 version: 4.0.7
OS: Ubuntu 18.04 bionic

Unfortunately I don't have time to investigate this further on my own right now, but it would already be interesting if this can be reproduced by others (I tested it with two independent installations).

jpivarski commented 2 years ago

This HTTP server requires some special care: it's sending unrecognized responses. You don't see that when loading all 50 fields in one go—failures are hidden by attempts to fall back on different methods and try again, before the whole thing times out.

>>> import uproot
>>> f = uproot.open("https://rebrand.ly/00vvyzg")
>>> f.keys()
['b0phiKs;2', 'b0phiKs;1']
>>> t = f["b0phiKs"]
>>> f.file.source
<HTTPSource '...ly/00vvyzg' at 0x7f83f02755b0>

No problem so far because opening a file and navigating through it only ever asks for one chunk of the file at a time. (That is, a directory listing is in bytes X through Y, so Uproot asks the HTTP server for those bytes, waits until it gets them, then uses what it gets—all synchronously.)

When reading an array, the first attempt gives the HTTP server a list of possibly discontiguous byte ranges, the locations of all the TBaskets, and asks for them all to be returned in one response. This is a "multi-part GET" in HTTP, and not all servers support it. (Not attempting this was a big inefficiency in Uproot 3.)
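As a rough illustration of what such a request looks like on the wire, here is a minimal sketch that builds a single Range header from several byte ranges and checks whether a response looks like a multipart answer. The helper names and the example byte ranges are hypothetical, and this is not Uproot's own code.

```python
def build_range_header(byte_ranges):
    """Format half-open (start, stop) byte ranges as one HTTP Range header value.

    HTTP byte ranges are inclusive on both ends, so stop - 1 is sent.
    """
    parts = [f"{start}-{stop - 1}" for start, stop in byte_ranges]
    return "bytes=" + ",".join(parts)


def looks_multipart(status, content_type):
    """True if a response looks like a multipart answer to a multi-range request."""
    return status == 206 and content_type.lower().startswith("multipart/byteranges")


# Hypothetical byte ranges standing in for three TBasket locations.
example_ranges = [(0, 100), (500, 650), (2000, 2048)]
example_header = build_range_header(example_ranges)
```

A server that does not support multi-range requests may answer with the whole file, or with only one of the requested ranges, which is why the client has to inspect the response before trusting it.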

Let's start with just one TBranch:

>>> t["B0_M"].array()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...
  File "/home/jpivarski/irishep/uproot4/src/uproot/source/http.py", line 360, in task
    resource.handle_multipart(source, futures, results, response)
  File "/home/jpivarski/irishep/uproot4/src/uproot/source/http.py", line 423, in handle_multipart
    raise OSError(
OSError: found 1 of 27 expected headers in HTTP multipart
for URL https://rebrand.ly/00vvyzg

This server does not support multi-part GETs, but I don't know why Uproot hasn't switched over to using its fallback "one HTTP volley per TBasket" mode yet:

>>> f.file.source
<HTTPSource '...ly/00vvyzg' at 0x7f83f02755b0>

(doesn't say "with fallback").

I tried reading a small number of entries:

>>> t["B0_M"].array(entry_stop=5)
<Array [5.02, 5.11, 5.12, 5.36, 5.3] type='5 * float32'>

Maybe that worked because these entries are all in the first TBasket, so only one byte range is being sent from the HTTP server to Uproot? How many baskets are there and how big is the first one?

>>> t["B0_M"].num_baskets
28
>>> t["B0_M"].basket_entry_start_stop(0)
(0, 7981)

So the first 7981 entries are all in the first TBasket. Reading up to that point should be no problem.

>>> t["B0_M"].array(entry_stop=7981)
<Array [5.02, 5.11, 5.12, ... 5.18, 5.45, 5.15] type='7981 * float32'>
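The reasoning about which TBaskets an entry range touches can be sketched in plain Python. Only the first basket boundary, (0, 7981), comes from the output above; the other boundaries and the helper name are made up for illustration.

```python
def baskets_for_entries(basket_ranges, entry_start, entry_stop):
    """Return indices of baskets that overlap the half-open entry range."""
    return [
        index
        for index, (start, stop) in enumerate(basket_ranges)
        if start < entry_stop and stop > entry_start
    ]


# First boundary is from the session above; the rest are hypothetical.
toy_baskets = [(0, 7981), (7981, 16000), (16000, 24000)]

one_basket = baskets_for_entries(toy_baskets, 0, 7981)   # stays inside basket 0
two_baskets = baskets_for_entries(toy_baskets, 0, 7982)  # spills into basket 1
```

With these toy boundaries, entry_stop=7981 needs one byte range while entry_stop=7982 needs two, which is the boundary being probed here.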

Presumably, we should start to see the problem when reading past that point, which asks for two TBaskets (discontiguous byte ranges in the file). But it doesn't:

>>> t["B0_M"].array(entry_stop=7982)
<Array [5.02, 5.11, 5.12, ... 5.45, 5.15, 5.14] type='7982 * float32'>
>>> t["B0_M"].array(entry_stop=14000)
<Array [5.02, 5.11, 5.12, ... 5.1, 5.07, 5.3] type='14000 * float32'>
>>> t["B0_M"].array(entry_stop=28000)
<Array [5.02, 5.11, 5.12, ... 5.13, 5.37, 5.15] type='28000 * float32'>

Even if we read all 28 TBaskets in this TBranch:

>>> t["B0_M"].array()
<Array [5.02, 5.11, 5.12, ... 5.25, 5.25, 5.26] type='329135 * float32'>

But that's because Uproot's HTTP client is already in fallback mode:

>>> f.file.source
<HTTPSource '...ly/00vvyzg' with fallback at 0x7f83f02755b0>

I shut down Python and tried again, and it seems that Uproot correctly goes into fallback mode if entry_stop is given but not if it isn't. That's an Uproot bug.

But the bigger issue for you is that you can't use multi-part GET with this HTTP server. Knowing that, you can go directly into fallback mode. uproot.HTTPSource attempts the multi-part GET with a fallback, and uproot.MultithreadedHTTPSource is that fallback: it launches a suite of worker threads and has each one open a single HTTP connection. With too few workers you'll be latency-dominated, waiting for the server to respond individually for each TBasket of data; with too many you'll clog the server, because it gets all the requests at once and has to manage them. Both considerations depend on the quality of the network between you and the server and on the quality of the server itself, but you can tune num_workers. (See uproot.open. If most of the directory data and TTree metadata is at the beginning of the file, you can also increase begin_chunk_size to preemptively read more and, with luck, avoid making several volleys to get directory data.)

>>> import uproot
>>> f = uproot.open("https://rebrand.ly/00vvyzg", http_handler=uproot.MultithreadedHTTPSource, num_workers=10)
>>> f.file.source
<MultithreadedHTTPSource '...ly/00vvyzg' (10 workers) at 0x7ff5438545b0>
>>> t = f["b0phiKs"]
>>> t["B0_M"].array()
<Array [5.02, 5.11, 5.12, ... 5.25, 5.25, 5.26] type='329135 * float32'>
>>> import time
>>> starttime = time.time(); df = t.arrays(library="pd"); print(time.time() - starttime)
160.00439739227295
>>> df
        exp_no  run_no    evt_no      B0_M  ...   B0_hoo3   B0_hoo4  nCands  iCand
0            0       0  23045033  5.024452  ...  0.004035 -0.003992       1      0
1            0       0  23046788  5.107935  ... -0.001334  0.000621       2      0
2            0       0  23046788  5.119214  ... -0.001160 -0.000464       2      1
3            0       0  23046989  5.361363  ... -0.002333  0.019902       1      0
4            0       0  23049068  5.301050  ...  0.002522  0.001435       1      0
...        ...     ...       ...       ...  ...       ...       ...     ...    ...
329130       0       0  71575996  5.093135  ... -0.007649  0.031542       1      0
329131       0       0  71646148  5.096963  ... -0.000318  0.004670       1      0
329132       0       0  71647975  5.250312  ...  0.000000  0.057552       1      0
329133       0       0  71649052  5.250185  ...  0.017649  0.021859       1      0
329134       0       0  71649130  5.262949  ...  0.003130  0.004816       1      0

[329135 rows x 50 columns]

Two and a half minutes for me. I watched this with iftop and see all the traffic from desycloud.desy.de. Oh, hi Kilian!
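For intuition about what num_workers controls, here is a minimal sketch of the "one request per byte range, spread over a pool of threads" idea using only the standard library. It is not Uproot's implementation; fetch_all, fake_fetch, and the in-memory fake_file are hypothetical stand-ins for real HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_one, byte_ranges, num_workers=10):
    """Fetch each byte range with its own call, using num_workers threads.

    pool.map returns results in the order of byte_ranges, regardless of
    which request finishes first.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(fetch_one, byte_ranges))


# In-memory stand-in for a remote file, so the sketch runs without a network.
fake_file = bytes(range(256)) * 4


def fake_fetch(byte_range):
    """Pretend to do one single-range HTTP request."""
    start, stop = byte_range
    return fake_file[start:stop]


chunks = fetch_all(fake_fetch, [(0, 8), (100, 110), (1000, 1024)], num_workers=3)
```

With a real server, each fetch_one call would pay one round trip, so a larger pool hides more latency until the server or the network becomes the bottleneck.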

klieret commented 2 years ago

Hi @jpivarski ! Wow, thanks a lot for taking me through the whole journey of getting to the bottom of this! This has been a very interesting read :)
Also good to know of the limitations of this server! It's not really used for data files usually (it's only used in this tutorial to make it more portable for users who don't have their official accounts yet), and now we know it really shouldn't be!

jpivarski commented 2 years ago

"it seems that Uproot correctly goes into fallback mode if entry_stop is given but not if it isn't. That's an Uproot bug."

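If entry_stop is given and is small enough, then the request is only one TBasket and therefore one part of a multi-part GET. This HTTP server responds correctly to that case. So the above-mentioned bug can be triggered by any request that spans more than one TBasket on a freshly opened file, such as t["B0_M"].array() with no entry_stop.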

Instead of raising an error during the handling of the HTTP response:

https://github.com/scikit-hep/uproot4/blob/22407793f9eab02ce73d6843bb3395bedacab50b/src/uproot/source/http.py#L401-L406

now we just switch over to the fallback:

https://github.com/scikit-hep/uproot4/blob/0268c4b652b1dcc6e89c5656052bd2e221b4db23/src/uproot/source/http.py#L406-L407
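The general pattern of switching to a fallback rather than raising can be sketched as follows. This is a simplified illustration under stated assumptions, not the code behind the links above; RangeReader and the two fetch callables are hypothetical.

```python
class RangeReader:
    """Try one multi-range request first; fall back to one request per range."""

    def __init__(self, fetch_multipart, fetch_single):
        self._fetch_multipart = fetch_multipart
        self._fetch_single = fetch_single
        self.fallback = False

    def read(self, byte_ranges):
        if not self.fallback:
            parts = self._fetch_multipart(byte_ranges)
            if len(parts) == len(byte_ranges):
                return parts
            # Fewer parts than requested: remember that and do not raise.
            self.fallback = True
        return [self._fetch_single(r) for r in byte_ranges]


data = bytes(range(256))


def single(byte_range):
    """Stand-in for a single-range request that always works."""
    return data[byte_range[0]:byte_range[1]]


def broken_multipart(byte_ranges):
    """Stand-in for a server that answers only the first requested range."""
    return [single(byte_ranges[0])]


reader = RangeReader(broken_multipart, single)
result = reader.read([(0, 4), (10, 12)])
```

In this toy setup a one-range request succeeds without ever entering fallback mode, while a two-range request flips the flag and is retried range by range, mirroring the behavior seen with entry_stop in the session above.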

I was able to read your data, but it's hard to put this into a testing suite, since it relies on an external site. (And it relies on the external HTTP server not being improved in the future.)

Fixed in #594.