iris-hep / func_adl_servicex

Send func_adl expressions to a ServiceX endpoint

Making Data access more uniform #55

Open gordonwatts opened 1 year ago


It would be nice to have a more carefully thought-out and straightforward way to ask for data to come back from ServiceX. In short, normalize the access patterns for ServiceX. The current interface has grown organically, and there are now so many operations that it is hard to surface them from one place to the other. Time to take a step back, perhaps.

What we have now

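        from servicex import ServiceXDataset
        from func_adl_servicex import ServiceXSourceUpROOT

        # uproot_single_file and endpoint_uproot are defined elsewhere
        # (a dataset file identifier and a backend name, respectively).
        sx = ServiceXDataset([uproot_single_file],
                             backend_name=endpoint_uproot,
                             status_callback_factory=None)
        src = ServiceXSourceUpROOT(sx, 'mini')
        # Build the query, pick the output format, then fetch with value().
        r = (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                .AsAwkwardArray()
                .value())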

And AsAwkwardArray can be replaced by a bunch of different things:

These methods do not return the actual data, just the request to generate it. The value() call at the end is what actually triggers the infrastructure to generate the data. There is also an asynchronous version, value_async(), which does the same thing but lets you easily queue up many requests at once.
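To make the value() / value_async() distinction concrete, here is a minimal sketch of the pattern. DeferredQuery is a hypothetical stand-in written only for illustration; it is not the real func_adl_servicex class and it does not talk to ServiceX. It assumes only what is described above: building the query is cheap, value_async() does the work, and value() is a blocking wrapper around it.

```python
import asyncio


class DeferredQuery:
    """Hypothetical stand-in for a query object (NOT the real func_adl class)."""

    def __init__(self, name, delay=0.01):
        self.name = name
        self.delay = delay

    async def value_async(self):
        # Stand-in for submitting the request and waiting on the backend.
        await asyncio.sleep(self.delay)
        return f"data for {self.name}"

    def value(self):
        # Blocking convenience wrapper around value_async().
        return asyncio.run(self.value_async())


async def fetch_all(queries):
    # Start every request at once; gather returns results in input order.
    return await asyncio.gather(*(q.value_async() for q in queries))


# Constructing the queries does no work; only value()/value_async() does.
queries = [DeferredQuery("lep_pt"), DeferredQuery("lep_eta")]
results = asyncio.run(fetch_all(queries))
single = DeferredQuery("lep_phi").value()
```

With the async form, the requests overlap instead of running back to back, which is the main reason to prefer it when many queries are needed.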

There are at least two axes here:

There is yet another axis for the ROOT and parquet queries: do you want the files downloaded locally into a cache, or just a URI to access them over the web? This is only accessible via direct calls to the servicex library (e.g. see get_root_files_async, get_root_files_stream, get_data_rootfiles_uri_stream, and get_data_rootfiles_uri_async).

What do users of func_adl want?

Let's look at each one and reason about why different choices are made.

Starting from Scratch