ssl-hep / ServiceX

ServiceX - a data delivery service pilot for IRIS-HEP DOMA
BSD 3-Clause "New" or "Revised" License

ServiceX Year 5 #430

Open BenGalewsky opened 2 years ago

BenGalewsky commented 2 years ago

The ServiceX goals for the last year of IRIS-HEP are to:

  1. Increase Reliability
  2. Improve usability
  3. Increase its physics reach

We believe that with the requested staffing for year 5 we should be able to achieve these goals.

More specifically:

1. Increase Reliability

a. Support multiple code generator backends with a single ServiceX instance
b. Archive old transform results to manage object store space usage
c. Make releases easier and more frequent by reducing their complexity. Issue #431 puts all of the services into a single repo so they can be released from a single branch

There are also several smaller issues in the backlog that contribute to this goal.

2. Improve Usability

a. Make ServiceX transform requests durable, so that results can be regenerated as needed
c. Add a synchronous interface to the ServiceX frontend and integrate it with the existing Coffea executors
d. Improve the error reporting workflow to ensure timely and actionable error reports are returned to the user, e.g. #408, #332, #317
e. Integrate with CERN JWT so users can bring their own credentials to runs
f. Implement a columnar cache to allow users to share transformed results

3. Increase Physics Reach

a. Return systematic variations in the transform result #71 #429
b. Create a transformer for CMS MiniAOD
c. Create a transformer that can extract data from ATLAS open data zip files

msneubauer commented 2 years ago

Isn't 1b more appropriate to allow users to share transformed results? I guess it depends on what you mean by "old". I would think 2f is more about performance on subsequent column queries.

BenGalewsky commented 2 years ago

> Isn't 1b more appropriate to allow users to share transformed results? I guess what you mean by "old". I would think 2f is more about performance on subsequent column queries

So 1b is entirely about managing our resources. Right now transforms just sit in the object store, and it eventually fills up. 2a is a bit more about sharing transforms. I can give you my transform ID, and even if it got cleaned up last month you are guaranteed to be able to regenerate it.
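To make the 1b/2a distinction concrete, here is a toy sketch in Python. Everything in it is hypothetical and is not the ServiceX API: `TransformStore`, `run_transform`, `archive`, and `fetch` are made-up names, and plain dictionaries stand in for the database and the object store. The idea it illustrates is only that the request spec is kept for good while the results may be evicted and rebuilt on demand.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TransformStore:
    """Toy model: request specs are durable, results are evictable."""

    # Hypothetical callable that turns a request spec into result file names.
    run_transform: Callable[[str], List[str]]
    # transform ID -> request spec; never deleted (stands in for the database).
    specs: Dict[str, str] = field(default_factory=dict)
    # transform ID -> result files; may be evicted (stands in for the object store).
    results: Dict[str, List[str]] = field(default_factory=dict)

    def submit(self, transform_id: str, spec: str) -> List[str]:
        """Record the spec, run the transform, and keep the results."""
        self.specs[transform_id] = spec
        self.results[transform_id] = self.run_transform(spec)
        return self.results[transform_id]

    def archive(self, transform_id: str) -> None:
        """1b: drop the results to free space; the spec stays behind."""
        self.results.pop(transform_id, None)

    def fetch(self, transform_id: str) -> List[str]:
        """2a: return the results, regenerating them if they were cleaned up."""
        if transform_id not in self.results:
            self.results[transform_id] = self.run_transform(self.specs[transform_id])
        return self.results[transform_id]
```

With this shape, someone holding only a transform ID can call `fetch` after `archive` has run and still get results back, at the cost of re-running the transform.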

I would certainly appreciate more strategic thinking around 2f. I guess it could be mostly useful for a single analyzer noodling around with their columnar data while they perfect their analysis. I thought part of the "I" in "IDDS" was about making transformed results more reusable for different analyzers.