ssl-hep / ServiceX

ServiceX - a data delivery service pilot for IRIS-HEP DOMA
BSD 3-Clause "New" or "Revised" License
19 stars 21 forks source link

Migrate Transform Manager To Its Own Microservice #792

Open BenGalewsky opened 1 month ago

BenGalewsky commented 1 month ago

Problem

Several critical high frequency operations related to transforms are made using REST calls through the single threaded flask app. This has several problems:

  1. Messages can be lost when the flask server is busy
  2. Microservices and the client have to implement retry logic in all of their interactions with the app
  3. We need to scale up instances of the app even though much of its time is spent blocked on I/O operations
  4. The app has become quite complicated since it provides so many services

Approach

Migrate the transformer manager functionality out of the app and into a message driven microservice. Microservice architecture calls on a clear separation of database concerns within each service. We will respect this by having the app only read transform related tables, while the new transformer manager service will perform all of the writes on:

Assumptions

  1. We will implement the new Transformer Manager microservice using the Celery framework
  2. The SQL Alchemy models will need to be shared between the app and the transformer manager. They will be shared via a new directory at the top level of the ServiceX monorepo so we don't have to introduce a new library to our build pipeline
  3. The celery task definitions must be shared between the TranformManager service and the other services. We will use the signatures feature of Celery to accomplish this.
  4. The transform submit REST POST operation will be hollowed out so it returns quickly. It will perform some high level validation, generate the request ID and put the record on a queue before returning the request ID back to the client.
  5. The DID finder library will be modified to use Celery tasks instead of PUTs to the app
  6. The transformer sidecar will be modified to use Celery tasks instead of PUTs to the app
gordonwatts commented 3 weeks ago

This looks wonderful - we've been talking around the edges of this, and I like how you've clearly drawn a boarder around what should be extracted here.

One thing I'd add - there is no mention of getting error messages back to the user when various things fail. Before the re-write above occurs, it would be good to at least understand how that will work. Losing, for example, errors from the codegen part would make life a lot worse. This might mean really thinking about this and the database and having some clear internal rules for how errors are reported.

I don't now that this has to be part of this work - but this work has the potential to make the UX worse than it is now.