Fix squonk job execution

mwinokan commented 3 weeks ago

@tdudgeon and @alanbchristie please give us an estimate for the amount of work needed to update the deployments to use the new data model and resuscitate the algorithms

mwinokan commented 1 week ago

@tdudgeon says that the work needed depends largely on whether the shared filesystem between Fragalysis and Squonk is needed.

@phraenquex says that the first step is to get the job submission working again and that conversations with SC have ruled out a shared file system between the Fragalysis stack and thread for job execution

@alanbchristie A schematic of the architecture for Fragalysis and Squonk would help immensely

alanbchristie commented 4 days ago

Here are some references for the original design: -

The original high-level design for Squonk Job Execution in Fragalysis: - With references to various google docs https://github.com/m2ms/fragalysis-frontend/issues/839

A simplified low-level design document (Fragalysis/Squonk AC (937) LLD) https://docs.google.com/document/d/1lFpN29dK1luz80lwSGi0Rnj1Rqula2_CTRuyWDUBu14/edit?tab=t.0#heading=h.y56jkt6kfj5y

Brief but practical documentation on how to use Job Execution, on the b/e ReadTheDocs site: - https://fragalysis-stack-kubernetes.readthedocs.io/en/latest/architecture/index.html See the section: - Squonk Integration

An issue discussing the global Job Configuration https://github.com/m2ms/fragalysis-frontend/issues/944

Other related issues include: -

Fix Squonk access control
An early and detailed discussion of how Fragalysis and Squonk objects are related
https://github.com/m2ms/fragalysis-frontend/issues/937

Implement job status refresh
Relates to availability of a token and a refresh mechanism in the F/E. An attempt to review job status that may have been lost (due to callback errors)
https://github.com/m2ms/fragalysis-frontend/issues/872

API

job_request: https://fragalysis.xchem.diamond.ac.uk/api/job_request/
job_file_transfer: https://fragalysis.xchem.diamond.ac.uk/api/job_file_transfer/
job_callback: https://fragalysis.xchem.diamond.ac.uk/api/job_callback/
job_config: https://fragalysis.xchem.diamond.ac.uk/api/job_config/
job_override: https://fragalysis.xchem.diamond.ac.uk/api/job_override/
job_access: https://fragalysis.xchem.diamond.ac.uk/viewer/job_access/

Views (in viewer.views)

JobRequestView [GET, POST]
JobFileTransferView [GET, POST]
JobCallbackView [GET, PUT] (PUT used by Squonk)
JobConfigView [GET]
JobOverrideView [GET, POST]
JobAccessView [GET]
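As a quick reference, the two lists above can be combined into one lookup. This is only a sketch: the pairing of each endpoint with a view is assumed from the matching names, and BASE_URL, ENDPOINTS and endpoint_url() are hypothetical helpers, not code from the Fragalysis repo.

```python
from urllib.parse import urljoin

# Base URL taken from the API list above.
BASE_URL = "https://fragalysis.xchem.diamond.ac.uk/"

# Endpoint path and allowed HTTP methods.
# Assumption: each endpoint is served by the view with the matching name.
ENDPOINTS = {
    "job_request": ("api/job_request/", {"GET", "POST"}),
    "job_file_transfer": ("api/job_file_transfer/", {"GET", "POST"}),
    "job_callback": ("api/job_callback/", {"GET", "PUT"}),  # PUT used by Squonk
    "job_config": ("api/job_config/", {"GET"}),
    "job_override": ("api/job_override/", {"GET", "POST"}),
    "job_access": ("viewer/job_access/", {"GET"}),
}


def endpoint_url(name, method="GET"):
    """Return the full URL for an endpoint, rejecting unsupported methods."""
    path, methods = ENDPOINTS[name]
    if method.upper() not in methods:
        raise ValueError(f"{name} does not accept {method.upper()}")
    return urljoin(BASE_URL, path)
```

For example, endpoint_url("job_request", "POST") gives the job_request URL, while endpoint_url("job_config", "POST") raises a ValueError because that view is listed as GET only.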

Basic flow

1. F/E initiates a File Transfer Providing: -
   - Target Access String (Project/visit)
   - Target
   - Snapshot
   - Session Project
   - Proteins
   - Compounds
   The B/E then transfers files to Squonk using the User and Target as the basis for the Project name. Each Stack also adds a unique prefix so that projects from Alan on Stack X are different from those on Stack Y. A typical Squonk Project name will be Fragalysis fs-abc achristie::lb18141-1
2. F/E initiates a Job by using Job Request Providing: -
   - Target Access String (Project/visit)
   - Target
   - Snapshot
   - Session Project
   - Squonk Job Name
   - Squonk Job Spec
3. Squonk calls into Fragalysis using the Job Callback API
4. The B/E responds to callbacks that indicate 'completion' by fetching the generated file from Squonk's filesystem and then uploading it as a new "computed/compound set".
5. F/E polls Job Request until complete
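The final polling step could look roughly like the sketch below. It is written in Python for illustration only (the real F/E is not Python), and everything in it is an assumption: poll_until_complete(), the fetch_status callable (standing in for a GET on Job Request) and the status strings are hypothetical, not the actual Job Request response format.

```python
import time


def poll_until_complete(fetch_status, interval_s=5.0, timeout_s=600.0,
                        sleep=time.sleep):
    """Call fetch_status() repeatedly until it reports a terminal status.

    fetch_status is a hypothetical callable that returns the current job
    status as a string. The status names below are assumed for the sketch.
    """
    terminal = {"SUCCESS", "FAILURE"}
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays. A timeout matters here because, as noted in the related issues, a job status can be lost when a callback fails.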

Transferring Target files (to squonk)

After a lot of "preamble" logic the JobFileTransferView: -

Creates a FileTransfer record
Runs the Transfer logic as a Celery task. This is accomplished by process_job_file_transfer.delay(...) in tasks.py, which is given the ID of the JobTransfer record. The record contains a JSON record listing the proteins and compounds required (amongst other things).
Ultimately this code relies on logic in squonk_job_file_transfer.py to locate the files and then use the Data Manager API to send them to the Squonk filesystem (using its put_unmanaged_project_files() API method).
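The pattern described above (the view stores a record, and the task is handed only the record's ID and reloads it) can be sketched without Django or Celery. This is a stand-in, not the real code: the in-memory _RECORDS store, create_transfer_record(), run_transfer() and the send_files callable (in place of the Data Manager call) are all hypothetical, and the JSON keys are assumed.

```python
import json
from itertools import count

# Hypothetical in-memory stand-in for the database table of transfer records.
_RECORDS = {}
_ids = count(1)


def create_transfer_record(proteins, compounds):
    """What the view does: store the file lists as JSON and return the ID."""
    record_id = next(_ids)
    _RECORDS[record_id] = json.dumps(
        {"proteins": proteins, "compounds": compounds}
    )
    return record_id


def run_transfer(record_id, send_files):
    """What the task does: reload the record by ID and send the listed files.

    send_files is a hypothetical callable standing in for the Data Manager
    API call that copies files to the Squonk filesystem.
    """
    spec = json.loads(_RECORDS[record_id])
    files = spec["proteins"] + spec["compounds"]
    send_files(files)
    return len(files)
```

Passing only the ID keeps the task message small and means the task always works from the stored record rather than from a copy made when the job was queued.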

Uploading Job results (from squonk)

The JobCallbackView receives notifications about completed Jobs. When a job is complete: -

A number of "chained" Celery tasks are executed: -
- process_compound_set_job_file(...)
- validate_compound_set()
- process_compound_set()
- erase_compound_set_job_material()

These four functions retrieve the SD file (and JSON parameters) from Squonk and 'format' the content to better fit Fragalysis, then simply validate and process the file (using MolOps) to create a new "compound set" before removing the transient files.
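To illustrate what "chained" means here: each step runs only after the previous one, and receives the previous step's return value. The sketch below mimics that with plain Python rather than Celery. The step functions are hypothetical stand-ins that only record their order, and the state dictionary is an assumption, not the real task signatures.

```python
from functools import reduce


def run_chain(steps, initial):
    """Call each step with the previous step's return value, in order."""
    return reduce(lambda value, step: step(value), steps, initial)


# Hypothetical stand-ins for the four tasks listed above.
def fetch_job_file(state):
    # Retrieve the SD file from Squonk and reformat it for Fragalysis.
    return {**state, "sdf": "fetched", "log": state["log"] + ["fetch"]}


def validate(state):
    return {**state, "valid": True, "log": state["log"] + ["validate"]}


def process(state):
    return {**state, "compound_set": "created", "log": state["log"] + ["process"]}


def erase(state):
    # Remove the transient file once the compound set exists.
    cleaned = {k: v for k, v in state.items() if k != "sdf"}
    cleaned["log"] = state["log"] + ["erase"]
    return cleaned


result = run_chain([fetch_job_file, validate, process, erase],
                   {"job_id": 1, "log": []})
```

After the run, result["log"] lists the steps in the order they executed and the transient "sdf" entry has been removed, which mirrors the retrieve, validate, process, erase sequence described above.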

Effort

If Squonk Job execution remains the same, as does the configuration, then the only thing broken will be the file transfer's ability to locate the files. We think the RHS upload will still work, but might need an adjustment. In summary: -

0.5 days to reconfigure squonk in the new cluster and make sure its API works
2 days to adjust the file transfer code, test and release the new logic

Maybe 2-3 days?

mwinokan commented 1 day ago

@alanbchristie asks if the f/e feature for starting squonk jobs is still there

@boriskovar-m2ms says it's there but won't work with the new data format, and expects there to be several issues
