m2ms / fragalysis-frontend

The React, Redux frontend built by webpack

Spec and budget for FEGrow algorithm deployment #1459

Open mwinokan opened 1 month ago

mwinokan commented 1 month ago

The developers of FEGrow, a fragment expansion and scoring algorithm (actively used in ASAP), are applying for a grant and seeking to cover the costs of deploying the algorithm to Fragalysis.

Here is the brief:

Helping to deploy a containerised FEgrow workflow, including new scoring methods, to our Fragalysis platform to enable the XChem user community to elaborate selected fragment structures, and to visually review the outputs in comparison with inspiration fragments. We would be particularly enthusiastic about methods to identify atoms / functional groups that drive binding, and those that are amenable to substitution.

I have outlined a few key items of work needed for this (a rough, illustrative sketch of the expected job inputs/outputs follows the list):

  1. (b/e IM) Deployment of containerised FEGrow compound design workflow(s) as Squonk jobs
  2. (b/e IM) Deployment of containerised FEGrow compound scoring workflow(s) as Squonk jobs
  3. (b/e IM) Registration of FEGrow outputs in the database (new compound designs & associated scores)
  4. (b/e IM) Registration of FEGrow per-atom/interaction scoring in the database
  5. (f/e M2M) Highlighting of atoms/groups that are predicted to drive binding in the NGL view
  6. (f/e M2M) Highlighting of atoms/groups with suggested substitutions from FEGrow in the NGL view
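
To make the data flow for items 1–4 a little more concrete, here is a very rough, non-binding sketch of what a job's I/O contract could look like. The `elaborate_and_score` placeholder and the SD tag names are assumptions for illustration only, not FEGrow's or Squonk's real API:

```python
# Hypothetical sketch only: read fragments + protein, attach overall and
# per-atom scores as SD tags that the b/e could later register (items 3-4).
import json

from rdkit import Chem


def elaborate_and_score(fragment, protein_pdb):
    """Placeholder for the containerised FEGrow elaboration/scoring step."""
    candidate = Chem.Mol(fragment)  # stand-in for a real elaboration
    per_atom = {atom.GetIdx(): 0.0 for atom in candidate.GetAtoms()}  # dummy scores
    return candidate, 0.0, per_atom


def run_job(fragments_sdf, protein_pdb, output_sdf):
    writer = Chem.SDWriter(output_sdf)
    for frag in Chem.SDMolSupplier(fragments_sdf):
        if frag is None:
            continue
        mol, score, per_atom = elaborate_and_score(frag, protein_pdb)
        # Tag names are made up; the real ones would be agreed with the b/e.
        mol.SetProp("fegrow_score", str(score))
        mol.SetProp("fegrow_atom_scores", json.dumps(per_atom))
        writer.write(mol)
    writer.close()


if __name__ == "__main__":
    run_job("lhs_fragments.sdf", "target_apo.pdb", "fegrow_designs.sdf")
```

If something along these lines holds, items 3 and 4 become largely a question of how the b/e parses those tags into the database, and items 5 and 6 of how the f/e renders them in NGL.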

@tdudgeon, @kaliif, @boriskovar-m2ms could you please help me with an estimate of the time (and hence cost) of the above spec? (And please do sanity check that these 6 items cover the main work required.)

tdudgeon commented 1 month ago

Looks interesting. To be able to better assess this I think we need to know:

  1. what the inputs from Fragalysis would be
  2. what the outputs would be
  3. how any outputs (e.g. scores) would be expected to be used (viewed) in Fragalysis
  4. computational complexity (e.g. is parallelisation necessary)
mwinokan commented 1 month ago

Thanks @tdudgeon, I hope the below details help. @boriskovar-m2ms they will likely inform the f/e estimation too

> 1. what the inputs from Fragalysis would be

All of these would be for a single target at a time

> 2. what the outputs would be

> 3. how any outputs (e.g. scores) would be expected to be used (viewed) in Fragalysis

> 4. computational complexity (e.g. is parallelisation necessary)

I have reached out to the developers and will get back to you. I believe that any development required for FEGrow will be covered by their own engineering resources.

mwinokan commented 1 month ago

@tdudgeon regarding 4. above:

In terms of building and scoring a molecule, there are no particular requirements. It has quite a few dependencies; see installation and environment here: https://cole-group.github.io/FEgrow/installation/

Certain parts (like the ML components) can be GPU-accelerated, and we use Dask for high-throughput work, but these are optional.
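
For what it's worth, the Dask usage they describe presumably follows the standard distributed map/gather pattern, roughly as below; `score_molecule` here is just a stand-in, not FEGrow's actual API:

```python
# Rough illustration of the optional Dask-based high-throughput scoring;
# score_molecule is a placeholder for a single FEGrow build + score.
from dask.distributed import Client, LocalCluster


def score_molecule(smiles: str) -> float:
    # Each molecule is scored independently, so the work parallelises
    # trivially across workers (and could use GPU workers where available).
    return float(len(smiles))  # dummy value


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)
    smiles = ["c1ccccc1O", "CCN", "CC(=O)Nc1ccccc1"]
    futures = client.map(score_molecule, smiles)
    print(dict(zip(smiles, client.gather(futures))))
    client.close()
    cluster.close()
```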

boriskovar-m2ms commented 1 month ago

Step zero is to get the current implementation of jobs working again, because they are (at least on the F/E) incompatible with the changes made to the LHS (this work was deferred to a later date many months ago).

So my comments already assume that step zero is done.

  1. Inputs from the LHS are working, so we just need to add inputs from the RHS, which I partly prepared about two years ago; I'm not sure how compatible that is with the current state of things. Let's say 2 days.

  2. (This should be point 3 and not 2; not sure why markdown is so stubborn about this.) This is quite a departure from uniform output handling:

    • Currently we are not showing metadata for the LHS. I guess we have to replicate the mechanisms from the RHS (you can pick which columns to show, etc.), which is not a simple copy and paste because the LHS and RHS are two quite different implementations (covered in another issue). I would say 5 days.
    • Filter for LHS metadata - I have no idea what a new LHS filter should look like, because there have been many ideas floating around for a few years now, BUT if we need to replicate the RHS filter on the LHS then it would be 7 days.
    • Visual indication that atomic scores are available on the LHS & RHS (a button?)

    I have no idea what this should look like. What is the difference between the metadata and the atomic scores?

    • Indicating substitutions, and adding some flair to NGL to outline them and the key binding atoms, is about 10 days (I'm not the biggest friend of the NGL view, so there will be a lot of trial and error, as always when working with NGL).
mwinokan commented 1 month ago

@tdudgeon could you please give me a very rough estimate of the time needed now that the spec is a bit clearer?

tdudgeon commented 1 month ago

@mwinokan it's very difficult to say without knowing more about the scope. Creating a new Squonk job can be as little as half a day, and then maybe another half day to get it running in Fragalysis. But that assumes there is existing Python code that can just be adapted and nothing else needs doing; in this case there is definitely extra work:

  1. Job execution is currently broken, and we don't really know how much work is needed to get it working again, but hopefully it's only a couple of days
  2. We cannot handle RHS inputs currently (#962). This requires f/e and b/e work and is difficult to estimate, but is certainly going to be several days work
  3. Additional work is needed to handle the scoring and filtering. This is probably mostly f/e work, so we can use @boriskovar-m2ms's estimates for that
  4. Job execution is currently quite hacky and needs significant improvement (mostly #1216, #1154, #1059, but probably other things too). Ideally this would be done first. Is this in scope?
  5. If parallelisation of the job is needed for performance reasons (e.g. as a Nextflow job) then that's probably at least another day of work.

And then there are the bits we don't yet know about ;-) Could I suggest we set up a call with the people who want to use this tool so that we can find out more about how it works?

mwinokan commented 1 month ago

Thanks @tdudgeon, I will see if the above is enough to provide them with a ballpark figure for their grant; if not, I think a call is sensible.