Support incremental RHS uploads with persistent names

mwinokan commented 7 months ago

It's not possible to overwrite an RHS CSET upload; a duplicate is made instead. Ideally, when updating a compound set with incremental uploads that only add metadata to existing entries, the compounds should keep a persistent name.

mwinokan commented 5 months ago

@mwinokan to provide @kaliif with example data and firm spec for the persistent compound naming

mwinokan commented 5 months ago

Test Dataset

@kaliif a fresh A71EV2A dataset has been uploaded to staging. I have prepared four test compound sets for testing the incremental upload. These four SDFs all work in the current staging but they create four separate compound sets, with every compound having a unique identifier (even though there are many shared chemical structures). I have added in a version column in the SDF to test the updating of metadata.

A71EV2A_incremental_csets.zip

Across the four sets there are only four unique compounds, with seven unique poses:

Compound ID	Pose ID	upload_1	upload_2	upload_3	upload_4
90	108	Y	Y
90	109	Y	Y	Y	Y
2165	6580	Y	Y	Y	Y
2165	6581	Y	Y	Y	Y
2183	6588	Y	Y		Y
2246	6620				Y
2246	6621				Y

Apologies for using the pose nomenclature to mean something different from the LHS (that's what I call it in my software "HIPPO"). The compound IDs and pose IDs are the unique identifiers from my database.

Specification

A unique compound identifier should be generated for each ligand. I have had the most luck by flattening the SMILES (removing stereochemistry), and then using the InChI-key to generate a unique ID. E.g. A71EV2A-U95X
Each pose/conformer associated with that compound should get its own related unique identifier, similar to the LHS. E.g. A71EV2A-U95X-a
When performing successive RHS uploads, for each ligand it should be checked whether that chemical structure already has an ID (by using the flat InChI-key method described above), and whether there is a sufficiently similar pose (RMSD of each atom <0.5 Angstrom). Use these rules to match compounds between uploads so that metadata can be updated/superseded.
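The matching rules above can be sketched as a small in-memory registry. This is only an illustration, not the Fragalysis implementation: `Registry`, `Pose` and `same_pose` are hypothetical names, the flat InChI-key is assumed to be computed elsewhere and passed in as a plain string, atoms are assumed to be listed in the same order in both poses, and the cutoff is read as a per-atom distance.

```python
import math
from dataclasses import dataclass, field

CUTOFF = 0.5  # Angstrom, taken from the specification above


def same_pose(a, b, cutoff=CUTOFF):
    """True if every pair of matching atoms is closer than the cutoff."""
    if len(a) != len(b):
        return False
    return all(math.dist(p, q) < cutoff for p, q in zip(a, b))


@dataclass
class Pose:
    coords: list    # [(x, y, z), ...] with atoms in a consistent order
    metadata: dict


@dataclass
class Registry:
    # flat key -> list of poses; insertion order provides the indices
    compounds: dict = field(default_factory=dict)

    def register(self, flat_key, coords, metadata):
        """Return (compound_index, pose_index), reusing entries on a match."""
        poses = self.compounds.setdefault(flat_key, [])
        compound_index = list(self.compounds).index(flat_key)
        for pose_index, pose in enumerate(poses):
            if same_pose(pose.coords, coords):
                pose.metadata = dict(metadata)  # new upload supersedes old metadata
                return compound_index, pose_index
        poses.append(Pose(list(coords), dict(metadata)))  # novel pose
        return compound_index, len(poses) - 1
```

Registering the same key with near-identical coordinates returns the existing indices and replaces the metadata; a new key or a sufficiently different pose gets the next free index.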

Expected output from test data

upload_1 should upload mostly as it does now
- Five rows in the RHS compound navigator
- As there are only three unique compounds, poses 1 & 2 should share the same compound ID, but have different appended pose IDs. E.g. A71EV2A-U95X-a and A71EV2A-U95X-b
upload_2 should cause the version metadata column to be updated to 2 for each compound
- the compound and pose identifiers should be used to match to the correct existing entry
- the metadata columns of the matched entries should be updated from the SDF file
upload_3 should be similar to upload_2 except that there is only a subset of compounds
upload_4 should be similar to upload_3 except that there are now also novel compounds/poses that need registering

@kaliif as this is quite a complicated specification, please can you initially investigate how difficult the various features will be. While I have extensively tested the 'flat InChI-key' matching, it can still be quite finicky, so potentially we could decide not to support having the globally unique identifiers. In that case, updating compound set metadata should only be supported if the replacement SDF has the same molecules in the same order.

mwinokan commented 4 months ago

@kaliif says that the full implementation will be a large (1 week+) chunk of work. @phraenquex says to prioritise the minimal features that will make the design dissemination less painful.

mwinokan commented 4 months ago

@kaliif as a minimal fix could you please support only the metadata superseding with persistent naming for single compound sets.

So for example support superseding upload_1 by upload_2 from the above example data.

Always take the latest metadata from the new SDF, discard the existing metadata
Use the rdkit molecule object from the new SDF
Preserve the existing compound names (assume the order of molecules in the upload is preserved)
Have some simple checks to confirm the number of molecules. I would not check for chemical similarity as the reason for upload may be to fix some bond orders in the molecule, etc.
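A minimal sketch of this shortcut, assuming the molecules arrive in the same order in both uploads. The function name `supersede` and the tuple layout are hypothetical; the molecule objects are treated as opaque values (they would be RDKit molecules in practice).

```python
def supersede(existing, new_records):
    """Replace molecules and metadata while keeping the existing names.

    existing:    list of (name, old_metadata), in upload order
    new_records: list of (mol, new_metadata) read from the new SDF
    """
    # Simple sanity check only: no chemical similarity test, since the new
    # upload may deliberately fix bond orders and so on.
    if len(existing) != len(new_records):
        raise ValueError(
            f"expected {len(existing)} molecules, got {len(new_records)}"
        )
    return [
        (name, mol, dict(new_meta))  # old metadata is discarded
        for (name, _old_meta), (mol, new_meta) in zip(existing, new_records)
    ]
```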

We will need to test this for uploads with custom protein PDBs. Please ping me when you get to that stage and I will provide further test data.

mwinokan commented 4 months ago

@kaliif says that the shortcut spec I suggested is not much easier, and Kalev has made good strides implementing the full spec.

kaliif commented 4 months ago

@mwinokan what is the expected result when I upload all 4 sets? When I try to upload, 4 compounds and 5 computedmolecules are created, the rest are discarded because of the low RMSD.

Also, for the version metadata column, I'm assuming you mean this should be stored in the db?

mwinokan commented 4 months ago

Thanks @kaliif.

4 compounds is definitely correct
5 computed could be fine, but maybe we will have to tweak the RMSD cutoff.

Can you confirm that you are checking the RMSD between each atom of the compound, and not the average RMSD?

Which two pose IDs are coming back as sufficiently unique? Is it 108 and 109 (referring to the table above)?

The version column is a placeholder for arbitrary metadata that may be in the original SDF, so yes it's important that this gets stored in the database for filtering, sorting, etc. once it's displayed in the F/E.

kaliif commented 4 months ago

@mwinokan I was using average because that's what M means in RMSD. Do you mean then simply the distance between atoms?

mwinokan commented 4 months ago

@kaliif apologies that was a miscommunication on my part!

Yes please, treat any two ligand conformations as identical unless at least one pair of matching atoms is more than 0.5 Angstrom apart.
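To illustrate why the two criteria differ, here is a small sketch comparing the averaged RMSD with the largest single-atom displacement. The coordinates are made up for the example, and atoms are assumed to be listed in the same order in both conformations.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation over all matching atom pairs."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def max_atom_distance(a, b):
    """Largest distance between any single pair of matching atoms."""
    return max(math.dist(p, q) for p, q in zip(a, b))


# Four atoms; only the last one moves, by 0.8 Angstrom.
ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
moved = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.8, 0.0)]

CUTOFF = 0.5
print(rmsd(ref, moved) < CUTOFF)               # averaged criterion: same pose
print(max_atom_distance(ref, moved) < CUTOFF)  # per-atom criterion: different pose
```

Here the RMSD is about 0.4 Angstrom because three unmoved atoms dilute the one large displacement, so an averaged cutoff would discard the second pose while the per-atom rule keeps it.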

mwinokan commented 4 months ago

Testing using Kalev's stack:

Upload_1

works great, 3 compounds, 5 poses

Upload_2 (using the update existing set)

the names are correctly persistent
metadata correctly updated
new compound set created (old one is still there, but now contains no compounds)

Delete Upload_1

Removed the upload_1 compound set correctly

Upload_3

works as expected and creates a new compound set
but the compound set from upload_2 now loses the three compounds in upload_3

Upload_4

Same as upload_3

Selected compounds

Seems to be working as expected, however the true test will be when multiple compound sets contain the same compounds/poses

Outstanding issues

[ ] The update option on upload_cset does not actually change the behaviour. When selected, the old compound set should be replaced by the new one
[ ] Consecutive uploads containing compounds from previous sets cause the duplicates to be removed from those previous sets. New uploads should not change existing compound sets, unless the 'update' option is used

mwinokan commented 4 months ago

Chatting to @kaliif in the meeting about the compound names: using the first four characters of the InChI-keys could lead to clashes.

In A71EV2A-BHKV-A, the BHKV is from the InChI-key, which may not be unique across many compounds. Please change to using an integer serial so that we can also roughly tell the order in which compounds were uploaded/registered.

E.g.: A71EV2A-1-A, A71EV2A-3423-A
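As a rough feel for how likely such clashes are, the sketch below applies the standard birthday-problem calculation to a four-letter prefix. It assumes each of the 26^4 prefixes is equally likely, which is a simplification and not a measured property of InChI-keys; `clash_probability` is a hypothetical helper name.

```python
def clash_probability(n, space=26 ** 4):
    """Probability that at least two of n compounds share a prefix,
    assuming prefixes are drawn uniformly from `space` possibilities."""
    p_all_unique = 1.0
    for i in range(n):
        p_all_unique *= (space - i) / space
    return 1.0 - p_all_unique


for n in (100, 1000, 5000):
    print(n, round(clash_probability(n), 3))
```

Under that assumption, a target with around a thousand registered compounds already has roughly a two-in-three chance of at least one shared prefix, which supports switching to a serial integer.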

kaliif commented 4 months ago

@mwinokan see if I got this right:

updating existing compound set should not create a new set but use the existing one
adding new upload (not updating) should add a new compound set but not change the existing ones, meaning computedmolecule should not be removed from the existing set, meaning computedmolecule can belong to multiple compound sets
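The two behaviours can be sketched with plain dictionaries and sets. This is only a model of the intended semantics, not the database schema: `upload` and `sets_containing` are hypothetical names, and a compound set is reduced to a set of pose names.

```python
def upload(sets, name, poses, update=False):
    """sets maps a compound-set name to the set of pose names it contains."""
    if update:
        if name not in sets:
            raise KeyError(f"no existing compound set called {name!r}")
        sets[name] = set(poses)  # reuse the existing set, replace its contents
    else:
        if name in sets:
            raise ValueError(f"compound set {name!r} already exists")
        sets[name] = set(poses)  # new set; every other set is left untouched
    return sets


def sets_containing(sets, pose):
    """A pose may belong to several compound sets at once."""
    return sorted(name for name, members in sets.items() if pose in members)


sets = {}
upload(sets, "upload_1", {"pose-1", "pose-2"})
upload(sets, "upload_3", {"pose-2"})                      # does not alter upload_1
upload(sets, "upload_1", {"pose-1", "pose-2"}, update=True)  # no new set created
print(sets_containing(sets, "pose-2"))
```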

mwinokan commented 4 months ago

@kaliif yes that's correct

mwinokan commented 4 months ago

@kaliif after speaking to Frank we want to have shorter names for the RHS compounds, that are more similar to the LHS:

Please could you change them to:

v1234a

v for virtual
1234 will be the serial index of the compound
a is the alphabetic index of the pose/conformation
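A small sketch of how such names could be generated. The helper names are hypothetical, and the behaviour after the 26th pose (rolling over to `aa`, `ab`, ...) is an assumption, since the thread does not say what should happen beyond `z`.

```python
import string


def pose_suffix(index):
    """0 -> 'a', 25 -> 'z', then 'aa', 'ab', ... (roll-over is an assumption)."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = string.ascii_lowercase[rem] + letters
    return letters


def rhs_name(serial, pose_index):
    """'v' for virtual, then the compound serial, then the pose letter(s)."""
    return f"v{serial}{pose_suffix(pose_index)}"


print(rhs_name(1234, 0))
```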

mwinokan commented 4 months ago

@kaliif says this is live on his stack. @mwinokan to test

mwinokan commented 4 months ago

After testing on 2024/06/24, the features as requested in this ticket are working correctly with the test data, thank you @kaliif!

I did spot some F/E changes that I put in #1404

mwinokan commented 3 months ago

From a zoom call with @kaliif

While preparing RHS test data for @boriskovar-m2ms, we discovered some outstanding issues on this ticket. Namely the 'version' metadata column should not be explicitly required. Instead:

Any new RHS upload containing molecules matching previously uploaded ones (matched via inchikey and atomic distance) should overwrite the previous metadata, even if this causes a loss of old metadata.

@kaliif please remove the need for the 'version' column, and implement the above. Please also include a warning in the logs including a full dump of the metadata text being overwritten, so that in the worst case it can be retrieved from the logs.
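One way the requested warning could look, using Python's standard `logging` module. The logger name `rhs_upload` and the function `overwrite_metadata` are hypothetical; metadata is assumed to be a plain dictionary.

```python
import json
import logging

logger = logging.getLogger("rhs_upload")  # hypothetical logger name


def overwrite_metadata(name, old, new):
    """Return the new metadata, logging a full dump of whatever it replaces."""
    if old and old != new:
        logger.warning(
            "Overwriting metadata for %s; previous values: %s",
            name,
            # default=str keeps the dump from failing on non-JSON values
            json.dumps(old, sort_keys=True, default=str),
        )
    return dict(new)
```

Dumping the old values as a single JSON string keeps each overwrite on one log line, which makes it easier to search for and restore later.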

A final thought: are the unprocessed SDFs of each upload stored on the filesystem somewhere? Those could help in the rare case of unwanted metadata loss.

mwinokan commented 2 months ago

@kaliif please check that the above was implemented (I think it was) and then update the status of this ticket

kaliif commented 2 months ago

This is done and merged to staging

m2ms / fragalysis-frontend