Open mwinokan opened 7 months ago
@mwinokan to provide @kaliif with example data and firm spec for the persistent compound naming
@kaliif a fresh A71EV2A dataset has been uploaded to staging. I have prepared four test compound sets for testing the incremental upload. These four SDFs all work in the current staging but they create four separate compound sets, with every compound having a unique identifier (even though there are many shared chemical structures). I have added in a version column in the SDF to test the updating of metadata.
Across the four sets there are only four unique compounds, with seven unique poses:
Compound ID | Pose ID | upload_1 | upload_2 | upload_3 | upload_4 |
---|---|---|---|---|---|
90 | 108 | Y | Y | ||
90 | 109 | Y | Y | Y | Y |
2165 | 6580 | Y | Y | Y | Y |
2165 | 6581 | Y | Y | Y | Y |
2183 | 6588 | Y | Y | Y | |
2246 | 6620 | Y | |||
2246 | 6621 | Y |
Apologies for using the pose nomenclature to mean something different from the LHS (that's what I call it in my software "HIPPO"). The compound IDs and pose IDs are the unique identifiers from my database.
A71EV2A-U95X
A71EV2A-U95X-a
upload_1 should upload mostly as it does now
A71EV2A-U95X-a and A71EV2A-U95X-b
upload_2 should cause the version metadata column to be updated to 2 for each compound
upload_3 should be similar to upload_2 except that there is only a subset of compounds
upload_4 should be similar to upload_3 except that there are now also novel compounds/poses that need registering
@kaliif as this is quite a complicated specification, please can you initially investigate how difficult the various features will be. While I have extensively tested the 'flat InChI-key' matching, it can still be quite finicky, so potentially we could decide not to support having the globally unique identifiers. In that case, updating compound set metadata should only be supported if the replacement SDF has the same molecules in the same order.
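For reference, a minimal sketch of what 'flat InChI-key' matching could look like. The thread does not define the term, so this assumes it means comparing only the first 14-character block of a standard InChIKey (the connectivity hash), which ignores stereochemistry and protonation. The function names are hypothetical and not taken from HIPPO or Fragalysis.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def flat_inchikey(inchikey: str) -> str:
    """Return the first (connectivity) block of a standard InChIKey.

    Assumption: 'flat' matching compares only this 14-letter block.
    """
    block = inchikey.strip().upper().split("-")[0]
    if len(block) != 14 or not block.isalpha():
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return block


def group_by_flat_key(inchikeys: Iterable[str]) -> Dict[str, List[str]]:
    """Group full InChIKeys that share the same connectivity block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in inchikeys:
        groups[flat_inchikey(key)].append(key)
    return dict(groups)


# L-alanine, D-alanine and ethanol: the two alanine enantiomers share a flat key.
keys = [
    "QNAYBMKLOCPYGJ-REOHZXBPSA-N",
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
]
groups = group_by_flat_key(keys)
```

Under this reading, stereoisomers collapse onto one compound, which may be part of why the matching can feel finicky in practice.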
@kaliif says that the full implementation will be a large (1 week+) chunk of work. @phraenquex says to prioritise the minimal features that will make the design dissemination less painful.
@kaliif as a minimal fix could you please support only the metadata superseding with persistent naming for single compound sets.
So for example support superseding upload_1 by upload_2 from the above example data.
We will need to test this for uploads with custom protein PDBs. Please ping me when you get to that stage and I will provide further test data.
@kaliif says that the shortcut spec I suggested is not much easier, and Kalev has made good strides implementing the full spec.
@mwinokan what is the expected result when I upload all 4 sets? When I try to upload, 4 compounds and 5 computedmolecules are created, the rest are discarded because of the low RMSD.
Also, regarding the version metadata column: I'm assuming you mean this should be stored in the db?
Thanks @kaliif.
Can you confirm that you are checking the RMSD between each atom of the compound, and not the average RMSD?
Which two pose IDs are coming back as sufficiently unique? Is it 108 and 109 (referring to the table above)?
The version column is a placeholder for arbitrary metadata that may be in the original SDF, so yes, it's important that this gets stored in the database for filtering, sorting, etc. once it's displayed in the F/E.
@mwinokan I was using average because that's what M means in RMSD. Do you mean then simply the distance between atoms?
@kaliif apologies that was a miscommunication on my part!
Yes please, treat any two ligand conformations as identical unless at least one pair of matching atoms is 0.5 Angstrom or more apart.
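The difference between the average (RMSD) check and the per-atom check can be sketched as below. This is only an illustration: it assumes the two conformations list their atoms in the same, already matched order, and it treats a distance of exactly 0.5 Angstrom as "different", which is one reading of the rule above. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def max_atom_distance(pose_a: Sequence[Coord], pose_b: Sequence[Coord]) -> float:
    """Largest distance between corresponding atoms of two conformations."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    return max((math.dist(a, b) for a, b in zip(pose_a, pose_b)), default=0.0)


def rmsd(pose_a: Sequence[Coord], pose_b: Sequence[Coord]) -> float:
    """Root-mean-square deviation over corresponding atoms (shown for contrast)."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and the same length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(pose_a, pose_b)]
    return math.sqrt(sum(squared) / len(squared))


def is_same_pose(pose_a: Sequence[Coord], pose_b: Sequence[Coord],
                 threshold: float = 0.5) -> bool:
    """True unless at least one matched atom pair is `threshold` or more apart."""
    return max_atom_distance(pose_a, pose_b) < threshold


reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Only the last atom moves, by 0.6 Angstrom.
one_atom_moved = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.6, 0.0)]
```

In this example the RMSD stays below 0.5 because the displacement is averaged over three atoms, while the per-atom check flags the pose as distinct.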
Testing using Kalev's stack:
upload_cset does not actually change the behaviour. When selected, the old compound set should be replaced by the new one.

Chatting to @kaliif in the meeting about the compound names, using the first four characters of the InChI-keys could lead to clashes:
In A71EV2A-BHKV-A, the BHKV is from the InChI-key, which may not be unique across many compounds. Please change to using an integer serial so that we can also roughly tell the order in which compounds were uploaded/registered.
E.g.: A71EV2A-1-A, A71EV2A-3423-A
@mwinokan see if I got this right:
@kaliif yes that's correct
@kaliif after speaking to Frank we want to have shorter names for the RHS compounds, that are more similar to the LHS:
Please could you change them to:
v1234a
v stands for virtual
1234 will be the serial index of the compound
a is the alphabetic index of the pose/conformation

@kaliif says this is live on his stack. @mwinokan to test
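The v1234a naming scheme requested above could be generated along these lines. This is a sketch, not the code on Kalev's stack: the function names are hypothetical, and the rollover to two letters (aa, ab, ...) after 26 poses is an assumption that the thread does not specify.

```python
def pose_suffix(index: int) -> str:
    """Map a zero-based pose index to letters: 0 -> 'a', 25 -> 'z', 26 -> 'aa'.

    The multi-letter rollover past 'z' is an assumption.
    """
    if index < 0:
        raise ValueError("pose index must be non-negative")
    letters = ""
    n = index + 1
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("a") + remainder) + letters
    return letters


def virtual_name(serial: int, pose_index: int) -> str:
    """Build a name such as 'v1234a' from a compound serial and a pose index."""
    if serial < 1:
        raise ValueError("compound serial must be a positive integer")
    return f"v{serial}{pose_suffix(pose_index)}"


first_pose = virtual_name(1234, 0)   # first pose of compound 1234
second_pose = virtual_name(1234, 1)  # second pose of the same compound
```

Because the serial only ever increases, the name also gives a rough indication of registration order, as requested earlier in the thread.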
After testing on 2024/06/24, the features as requested in this ticket are working correctly with the test data, thank you @kaliif!
I did spot some F/E changes that I put in #1404
From a zoom call with @kaliif
While preparing RHS test data for @boriskovar-m2ms, we discovered some outstanding issues on this ticket. Namely the 'version' metadata column should not be explicitly required. Instead:
Any new RHS upload containing molecules matching previously uploaded ones (matched via inchikey and atomic distance) should overwrite the previous metadata, even if this causes a loss of old metadata.
@kaliif please remove the need for the 'version' column, and implement the above. Please also include a warning in the logs including a full dump of the metadata text being overwritten, so that in the worst case it can be retrieved from the logs.
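A minimal sketch of the overwrite-with-warning behaviour requested here, using only the Python standard library. The logger name, function name and dictionary-based metadata are assumptions for illustration and do not reflect the actual Fragalysis models.

```python
import json
import logging

logger = logging.getLogger("cset_upload")  # hypothetical logger name


def overwrite_metadata(name: str, old: dict, new: dict) -> dict:
    """Return the metadata to store for a matched molecule.

    The new metadata fully replaces the old. If anything would change or be
    lost, the complete old metadata is dumped to the log first so that it can
    be recovered later.
    """
    if old and old != new:
        logger.warning(
            "Overwriting metadata for %s; previous values: %s",
            name,
            json.dumps(old, sort_keys=True),
        )
    return dict(new)


stored = overwrite_metadata(
    "v1234a",
    {"version": "1", "note": "from upload_1"},
    {"version": "2"},
)
```

Dumping the old values as sorted JSON keeps each log line self-contained and easy to search for a given compound name.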
A final thought: are the unprocessed SDFs of each upload stored on the filesystem somewhere? Those could help in the rare case of unwanted metadata loss.
@kaliif please check that the above was implemented (I think it was) and then update the status of this ticket
This is done and merged to staging
It's not possible to overwrite an RHS CSET upload; a duplicate is made instead. Ideally, when updating a compound set through incremental uploads that only add metadata to existing entries, the compounds should keep a persistent name.