m2ms / fragalysis-frontend

The React, Redux frontend built by webpack

Support incremental RHS uploads with persistent names #1394

Open mwinokan opened 7 months ago

mwinokan commented 7 months ago

It's not possible to overwrite an RHS CSET upload; a duplicate is made instead. Ideally, when updating a compound set where incremental uploads only add metadata to existing entries, the compounds should keep a persistent name.

mwinokan commented 5 months ago

@mwinokan to provide @kaliif with example data and firm spec for the persistent compound naming

mwinokan commented 5 months ago

Test Dataset

@kaliif a fresh A71EV2A dataset has been uploaded to staging. I have prepared four test compound sets for testing the incremental upload. These four SDFs all work in the current staging but they create four separate compound sets, with every compound having a unique identifier (even though there are many shared chemical structures). I have added in a version column in the SDF to test the updating of metadata.

A71EV2A_incremental_csets.zip

Across the four sets there are only four unique compounds, with seven unique poses:

Compound ID Pose ID upload_1 upload_2 upload_3 upload_4
90 108 Y Y
90 109 Y Y Y Y
2165 6580 Y Y Y Y
2165 6581 Y Y Y Y
2183 6588 Y Y Y
2246 6620 Y
2246 6621 Y

Apologies for using the pose nomenclature to mean something different from the LHS (that's what I call it in my software, "HIPPO"). The compound IDs and pose IDs are the unique identifiers from my database.

Specification

Expected output from test data

  1. upload_1 should upload mostly as it does now

    • Five rows in the RHS compound navigator
    • As there are only three unique compounds, poses 1&2 should share the same compound ID but have different appended pose IDs, e.g. A71EV2A-U95X-a and A71EV2A-U95X-b
  2. upload_2 should cause the version metadata column to be updated to 2 for each compound

    • the compound and pose identifiers should be used to match to the correct existing entry
    • the metadata columns of the matched entries should be updated from the SDF file
  3. upload_3 should be similar to upload_2 except that there is only a subset of compounds

  4. upload_4 should be similar to upload_3 except that there are now also novel compounds/poses that need registering
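The four expected behaviours above amount to an "update if matched, register if novel" merge. A minimal sketch of that logic follows, assuming each SDF row has already been reduced to a compound ID, a pose ID, and a metadata dictionary. The names `Entry` and `apply_upload` are hypothetical and do not come from the Fragalysis codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One row of a compound set (hypothetical structure for illustration)."""
    compound_id: int
    pose_id: int
    metadata: dict = field(default_factory=dict)


def apply_upload(existing: dict, upload: list) -> dict:
    """Merge an upload into `existing`, keyed by (compound_id, pose_id).

    Matched entries get their metadata updated from the upload; unmatched
    rows are registered as new entries. Entries absent from the upload are
    left untouched, which covers the "subset" case.
    """
    counts = {"updated": 0, "registered": 0}
    for row in upload:
        key = (row.compound_id, row.pose_id)
        if key in existing:
            # Matched an existing entry: refresh its metadata columns.
            existing[key].metadata.update(row.metadata)
            counts["updated"] += 1
        else:
            # Novel compound/pose: register it under its own key.
            existing[key] = Entry(row.compound_id, row.pose_id, dict(row.metadata))
            counts["registered"] += 1
    return counts
```

With made-up IDs, a first call with two rows registers two entries; a second call containing one of those rows plus a new one reports one update and one registration, and the row that was not re-uploaded keeps its old metadata.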

@kaliif as this is quite a complicated specification, please can you initially investigate how difficult the various features will be? While I have extensively tested the 'flat InChI-key' matching, it can still be quite finicky, so we could potentially decide not to support globally unique identifiers. In that case, updating compound set metadata should only be supported if the replacement SDF has the same molecules in the same order.

mwinokan commented 4 months ago

@kaliif says that the full implementation will be a large (1 week+) chunk of work. @phraenquex says to prioritise the minimal features that will make the design dissemination less painful.

mwinokan commented 4 months ago

@kaliif as a minimal fix could you please support only the metadata superseding with persistent naming for single compound sets.

So for example support superseding upload_1 by upload_2 from the above example data.

We will need to test this for uploads with custom protein PDBs. Please ping me when you get to that stage and I will provide further test data.

mwinokan commented 4 months ago

@kaliif says that the shortcut spec I suggested is not much easier, and Kalev has made good strides implementing the full spec.

kaliif commented 4 months ago

@mwinokan what is the expected result when I upload all 4 sets? When I try to upload, 4 compounds and 5 computedmolecules are created, the rest are discarded because of the low RMSD.

Also, regarding the version metadata column: I'm assuming you mean this should be stored in the db?

mwinokan commented 4 months ago

Thanks @kaliif.

Can you confirm that you are checking the RMSD between each atom of the compound, and not the average RMSD?

Which two pose IDs are coming back as sufficiently unique? Is it 108 and 109 (referring to the table above)?

The version column is a placeholder for arbitrary metadata that may be in the original SDF, so yes it's important that this gets stored in the database for filtering, sorting, etc. once it's displayed in the F/E.

kaliif commented 4 months ago

@mwinokan I was using average because that's what M means in RMSD. Do you mean then simply the distance between atoms?

mwinokan commented 4 months ago

@kaliif apologies that was a miscommunication on my part!

Yes please, treat any two ligand conformations as identical unless at least one pair of matching atoms is at least 0.5 Angstrom apart.
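To illustrate why the per-atom criterion differs from an RMSD cutoff, here is a minimal sketch. It assumes both poses list the same atoms in the same order as (x, y, z) tuples in Angstrom and are non-empty; the function names are hypothetical, and real atom matching (e.g. for symmetric molecules) is not handled.

```python
import math


def atom_distances(pose_a, pose_b):
    """Distances between matching atoms of two poses (same atom order assumed)."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    return [math.dist(a, b) for a, b in zip(pose_a, pose_b)]


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation over all matching atom pairs."""
    d = atom_distances(pose_a, pose_b)
    return math.sqrt(sum(x * x for x in d) / len(d))


def same_pose(pose_a, pose_b, threshold=0.5):
    """True when no matching atom pair is `threshold` Angstrom or more apart."""
    return max(atom_distances(pose_a, pose_b)) < threshold
```

For a ten-atom ligand where a single atom moves by 1.0 Angstrom and the rest stay put, the RMSD is sqrt(1/10), roughly 0.32, so an RMSD cutoff of 0.5 would discard the second pose as a duplicate, whereas `same_pose` keeps it as distinct.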

mwinokan commented 4 months ago

Testing using Kalev's stack:

Upload_1

Upload_2 (using the update existing set)

Delete Upload_1

Upload_3

Upload_4

Selected compounds

Outstanding issues

mwinokan commented 4 months ago

Chatting to @kaliif in the meeting about the compound names: using the first four characters of the InChI-keys could lead to clashes.

In A71EV2A-BHKV-A, the BHKV is from the InChI-key, which may not be unique across many compounds. Please change to using an integer serial so that we can also roughly tell the order in which compounds were uploaded/registered.

I.e.: A71EV2A-1-A, A71EV2A-3423-A
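As a rough illustration of the clash risk, the birthday-problem estimate below assumes the four-letter prefixes are spread uniformly over at most 26^4 = 456,976 values. That uniformity is an assumption; if real InChI-key prefixes are less evenly spread, clashes become more likely, not less. The function name is hypothetical.

```python
import math

PREFIX_SPACE = 26 ** 4  # four uppercase letters: at most 456,976 prefixes


def clash_probability(n_compounds: int, space: int = PREFIX_SPACE) -> float:
    """Probability that at least two of n compounds share a prefix (uniform assumption)."""
    if n_compounds > space:
        return 1.0  # pigeonhole: more compounds than prefixes
    # Sum the logs of the chance that each new compound avoids all earlier prefixes.
    log_p_unique = sum(math.log1p(-i / space) for i in range(n_compounds))
    return 1.0 - math.exp(log_p_unique)
```

Under these assumptions, `clash_probability(1000)` comes out at about two in three, so clashes are to be expected long before the prefix space is exhausted. An integer serial avoids the problem by construction.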

kaliif commented 4 months ago

@mwinokan see if I got this right:

mwinokan commented 4 months ago

@kaliif yes that's correct

mwinokan commented 4 months ago

@kaliif after speaking to Frank we want to have shorter names for the RHS compounds, that are more similar to the LHS:

Please could you change them to:

v1234a
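A minimal sketch of building and parsing names of this shape, assuming the number is the compound's integer serial and the trailing lowercase letter identifies the pose (a for the first, b for the second, and so on). The thread does not spell out these semantics or what happens beyond 26 poses, so both are assumptions here, and the helper names are hypothetical.

```python
import re
import string

# Assumed pattern: literal "v", integer serial, one lowercase pose letter.
NAME_RE = re.compile(r"^v(\d+)([a-z])$")


def short_name(serial: int, pose_index: int) -> str:
    """Build a name such as v1234a from a serial and a zero-based pose index."""
    if serial < 1:
        raise ValueError("serial must be a positive integer")
    if not 0 <= pose_index < 26:
        raise ValueError("this sketch only supports up to 26 poses per compound")
    return f"v{serial}{string.ascii_lowercase[pose_index]}"


def parse_short_name(name: str) -> tuple:
    """Split a name such as v1234a back into (serial, pose_index)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a short compound name: {name!r}")
    return int(match.group(1)), string.ascii_lowercase.index(match.group(2))
```

Here `short_name(1234, 0)` gives `v1234a`, and `parse_short_name("v1234a")` returns `(1234, 0)`.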

mwinokan commented 4 months ago

@kaliif says this is live on his stack. @mwinokan to test

mwinokan commented 4 months ago

After testing on 2024/06/24, the features as requested in this ticket are working correctly with the test data, thank you @kaliif!

I did spot some F/E changes that I put in #1404

mwinokan commented 3 months ago

From a Zoom call with @kaliif:

While preparing RHS test data for @boriskovar-m2ms, we discovered some outstanding issues on this ticket. Namely the 'version' metadata column should not be explicitly required. Instead:

Any new RHS upload containing molecules matching previously uploaded ones (matched via inchikey and atomic distance) should overwrite the previous metadata, even if this causes a loss of old metadata.

@kaliif please remove the need for the 'version' column, and implement the above. Please also log a warning containing a full dump of the metadata text being overwritten, so that in the worst case it can be retrieved from the logs.
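The overwrite-with-warning behaviour requested above could look roughly like the sketch below. It assumes metadata is held as a plain dictionary per entry; the logger name and function name are hypothetical, not taken from the backend.

```python
import json
import logging

logger = logging.getLogger("cset_upload")  # hypothetical logger name


def overwrite_metadata(entry_name: str, old: dict, new: dict) -> dict:
    """Replace `old` metadata with `new`, logging the old values before they are lost."""
    if old and old != new:
        # Dump the full previous metadata so it can be recovered from the logs.
        logger.warning(
            "Overwriting metadata for %s; previous values: %s",
            entry_name,
            json.dumps(old, sort_keys=True, default=str),
        )
    # Full replacement: keys missing from `new` are dropped on purpose.
    return dict(new)
```

Note that this replaces the metadata wholesale instead of merging it, so a column present only in the earlier upload disappears from the entry and survives only in the log line.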

A final thought: are the unprocessed SDFs of each upload stored on the filesystem somewhere? Those could help in the rare case of unwanted metadata loss.

mwinokan commented 2 months ago

@kaliif please check that the above was implemented (I think it was) and then update the status of this ticket

kaliif commented 2 months ago

This is done and merged to staging