Verify correct RHS compound ID behaviour

mwinokan commented 2 months ago

Currently for CHIKV_Mac these similar molecules are assigned four different compound ID's

OC[C@@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl: v4a
OC[C@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl: v5a
OC[C@@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl: v6a
OC[C@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl: v7a

Apart from 3d conformer these should be considered the same "soaked" molecule because the spec we discussed in #1394 would use the flattened inchikey which is the same for all four molecules (WHLDZOYDHAHKLT-UHFFFAOYSA-N).

@kaliif please could you investigate how the four compounds from this C1965_poses_example.sdf.zip are all assigned different compound ID's?

kaliif commented 1 month ago

@mwinokan the spec had 2 key points - flattened structure and distance in 3D space between atoms. The distance limit is set to 0.5A and if you look at the corresponding atom coordinates in the file, they are quite different and far from each other. The way I read the spec, saving four different compound is correct behaviour here.

mwinokan commented 1 month ago

Thanks @kaliif, indeed those were the two criteria I remember as well. Having read the original spec comment, I think I was not sufficiently clear that the first should be used to identify similar compounds, while the second is to identify similar poses.

So the four smiles strings should be flattened, after which they correspond to the same chemical structure and thus get assigned the same compound ID, e.g. v4.

Then if the 3d coordinates are sufficiently different they will all be assigned separate pose ID's: e.g. v4a, v4b etc.

Does that make sense?

kaliif commented 1 month ago

@mwinokan I think it does and I believe that's exactly what's happening with the exception that the compound IDs are not compound IDs but pose IDs. This is how it looks like in the database:

frag=# select * from viewer_compound where inchi_key = 'WHLDZOYDHAHKLT-UHFFFAOYSA-N';
 id  |                                                    inchi                                                     |            smiles            |     current_identifier      | all_identifiers | description | comments | compound_code |          inchi_key          
-----+--------------------------------------------------------------------------------------------------------------+------------------------------+-----------------------------+-----------------+-------------+----------+---------------+-----------------------------
 130 | InChI=1S/C13H16ClN3O/c14-12-10(4-5-11-13(12)16-7-15-11)17-9-3-1-2-8(9)6-18/h4-5,7-9,17-18H,1-3,6H2,(H,15,16) | OCC1CCCC1Nc1ccc2nc[nH]c2c1Cl | WHLDZOYDHAHKLT-IUCAKERBSA-N |                 |             |          |               | WHLDZOYDHAHKLT-UHFFFAOYSA-N

frag=# select id, name, smiles, compound_id, identifier, molecule_name, site_observation_code, reference_code, pdb_id from viewer_computedmolecule where compound_id = 130;
 id | name |                 smiles                 | compound_id | identifier |        molecule_name        | site_observation_code | reference_code | pdb_id 
----+------+----------------------------------------+-------------+------------+-----------------------------+-----------------------+----------------+--------
  1 | v1a  | OC[C@@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl |         130 | 3PZ5       | WHLDZOYDHAHKLT-IUCAKERBSA-N | c0300a                | c0300a         |      8
  2 | v2a  | OC[C@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl  |         130 | 9F4X       | WHLDZOYDHAHKLT-BDAKNGLRSA-N | c0300a                | c0300a         |      8
  3 | v3a  | OC[C@@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl  |         130 | RE9D       | WHLDZOYDHAHKLT-DTWKUNHWSA-N | c0300a                | c0300a         |      8
  4 | v4a  | OC[C@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl   |         130 | VAEU       | WHLDZOYDHAHKLT-RKDXNWHRSA-N | c0300a                | c0300a         |      8

A single compound gets created, but as the coordinates are different, 4 poses (ComputedMolecules in backend terminology). It's from my local build so the pose IDs are different but I think you'll see what's happening. Is that what you had in mind with the spec?

If you need them grouped under a single compound ID, then the concept of compound identifier needs revisiting, currently compound model doesn't have anything human-readable in the table.

mwinokan commented 1 month ago

@kaliif if the RHS molecules have the same compound_id they should be named with the same number/prefix: i.e. v4. Then just like in the LHS, different poses can be signified with the observation letter appended to the name: e.g. v4a, v4b

mwinokan commented 1 month ago

Hi @kaliif I have encountered an error I have not seen before when trying to upload a new conformer of an existing RHS compound:

Traceback (most recent call last): File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task R = retval = fun(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery.py", line 325, in _inner reraise(*exc_info) File "/.venv/lib/python3.11/site-packages/sentry_sdk/_compat.py", line 127, in reraise raise value File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery.py", line 320, in _inner return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__ return self.run(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/code/viewer/tasks.py", line 106, in process_compound_set compound_set = save_mols.task() ^^^^^^^^^^^^^^^^ File "/code/viewer/cset_upload.py", line 686, in task _ = self.process_mol( ^^^^^^^^^^^^^^^^^ File "/code/viewer/cset_upload.py", line 557, in process_mol return self.set_props(cpd, other_props, compound_set) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/code/viewer/cset_upload.py", line 333, in set_props score_value.score = ScoreDescription.objects.get( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/django/db/models/manager.py", line 85, in manager_method return getattr(self.get_queryset(), name)(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 439, in get raise self.model.MultipleObjectsReturned( viewer.models.ScoreDescription.MultipleObjectsReturned: get() returned more than one ScoreDescription -- it returned 2!

/viewer/upload_task/8ac187e7-dca0-4a9f-b06e-8921edf6f5cc/

The SDFs I used are in this archive: A71EV2A_fakeRHS_1524_test.zip

kaliif commented 1 month ago

@mwinokan what is the expected behaviour?

mwinokan commented 1 month ago

@kaliif In this case it's a new conformer for an existing compound so it should get assigned something like v1b. If needed I can get back to you with a more precise specification of the assigned IDs

kaliif commented 1 month ago

@mwinokan ok, that part does actually happen, v1b is being created. The problem occurs later, when it's saving properties, NumericalScoreValues and TextScoreValues and descriptions for these, which are stored in the third object type - ScoreDescription. The error you're seeing is that the backend has created two ScoreDescription instances for a property (<method> in this case) and doesn't know which one to use. This is the sate at the moment of the crash:

frag=# select * from viewer_scoredescription;
 id |         name          |             description             |          computed_set_id          
----+-----------------------+-------------------------------------+-----------------------------------
  1 | ref_url               | https://hippo.winokan.com           | A71EV2A_lhs2rhs_2024-2024-10-24-A
  2 | submitter_name        | Max Winokan                         | A71EV2A_lhs2rhs_2024-2024-10-24-A
  3 | submitter_email       | max.winokan@diamond.ac.uk           | A71EV2A_lhs2rhs_2024-2024-10-24-A
  4 | submitter_institution | DLS                                 | A71EV2A_lhs2rhs_2024-2024-10-24-A
  5 | generation_date       | 2023-08-21                          | A71EV2A_lhs2rhs_2024-2024-10-24-A
  6 | method                | A71EV2A_lhs2rhs_20241010            | A71EV2A_lhs2rhs_2024-2024-10-24-A
  7 | ID                    | ver_1.2                             | A71EV2A_lhs2rhs_2024-2024-10-24-A
  8 | method                | A71EV2A_lhs2rhs_20241010_first3_mod | A71EV2A_lhs2rhs_2024-2024-10-24-A

Looking at the code, I'm not sure what would be the correct course of action here:

treat description field as significant and consider instances 6 and 8 as distinct objects or
consider name to be unique and prevent creating new objects with same name?

mwinokan commented 1 month ago

Thanks @kaliif for explaining how the error occurs. It's not instantly clear what the most elegant solution is, so let me ask some follow up questions:

Correct me if I'm wrong, but if I had uploaded this second SDF with the method value as A71EV2A_lhs2rhs_20241010 (i.e. the same as the first SDF) it would not have caused this issue?
If you implemented the second option; would this then throw an error if a subsequent upload references the same molecule as uploaded previously but with a different method (or other name) value?
It definitely seems desirable to me to have two uploads of the same molecules with different metadata to be supported, and for those different origins to be documented. This would rule out your second option, right?
Would your first option mean that the API serves an iterable of description values for each name key of a given molecule/conformer?

It seems to me that this issue needs more careful thought, spec, and both b/e and f/e work, in the meantime would it be easy to instead provide a warning and use the description values from the first upload? The warning could say something like:

Warning: using existing "method" value of "A71EV2A_lhs2rhs_20241010" instead of "A71EV2A_lhs2rhs_20241010_first3_mod" for existing compound "v1"

PS (more of a note to self):

The motivation for specifying a different method was to clearly distinguish the two compound sets as viewed in the f/e, as sets with the same method will have very similar looking computed_set_id values. (That's really a problem to fix on the f/e as the text is often clipped so you only see the first part)

kaliif commented 1 month ago

Correct me if I'm wrong, but if I had uploaded this second SDF with the method value as A71EV2A_lhs2rhs_20241010 (i.e. the same as the first SDF) it would not have caused this issue?

@mwinokan correct, the code that creates ScoreDesciptions uses get_or_create method, so if the object with the given parameters exist, the new one won't be created, and the subsequent lookup would not stumble upon multiple objects.

If you implemented the second option; would this then throw an error if a subsequent upload references the same molecule as uploaded previously but with a different method (or other name) value?

No, it would be the same get_or_create method as above, if exists, no need to create a new one.

It definitely seems desirable to me to have two uploads of the same molecules with different metadata to be supported, and for those different origins to be documented. This would rule out your second option, right?

Yes. That's really what it boils down to - is the description field significant or not.

Would your first option mean that the API serves an iterable of description values for each name key of a given molecule/conformer?

There's an endpoint api/compound-scores/ which retrieves the contents of ScoreDescription table, and api/numerical-scores/ and api/text-scores/ which return the individual score values. The linked ScoreDesciption value is nested under the score object, so while not 100% sure, it seems nothing needs to change in the FE in that respect.

It seems to me that this issue needs more careful thought, spec, and both b/e and f/e work, in the meantime would it be easy to instead provide a warning and use the description values from the first upload? The warning could say something like:

Warning: using existing "method" value of "A71EV2A_lhs2rhs_20241010" instead of "A71EV2A_lhs2rhs_20241010_first3_mod" for existing compound "v1"

Not for me to say which is the correct solution, but I'll just mention that applying either solution 1 or 2 is waaay quicker than issuing a warning, currently there's no mechanism to send messages upon successful completion, and adding this feature would require quite a bit of work making the Django view and Celery task play together nicely. Doable, but it's not a few lines of code.

mwinokan commented 1 month ago

Thanks for explaining @kaliif. Let's go for your option 1:

treat description field as significant and consider instances 6 and 8 as distinct objects

Could this exception have been encountered before the work in you did in this ticket? I'm asking because if the bug is already present in production then I don't mind pushing the current staging version to production, and I'll just have to remember to be more careful with molecule duplicates

kaliif commented 1 month ago

@mwinokan this code is certainly in production now so it's possible someone has seen it (and chosen to ignore the error message because I don't believe this has been reported before). The result then is that some molecules, certain scores and descriptors, really anything in the file after the point of failure, would not have been added to the database. And no, the already added objects would not have been cleaned up, this process doesn't run in a transaction block

It would be really good idea to reversible transaction block but there's lots of code and multiple celery tasks involved so it's probably several days worth of work. Some later release maybe?

kaliif commented 1 month ago

@mwinokan something else I noticed - uploading the second sdf overwrites the one uploaded before, so when you're trying to download later, you'll only get the file with 2 compounds not 5. Is that a problem?

phraenquex commented 3 weeks ago

@mwinokan to summarise: @kaliif asks whether RHS uploads should overwrite the old stuff, so it vanishes forever.

@phraenquex wonders whether it's really a F/E thing: whether what is needed is that uploader should be able to hide things that are superceded.

@mwinokan to finalise the spec.

mwinokan commented 3 weeks ago

@kaliif asks how multiple SDF uploads referring to the same compound sets should be handled:

@mwinokan says that the f/e should offer:

an SDF download that reflects exactly the compound set as it is displayed in the f/e (metadata and fragalyis-assigned IDs)
one or more SDF in their original state as uploaded, for debugging or otherwise

Waztom commented 2 weeks ago

This will be resolved once #1432 is fixed.

mwinokan commented 1 week ago

@kaliif did any changes make it into production in the last release?

There seems to be a different but also incorrect behaviour now where compounds of clearly different chemical structure are assigned the same ID:

kaliif commented 1 week ago

@mwinokan no, this is not in prod yet. Looks like a separate issue.

mwinokan commented 1 week ago

@mwinokan to test and isolate the new issue observed with FatA on production

mwinokan commented 4 days ago

@kaliif the problem can be recreated with the A71EV2A target and the A71EV2A_lhs2rhs_20241010_capitalprefix.sdf RHS example data in the shared folder. All compounds given the ID V1A:

Screenshot 2024-11-19 at 16 26 38

m2ms / fragalysis-frontend

Verify correct RHS compound ID behaviour #1524