Open mwinokan opened 2 months ago
@mwinokan the spec had 2 key points - flattened structure and distance in 3D space between atoms. The distance limit is set to 0.5A and if you look at the corresponding atom coordinates in the file, they are quite different and far from each other. The way I read the spec, saving four different compound is correct behaviour here.
Thanks @kaliif, indeed those were the two criteria I remember as well. Having read the original spec comment, I think I was not sufficiently clear that the first should be used to identify similar compounds, while the second is to identify similar poses.
So the four smiles strings should be flattened, after which they correspond to the same chemical structure and thus get assigned the same compound ID, e.g. v4
.
Then if the 3d coordinates are sufficiently different they will all be assigned separate pose ID's: e.g. v4a
, v4b
etc.
Does that make sense?
@mwinokan I think it does and I believe that's exactly what's happening with the exception that the compound IDs are not compound IDs but pose IDs. This is how it looks like in the database:
frag=# select * from viewer_compound where inchi_key = 'WHLDZOYDHAHKLT-UHFFFAOYSA-N';
id | inchi | smiles | current_identifier | all_identifiers | description | comments | compound_code | inchi_key
-----+--------------------------------------------------------------------------------------------------------------+------------------------------+-----------------------------+-----------------+-------------+----------+---------------+-----------------------------
130 | InChI=1S/C13H16ClN3O/c14-12-10(4-5-11-13(12)16-7-15-11)17-9-3-1-2-8(9)6-18/h4-5,7-9,17-18H,1-3,6H2,(H,15,16) | OCC1CCCC1Nc1ccc2nc[nH]c2c1Cl | WHLDZOYDHAHKLT-IUCAKERBSA-N | | | | | WHLDZOYDHAHKLT-UHFFFAOYSA-N
frag=# select id, name, smiles, compound_id, identifier, molecule_name, site_observation_code, reference_code, pdb_id from viewer_computedmolecule where compound_id = 130;
id | name | smiles | compound_id | identifier | molecule_name | site_observation_code | reference_code | pdb_id
----+------+----------------------------------------+-------------+------------+-----------------------------+-----------------------+----------------+--------
1 | v1a | OC[C@@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl | 130 | 3PZ5 | WHLDZOYDHAHKLT-IUCAKERBSA-N | c0300a | c0300a | 8
2 | v2a | OC[C@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl | 130 | 9F4X | WHLDZOYDHAHKLT-BDAKNGLRSA-N | c0300a | c0300a | 8
3 | v3a | OC[C@@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl | 130 | RE9D | WHLDZOYDHAHKLT-DTWKUNHWSA-N | c0300a | c0300a | 8
4 | v4a | OC[C@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl | 130 | VAEU | WHLDZOYDHAHKLT-RKDXNWHRSA-N | c0300a | c0300a | 8
A single compound gets created, but as the coordinates are different, 4 poses (ComputedMolecules in backend terminology). It's from my local build so the pose IDs are different but I think you'll see what's happening. Is that what you had in mind with the spec?
If you need them grouped under a single compound ID, then the concept of compound identifier needs revisiting, currently compound model doesn't have anything human-readable in the table.
@kaliif if the RHS molecules have the same compound_id they should be named with the same number/prefix: i.e. v4
. Then just like in the LHS, different poses can be signified with the observation letter appended to the name: e.g. v4a
, v4b
Hi @kaliif I have encountered an error I have not seen before when trying to upload a new conformer of an existing RHS compound:
Traceback (most recent call last): File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task R = retval = fun(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery.py", line 325, in _inner reraise(*exc_info) File "/.venv/lib/python3.11/site-packages/sentry_sdk/_compat.py", line 127, in reraise raise value File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery.py", line 320, in _inner return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__ return self.run(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/code/viewer/tasks.py", line 106, in process_compound_set compound_set = save_mols.task() ^^^^^^^^^^^^^^^^ File "/code/viewer/cset_upload.py", line 686, in task _ = self.process_mol( ^^^^^^^^^^^^^^^^^ File "/code/viewer/cset_upload.py", line 557, in process_mol return self.set_props(cpd, other_props, compound_set) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/code/viewer/cset_upload.py", line 333, in set_props score_value.score = ScoreDescription.objects.get( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/django/db/models/manager.py", line 85, in manager_method return getattr(self.get_queryset(), name)(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 439, in get raise self.model.MultipleObjectsReturned( viewer.models.ScoreDescription.MultipleObjectsReturned: get() returned more than one ScoreDescription -- it returned 2!
/viewer/upload_task/8ac187e7-dca0-4a9f-b06e-8921edf6f5cc/
The SDFs I used are in this archive: A71EV2A_fakeRHS_1524_test.zip
@mwinokan what is the expected behaviour?
@kaliif In this case it's a new conformer for an existing compound so it should get assigned something like v1b
. If needed I can get back to you with a more precise specification of the assigned IDs
@mwinokan ok, that part does actually happen, v1b
is being created. The problem occurs later, when it's saving properties, NumericalScoreValues
and TextScoreValues
and descriptions for these, which are stored in the third object type - ScoreDescription
. The error you're seeing is that the backend has created two ScoreDescription
instances for a property (<method>
in this case) and doesn't know which one to use. This is the sate at the moment of the crash:
frag=# select * from viewer_scoredescription;
id | name | description | computed_set_id
----+-----------------------+-------------------------------------+-----------------------------------
1 | ref_url | https://hippo.winokan.com | A71EV2A_lhs2rhs_2024-2024-10-24-A
2 | submitter_name | Max Winokan | A71EV2A_lhs2rhs_2024-2024-10-24-A
3 | submitter_email | max.winokan@diamond.ac.uk | A71EV2A_lhs2rhs_2024-2024-10-24-A
4 | submitter_institution | DLS | A71EV2A_lhs2rhs_2024-2024-10-24-A
5 | generation_date | 2023-08-21 | A71EV2A_lhs2rhs_2024-2024-10-24-A
6 | method | A71EV2A_lhs2rhs_20241010 | A71EV2A_lhs2rhs_2024-2024-10-24-A
7 | ID | ver_1.2 | A71EV2A_lhs2rhs_2024-2024-10-24-A
8 | method | A71EV2A_lhs2rhs_20241010_first3_mod | A71EV2A_lhs2rhs_2024-2024-10-24-A
Looking at the code, I'm not sure what would be the correct course of action here:
description
field as significant and consider instances 6 and 8 as distinct objects orname
to be unique and prevent creating new objects with same name
?Thanks @kaliif for explaining how the error occurs. It's not instantly clear what the most elegant solution is, so let me ask some follow up questions:
Correct me if I'm wrong, but if I had uploaded this second SDF with the method
value as A71EV2A_lhs2rhs_20241010
(i.e. the same as the first SDF) it would not have caused this issue?
If you implemented the second option; would this then throw an error if a subsequent upload references the same molecule as uploaded previously but with a different method
(or other name
) value?
It definitely seems desirable to me to have two uploads of the same molecules with different metadata to be supported, and for those different origins to be documented. This would rule out your second option, right?
Would your first option mean that the API serves an iterable of description
values for each name
key of a given molecule/conformer?
It seems to me that this issue needs more careful thought, spec, and both b/e and f/e work, in the meantime would it be easy to instead provide a warning and use the description
values from the first upload? The warning could say something like:
Warning: using existing "method" value of "A71EV2A_lhs2rhs_20241010" instead of "A71EV2A_lhs2rhs_20241010_first3_mod" for existing compound "v1"
PS (more of a note to self):
The motivation for specifying a different method was to clearly distinguish the two compound sets as viewed in the f/e, as sets with the same method
will have very similar looking computed_set_id
values. (That's really a problem to fix on the f/e as the text is often clipped so you only see the first part)
- Correct me if I'm wrong, but if I had uploaded this second SDF with the
method
value asA71EV2A_lhs2rhs_20241010
(i.e. the same as the first SDF) it would not have caused this issue?
@mwinokan correct, the code that creates ScoreDesciptions
uses get_or_create
method, so if the object with the given parameters exist, the new one won't be created, and the subsequent lookup would not stumble upon multiple objects.
- If you implemented the second option; would this then throw an error if a subsequent upload references the same molecule as uploaded previously but with a different
method
(or othername
) value?
No, it would be the same get_or_create
method as above, if exists, no need to create a new one.
- It definitely seems desirable to me to have two uploads of the same molecules with different metadata to be supported, and for those different origins to be documented. This would rule out your second option, right?
Yes. That's really what it boils down to - is the description
field significant or not.
- Would your first option mean that the API serves an iterable of
description
values for eachname
key of a given molecule/conformer?
There's an endpoint api/compound-scores/
which retrieves the contents of ScoreDescription
table, and api/numerical-scores/
and api/text-scores/
which return the individual score values. The linked ScoreDesciption
value is nested under the score object, so while not 100% sure, it seems nothing needs to change in the FE in that respect.
It seems to me that this issue needs more careful thought, spec, and both b/e and f/e work, in the meantime would it be easy to instead provide a warning and use the
description
values from the first upload? The warning could say something like:Warning: using existing "method" value of "A71EV2A_lhs2rhs_20241010" instead of "A71EV2A_lhs2rhs_20241010_first3_mod" for existing compound "v1"
Not for me to say which is the correct solution, but I'll just mention that applying either solution 1 or 2 is waaay quicker than issuing a warning, currently there's no mechanism to send messages upon successful completion, and adding this feature would require quite a bit of work making the Django view and Celery task play together nicely. Doable, but it's not a few lines of code.
Thanks for explaining @kaliif. Let's go for your option 1:
treat description field as significant and consider instances 6 and 8 as distinct objects
Could this exception have been encountered before the work in you did in this ticket? I'm asking because if the bug is already present in production then I don't mind pushing the current staging version to production, and I'll just have to remember to be more careful with molecule duplicates
@mwinokan this code is certainly in production now so it's possible someone has seen it (and chosen to ignore the error message because I don't believe this has been reported before). The result then is that some molecules, certain scores and descriptors, really anything in the file after the point of failure, would not have been added to the database. And no, the already added objects would not have been cleaned up, this process doesn't run in a transaction block
It would be really good idea to reversible transaction block but there's lots of code and multiple celery tasks involved so it's probably several days worth of work. Some later release maybe?
@mwinokan something else I noticed - uploading the second sdf overwrites the one uploaded before, so when you're trying to download later, you'll only get the file with 2 compounds not 5. Is that a problem?
@mwinokan to summarise: @kaliif asks whether RHS uploads should overwrite the old stuff, so it vanishes forever.
@phraenquex wonders whether it's really a F/E thing: whether what is needed is that uploader should be able to hide things that are superceded.
@mwinokan to finalise the spec.
@kaliif asks how multiple SDF uploads referring to the same compound sets should be handled:
@mwinokan says that the f/e should offer:
This will be resolved once #1432 is fixed.
@kaliif did any changes make it into production in the last release?
There seems to be a different but also incorrect behaviour now where compounds of clearly different chemical structure are assigned the same ID:
@mwinokan no, this is not in prod yet. Looks like a separate issue.
@mwinokan to test and isolate the new issue observed with FatA on production
@kaliif the problem can be recreated with the A71EV2A target and the A71EV2A_lhs2rhs_20241010_capitalprefix.sdf
RHS example data in the shared folder. All compounds given the ID V1A:
Currently for CHIKV_Mac these similar molecules are assigned four different compound ID's
OC[C@@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl
: v4aOC[C@H]1CCC[C@@H]1Nc1ccc2[nH]cnc2c1Cl
: v5aOC[C@@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl
: v6aOC[C@H]1CCC[C@H]1Nc1ccc2[nH]cnc2c1Cl
: v7aApart from 3d conformer these should be considered the same "soaked" molecule because the spec we discussed in #1394 would use the flattened inchikey which is the same for all four molecules (
WHLDZOYDHAHKLT-UHFFFAOYSA-N
).@kaliif please could you investigate how the four compounds from this C1965_poses_example.sdf.zip are all assigned different compound ID's?