mwinokan opened 1 year ago
@matteoferla on slack:
The current RHS uploader causes issues with data preservation: data gets lost for the following reasons.
Nevertheless, the issue is that if we keep information in the LHS of Fragalysis or in a record somewhere, once a compound progresses this information is not linked (e.g. the fact that x2135_0A is ASAP-0013759 and came from the OE ROCS method is lost, and would be inferable only if someone asked or scraped all uploaded Fragalysis compound-datasets).
@phraenquex on slack
Page in discussion: https://fragalysis.diamond.ac.uk/viewer/upload_cset/
Spec version 1.2 documentation: https://discuss.postera.ai/t/providing-computed-poses-for-others-to-look-at/1155
This is the code I use to prep for upload: https://github.com/matteoferla/Fragment-hit-follow-up-chemistry/blob/main/followup/prep_fragalysis.py
Version 1.2 requires a header molecule. This is annoying, so it should be replaced by input forms, keeping the original fields (or making some optional):
`ref_url` is not being used as originally intended: it is currently generally left blank or set to a GitHub repo. Of the design squad, only I posted to https://discourse.xchem.diamond.ac.uk/. However, this is due to the visibility of the link and nothing else: the field should be kept and the visibility addressed in the LHS.
All these are text-type inputs.
A really useful feature would be a log-in button instead of a Forbidden page. (At the moment one needs to go to the main page, log in, and go back to the cset upload page via the URL bar.)
Controversially, a kill list: a text-type input that accepts comma/space-separated entry names that should be ignored. Cf. next section.
A common case is when the whole dataset has the same `ref_pdb` and `ref_mol`. An optional field for each, dealt with in the backend, would cover this.
The backend could apply the following fixes:
Fill in `ref_mol` and `ref_pdb` if absent.

Invalid reference hits: these may be:
Fixing 1-3 in the backend is dead easy. Case 4 is an issue.
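A minimal sketch of the "fill in if absent" fix, assuming the defaults come from the hypothetical optional form fields described above (field names follow this thread, not necessarily the exact spec):

```python
from rdkit import Chem

def fill_missing_refs(sdf_in: str, sdf_out: str,
                      default_ref_mol: str, default_ref_pdb: str) -> None:
    """Fill absent ref_mol/ref_pdb fields with dataset-wide defaults."""
    writer = Chem.SDWriter(sdf_out)
    for mol in Chem.SDMolSupplier(sdf_in):
        if mol is None:  # skip unparsable entries rather than crash
            continue
        if not mol.HasProp('ref_mol'):
            mol.SetProp('ref_mol', default_ref_mol)
        if not mol.HasProp('ref_pdb'):
            mol.SetProp('ref_pdb', default_ref_pdb)
        writer.write(mol)
    writer.close()
```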
A. Tolerate foreign reference hits and fix the frontend for it
B. Help the user kill foreign reference hits
Regarding B: removing foreign reference hits blindly is what everyone does. This requires iterative copy-pasting from the error table (a faff to the power of faff). Ideally, on an error due to bad reference hit names, all entries should be checked, tallied, and shown: how common is this bad reference (i.e. is the whole dataset wrong, or only a few entries)? Once the user decides it is the latter, they could be given a kill list that can be copy-pasted into the aforementioned kill-list input box (see the sketch below). This is safer than a checkbox saying 'ignore all invalid entries', as the dataset might be wholly wrong.
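A sketch of what that tally-then-suggest flow could look like server-side; the entry shape and names here are hypothetical:

```python
from collections import Counter

def suggest_kill_list(entries: list, valid_hits: set):
    """Tally unknown reference hits and, if only a minority of entries
    are affected, return a comma-separated kill list to copy-paste.

    `entries` is assumed to be [{'name': ..., 'ref_mol': ...}, ...]
    (a hypothetical shape); `valid_hits` is the target's known hit names.
    """
    tally = Counter(e['ref_mol'] for e in entries
                    if e['ref_mol'] not in valid_hits)
    n_bad = sum(tally.values())
    if n_bad == 0 or n_bad > len(entries) / 2:
        # nothing to kill, or the dataset is probably wholly wrong
        return tally, None
    kill_list = ','.join(e['name'] for e in entries
                         if e['ref_mol'] not in valid_hits)
    return tally, kill_list
```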
I should say that a lot of this is currently done on the user side (cf. my function to fix it).
@phraenquex asked for an ontology to codify design rationales.
This is useful but hard to encode properly, maintain, and get users to adopt. The simplest case: a text-type input box for comma-separated values, defined in a tooltip behind a (?) icon, with some JS validation.
This would have been really useful for the Moonshot metadata analysis: I spent ages trying to convert words into a category. Possibly relevant to the discussion with Conor about making data more accessible and useful to the community.
For the asap.postera.ai upload I started fleshing out a restricted keyword system to codify design rationales; a prelude to an ontology. Namely in https://github.com/matteoferla/EV-D68-3C-protease/blob/main/XChem2Postera-corrections/xchem2asap.ipynb
This is simply a comma-separated string with a subset of the values `de-novo`, `AI`, `analog-by-catalog`, `catalog-enumeration`, `substructure-based`, `shape&color`, `by-eye`, `linker`, `merger`, `Fragmenstein`, `ROCS`, `STRIFE`, `SILVR`, `Arthor`, `fragnet`. This could work on a first pass as an input field that gets validated on the frontend against an array passed by the backend. However, maintaining an editable controlled vocabulary is hard (cf. tags in Slack). Were some tags an ontology, i.e. `fragnet` means `fragmentknit`, which in turn means `catalog-enumeration,linker,substructure-based`, then choosing said tags would be easy. But say someone comes along, uses `autogrow4`, and wants the tag `genetic-algorithm`: this would cause issues.
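As a sketch (the vocabulary and expansions below are illustrative, not an agreed ontology), validation with tag expansion could look like:

```python
# Hypothetical vocabulary and ontology mapping, for illustration only
VOCAB = {'de-novo', 'AI', 'analog-by-catalog', 'catalog-enumeration',
         'substructure-based', 'shape&color', 'by-eye', 'linker', 'merger',
         'Fragmenstein', 'ROCS', 'STRIFE', 'SILVR', 'Arthor', 'fragnet'}

# A tag may expand to broader tags (fragnet -> fragmentknit -> three terms)
EXPANSIONS = {
    'fragnet': ['fragmentknit'],
    'fragmentknit': ['catalog-enumeration', 'linker', 'substructure-based'],
}

def validate_rationale(value: str):
    """Split a comma-separated rationale string, expand ontology tags,
    and return (known, unknown) tag sets."""
    tags = {t.strip() for t in value.split(',') if t.strip()}
    frontier = list(tags)
    while frontier:  # transitive expansion of ontology tags
        for parent in EXPANSIONS.get(frontier.pop(), []):
            if parent not in tags:
                tags.add(parent)
                frontier.append(parent)
    known = {t for t in tags if t in VOCAB or t in EXPANSIONS}
    return known, tags - known
```

Here `validate_rationale('fragnet')` returns all five expanded tags as known, while `validate_rationale('genetic-algorithm')` flags the tag as unknown, which is exactly the `autogrow4` problem.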
Let's revisit this later as the plan is still changing.
An addendum: it would be highly constructive if, on cset dataset upload, a new post were created in Discourse, with comments added when the dataset is modified or deleted. That way there would be a definite place for discussion of an RHS dataset.
Thanks @matteoferla. Yes, let's do the Discourse thing anyway, but make the code generalisable, for when we adopt some other discussion-thread engine.
Ontology - if it's not controversial, I suggest doing something and then see how it goes. I'm aware it's difficult to construct forward-compatible ontologies, but I'm pretty sure we should at least be trying, now. @mwinokan and @Waztom, can you put your heads to this too.
From a conversation with @Waztom today:
What remains is the upload UI and the metadata / design-rationale ontology. We need to decide whether the UI needs to be handled by the F/E rather than the B/E (Django) in order to provide more functionality for adding metadata without the need for a header molecule.
Another one: the compulsory Original SMILES field crashed with a 254-character value. Name allows no spaces and has a 50-character maximum; I am not sure whether the maximum here is the same.
@mwinokan: correction regarding text as values. It does not break the sorter (NaN values do) or subsequent values. And while words longer than 256 characters break the upload, values containing spaces work but are shown as the word before the space.
Code used for testing: https://gist.github.com/matteoferla/45b98a4de5849ff1f413f5f14e5a4292 Which I tried on https://fragalysis.diamond.ac.uk/viewer/react/preview/target/NUDT7A/tas/lb18145-1
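For reference, a reconstruction (not the gist itself) of the kind of edge cases being tested: a long word, a value with a space, and a NaN:

```python
from rdkit import Chem

# Edge-case property values of the kind discussed above
cases = {
    'long_word': 'x' * 300,        # >256 chars: breaks the upload
    'spaced': 'before after',      # shown as the word before the space
    'not_a_number': float('nan'),  # breaks the sorter
}
writer = Chem.SDWriter('edge_cases.sdf')
for name, value in cases.items():
    mol = Chem.MolFromSmiles('CCO')
    mol.SetProp('_Name', name)
    mol.SetProp('test_field', str(value))
    writer.write(mol)
writer.close()
```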
If we are going to keep using Discourse, we need to revisit the implementation and make sure that it stays in sync with the backend DB, because that's a big issue and the reason why I started to ignore it on dev stacks.
Might be dependent on #1222
Copied from this slack thread:
The idea behind the "first molecule" line was mine; it's to make the SDF file self-documenting: it puts the schema in the same file as the data. That allows the person who generates each SDF file to adapt the schema as they require.
This does make the SDF "non-standard", to the extent that a standard exists: all rows are meant to be uniformly structured. But of course, since no standard exists (i.e. no schema with which to validate), only some conventions (as far as I can tell), one can in principle do what one wants, and it's the parser that imposes the schema.
If there is a better way to define a schema, and ensure that it is never separated from the data, then by all means, let's use that. Maybe there are things one can do with the end of the file? Or a very specific comment format? (Certainly these SDF files will break many SDF readers, as far as I remember what @tdudgeon told me back in 2020.)
I'm also conscious that formats like XML have defined schema languages - I last worked with this 20 years ago, so I don't know what's state of the art, but at the time, xml-schema seemed powerful but was faffy to write and validate. I don't know if something equivalent exists for SDF, but if it does, we could consider using it.
But what we really need is a robust validator tool. "robust" = "informative error reporting". That could be part of the upload, or a separate service we provide (I'm guessing the former is simpler)
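A minimal sketch of such a validator, assuming the v1.2 convention that the first molecule declares the schema (the header-only field names are taken from later in this thread):

```python
from rdkit import Chem

# Fields that live on the header molecule only (per the discussion below)
HEADER_ONLY = {'ref_url', 'submitter_name', 'submitter_email',
               'submitter_institution', 'generation_date', 'method'}

def validate_upload(sdf_path: str) -> list:
    """Collect every problem in one pass, rather than dying on the first."""
    mols = list(Chem.SDMolSupplier(sdf_path))
    if not mols or mols[0] is None:
        return ['file has no parsable header molecule']
    # the schema is whatever the header declares, minus header-only metadata
    schema = set(mols[0].GetPropNames()) - HEADER_ONLY
    errors = []
    for i, mol in enumerate(mols[1:], start=2):
        if mol is None:
            errors.append(f'entry {i}: molecule block failed to parse')
            continue
        missing = schema - set(mol.GetPropNames())
        if missing:
            errors.append(f'entry {i}: missing field(s) {sorted(missing)}')
    return errors
```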
> all rows are meant to be uniformly structured
Only for specific readers: some, like `rdkit.Chem.PandasTools`, will simply complain, whereas most will not notice it.
This is an issue for an SDF generated directly from a pandas dataframe (RDKit PandasTools), where you end up with every data molecule carrying every metadata column (typically as empty or 'nan' values). But if the SDF is made with RDKit SDWriter or a similar tool, wherein the first molecule is added and then the rest, this does not happen:
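For illustration (the names and values here are made up, not the exact spec): SDWriter serialises only the properties actually set on each molecule, so the metadata never bleeds into the data molecules.

```python
from rdkit import Chem

writer = Chem.SDWriter('upload.sdf')

# Header molecule: the only entry carrying the submission metadata
header = Chem.MolFromSmiles('C')
header.SetProp('_Name', 'header')              # illustrative name
header.SetProp('submitter_name', 'A. Person')  # illustrative values
header.SetProp('method', 'OE ROCS')
writer.write(header)

# Data molecule: only its own fields, no submission metadata
mol = Chem.MolFromSmiles('CCO')
mol.SetProp('_Name', 'compound-1')
mol.SetProp('ref_mol', 'x2135_0A')
writer.write(mol)
writer.close()
```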
I have shared code to make Fragalysis uploads look okay several times, but I have failed to get it adopted.
If v1.2 were maintained, then a solution would simply be for the uploader to disregard compulsory metadata intended for the header molecule when parsing the data molecules.
The requirement that the metadata travel within the file is a tricky problem, so the v1.2 solution is far from the worst solution.
The SDF format does not allow interspersed bash-style comments, but entries are often read sequentially with some leeway for bad entries, so adding # comments at the tail will work in some readers, even if they are not read. However, the header-molecule solution is more elegant than that.
XML, or its modern simplification YAML, would cause more issues than it's worth, and would still need fixing.
But is the metadata-containing-file requirement a hard requirement for upload, or for data storage within the Fragalysis backend? If only the latter, then making the uploader fix it would circumvent the discussion...
@matteoferla can you summarise whether there is therefore a problem that needs addressing in the updated format? It sounds like there isn't - which is a relief.
@phraenquex: under the assumption that the premise is valid that the user needs to have metadata in a file for their own data integrity (^1):
The bare-minimum easy win is the B/E filtering out of `ref_url`, `submitter_name`, `submitter_email`, `submitter_institution`, `generation_date`, and `method` in non-header entries in the SDF. This would remove the `rdkit.Chem.PandasTools` issue.
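A sketch of that filter (field names as listed above):

```python
from rdkit import Chem

# Header-only fields the B/E could strip from data molecules
HEADER_ONLY = ('ref_url', 'submitter_name', 'submitter_email',
               'submitter_institution', 'generation_date', 'method')

def strip_header_fields(sdf_in: str, sdf_out: str) -> None:
    """Drop header-only metadata from every molecule after the first."""
    supplier = Chem.SDMolSupplier(sdf_in)
    writer = Chem.SDWriter(sdf_out)
    for i, mol in enumerate(supplier):
        if mol is None:
            continue
        if i > 0:  # leave the header molecule untouched
            for field in HEADER_ONLY:
                if mol.HasProp(field):
                    mol.ClearProp(field)
        writer.write(mol)
    writer.close()
```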
However, the bugs in the B/E uploader discussed earlier would still need to be addressed at some point...
^1 As said, I don't follow the logic, as opposed to having the metadata inputted into the F/E as form fields, akin to asap.postera.ai for example.
Usually, Bad Things Happen when the data spec doesn't travel with the data. As far as I can tell. So it does need to be a file.
At the time we did the spec, it felt easier to have spec and data in one SDF file. But of course, they come in a zip archive anyway, with all the `ref_pdb`s; so if it's easier to have two files, one with data and one with spec, and have them both in the zip, then I'm fine with that too.
I suggest @matteoferla and @mwinokan brainstorm this when they're back, and finalise the spec either way. As long as this condition is satisfied. (We auto-document the downloads too, with a PDF file.)
Original slack thread: https://asapdiscovery.slack.com/archives/C05CYLPSD5L/p1697710819367459
More details in replies below: