Rework RHS Data Upload - Githubissues

mwinokan commented 1 year ago

Original slack thread: https://asapdiscovery.slack.com/archives/C05CYLPSD5L/p1697710819367459

More details in replies below:

mwinokan commented 1 year ago

@matteoferla on slack:

In terms of the current RHS uploader, causing issues with data preservation has the following issues. Data gets lost due to the following reasons.

Fragalysis accepts reference molecules, but need to be present in the LHS (e.g. PDB hits need removal)
Presence of fields mentioned in the header that are not numerical (one single nan/null will kill it) make the whole table in Fragalysis not load
(The compulsory fields for the header are not numerical, but don't and should not feature in the other molecules or will mess the displayed table)
The virtual hit name has a limit of 50 characters, so it's not viable to keep info there

Nevertheless, the issue is that if we keep information in the LHS of Fragalysis or in a record somewhere, once it progresses this information is not linked (e.g. that x2135_0A is ASAP-0013759 and was from the OE ROCS method is lost and would be inferable only if someone asked or scraped all uploaded Fragalysis compound-datasets)

mwinokan commented 1 year ago

@phraenquex on slack

review the upload format, possibly completely
include ontology to codify design rationales
include frontend tools to usefully access design rationales

matteoferla commented 1 year ago

Page in discussion: https://fragalysis.diamond.ac.uk/viewer/upload_cset/ Spec version 1.2 documentation: https://discuss.postera.ai/t/providing-computed-poses-for-others-to-look-at/1155 This is the code I use to prep for upload: https://github.com/matteoferla/Fragment-hit-follow-up-chemistry/blob/main/followup/prep_fragalysis.py

Version 1.2 requires a header molecules. This is annoying, so should be replaced by input forms, keeping the original fields (or making some optional):

ref_url - the url to the discourse post that describes the work
submitter_name - the name of the person submitting the compounds
submitter_email - the email address of the submitter
submitter_institution - the submitters institution
generation_date - the date that the file was generated in format yyyy-mm-dd
method - a name for the method used to generate the compound poses

ref_url is not being used as originally intended as it currently generally is left blank or set to a GitHub repo. Only me of the design squad posted to https://discourse.xchem.diamond.ac.uk/ . However, this is because of visibility of the link not of anything else: it should be kept and the latter addressed in the LHS.

Frontend Uploader

All these are text-type inputs.

A really useful feature would be to have a log-in button as opposed to a forbidden. (ATM one needs to go to main page, log in and go back to cset upload by URL bar)

Controversially, a kill list: a text-type input that accept comma/space-separated that need to be ignored. Cf. next section.

Common info

A common case is when the whole dataset has the same ref_pdb and ref_mol. An optional field for both that is dealt with in the backend.

Frontend Target LHS

The LHS on a target needs to not be half dead on alphanumerics —will elaborate later. NaNs should be fixed backend —unless the table framework has a flag to accept NaNs, could someone check??
The metadata could be shown above table

Backend

The backend could the following fixes:

Not say Upload Failed when it actually worked (i.e. no error table)
cut names to 50 characters
be case insensitive with reference hit names (Fragalysis shows them in uppercase)
Fix NaN and None values
Amend common entry ref_mol and ref_pdb if absent

Invalid reference hits:

These may be:

Target name affixed hits. PROTEIN-x1234_A0 is the hit name in downloads, but not for Fragalysis
Ligands with subscript: x1234_A0.synthon1.
Compound not crystal, e.g. x1234 not x1234_A0
Foreign ligands, eg. from the PDB and absent in Fragalysis, entries that are from PanDDA but not in Fragalysis

Fixing 1-3 in the backend is dead easy. Case 4 is an issue.

A. Tolerate foreign reference hits and fix the frontend for it B. Help user kill foreign reference hits

Regarding B. Removing foreign reference hits blindly is what everyone does. This requires copy pasting iteratively from the error table (a faff to the power of faff). Ideally, on error due to bad reference hit names all should be checked and tallied and shown: How common is this bad reference (ie. whole dataset is wrong or only a few). Once the user decides it is the latter, were they given a kill list that can be copy pasted into the aforementioned kill list input box. This is safer than a checkbox that says ignore all invalid entries as the dataset might be wholly wrong.

I should say that a lot of this is currently done at the user side (cf. my function to fix it)

Ontology

@phraenquex asked for ontology to codify design rationales.

This is useful but hard to encode properly, maintain & spur user adoption. Simplest case text-type input box for comma separated values defined in a tooltip of a (?) icon with some JS validation.

This would have really useful for the moonshot metadata analysis: I spent ages trying to covert words into a category. Possibly relevant to discussion with Conor to make data more accessible and useful to community.

For the asap.postera.ai upload I started fleshing a system for a restricted keyword system to codify design rationales. A prelude to an ontology. Namely in https://github.com/matteoferla/EV-D68-3C-protease/blob/main/XChem2Postera-corrections/xchem2asap.ipynb This is simply a comma separated string with a subset of the values de-novo,AI,analog-by-catalog,catalog-enumeration,substructure-based,shape&color,by-eye,catalog-enumeration,de-novo,linker,merger,analog-by-catalog,Fragmenstein,ROCS,STRIFE,SILVR,Arthor,fragnet. This could work on first pass as a input field that gets validated on the frontend based on an array to passed by the backend. However, controlling an editable controlled vocabulary is hard (cf. tags in Slack). Were some tags an ontology, ie. fragnet means fragmentknit which means catalog-enumeration,linker,substructure-based then chosing of said tags would be easy. But say someone comes along and uses autogrow4 and wants the tag genetic-algorithm then this would cause issues.

Postera integration

Let's revisit this later as the plan is still changing.

matteoferla commented 1 year ago

An addendum. It would be highly constructive if on cset dataset upload a new post was created in Discourse, with comments added if modified or deleted. That way there will be a definite place for discussion on a RHS dataset.

phraenquex commented 1 year ago

Thanks @matteoferla. Yes, let's do the Discourse thing anyway - but make it a generalisable code, for when we adopt some other discussion thread engine.

Ontology - if it's not controversial, I suggest doing something and then see how it goes. I'm aware it's difficult to construct forward-compatible ontologies, but I'm pretty sure we should at least be trying, now. @mwinokan and @Waztom, can you put your heads to this too.

mwinokan commented 1 year ago

From a conversation with @Waztom today:

Inspirations must be present in the LHS
- This is a rigid part of the spec, and we should not be supporting references that are not in Fragalysis. So either need to upload PDB hits to the left or include the external inspiration in the metadata somewhere
The name character limit should be addressed in the auto-compound-naming ticket #1121

What remains is the upload UI and metadata/ design rationale ontology. We need to decide whether the UI needs to be handled by the f/e rather than b/e (Django) in order to provide more functionality to add metadata without the need for a header molecule

matteoferla commented 1 year ago

Another one. Compulsory Original SMILES crashed with a value of 254. Name is no space and 50 max. I am not sure this is the same max

matteoferla commented 1 year ago

@mwinokan — Correction regarding text as values. It does not break the sorter (nan values do) or subsequent values. While words longer than 256 break the uploads, they work and as shown as the word before the space

Code used for testing: https://gist.github.com/matteoferla/45b98a4de5849ff1f413f5f14e5a4292 Which I tried on https://fragalysis.diamond.ac.uk/viewer/react/preview/target/NUDT7A/tas/lb18145-1

boriskovar-m2ms commented 1 year ago

If going to keep using Discourse we need to revisit the implementation and make sure that it's in sync with backend DB because that's a big issue and the reason why I started to ignore it on dev stacks.

phraenquex commented 1 year ago

Might be dependent in #1222

phraenquex commented 11 months ago

Copied from this slack thead:

The idea behind the "first molecule" line was mine, it's to make the SDF file is self-documenting - it puts the schema in the same file as the data. That allows the person who generates each SDF file to adapt the schema as they require.

This does make the SDF "non-standard", to the extent that a standard exists: all rows are meant to be uniformly structured. But of course, since no standard exists (i.e. no schema with which to validate), only some conventions (as far as I can tell), one can in principle do what one wants, and it's the parser that imposes the schema.

If there is a better way to define a schema, and ensure that it is never separated from the data, then by all means, let's use that. Maybe there are things one can do with the end of the file? Or a very specific comment format? (Certainly these SDF files will break many SDF readers, as far as I remember what @tdudgeon told me back in 2020.)

I'm also conscious that formats like XML have defined schema languages - I last worked with this 20 years ago, so I don't know what's state of the art, but at the time, xml-schema seemed powerful but was faffy to write and validate. I don't know if something equivalent exists for SDF, but if it does, we could consider using it.

But what we really need is a robust validator tool. "robust" = "informative error reporting". That could be part of the upload, or a separate service we provide (I'm guessing the former is simpler)

matteoferla commented 11 months ago

all rows are meant to be uniformly structured

Only for specific readers, and some like rdkit.Chem.PandasTools will simply complain, whereas most will not notice it. This is an issue for SDF generated directly from a pandas dataframe (RDKit PandasTools), where you end up with: But if the SDF is made with RDKit SDWriter or similar tool, wherein the first molecule is added and then the rest this does not happen: I have shared several times code to make Fragalysis uploads look okay, but I failed in getting it adopted.

If v1.2 were maintained, then a solution would be simply that the uploader disregards compulsory metadata intended for the header molecule when parsing the data molecules.

The in metadata-containing file requirement is a tricky problem, so the v1.2 solution is not the worst solution by a lot.

SDF format does not allow interspersed bash-style comments, but the entries are often read sequentially with some leeway for bad entries, so adding # comments at the tail will work in some readers even if they are not read. However, the header molecules solution is more elegant than that.

XML, or it's modern simplification, YAML, would cause more issues than its worth and it would need to be fixed.

But is the metadata-containing file requirement a hard requirement for upload or for data storage within Fragalysis backend? If the latter only, then making the uploader fix it would circumvent the discussion...

phraenquex commented 11 months ago

@matteoferla can you summarise whether there is therefore a problem that needs addressing in the updated format? It sounds like there isn't - which is a relief.

matteoferla commented 11 months ago

@phraenquex —Under the assumption that the premise that the user needs to have metadata in a file for their own data integrity is valid (^1), The bare minimum easy win is B/E filtering out of ref_url, submitter_name, submitter_email, submitter_institution``generation_date, method in non-header entries in the SDF. This would remove the issue of RDKit.Chem.PandasTools.

However, the bugs in the B/E uploader discussed earlier would still need to be addressed at some point...

^1 As said, I don't follow the logic as opposed to having the F/E be inputted the metadata as fields akin to asap.postera.ai for example

phraenquex commented 11 months ago

Usually, Bad Things Happen when the data spec doesn't travel with the data. As far as I can tell. So it does need to be a file.

At the time we did the spec, it felt easier to have spec and data in one SDF file. But of course, they come into a zip archive anway, with all the ref_pdbs - so if it's easier to have two files, one with data and one with spec, and have them both in the zip, then I'm fine with that too.

I suggest @matteoferla and @mwinokan brainstorm this when they're back, and finalise the spec either way. As long as this condition is satisfied. (We auto-document the downloads too, with a PDF file.)

m2ms / fragalysis-frontend

Rework RHS Data Upload #1192

Frontend Uploader

Common info

Frontend Target LHS

Backend

Ontology

Postera integration