Fix target upload (epic 3)

phraenquex commented 3 years ago

Ticket for frontend work: #540

This ticket currently covers mainly backend work (@duncanpeacock), including:

Things to fix

Handling very large files
Versioning & release
- for review: data staging instance
- for data consistency: support updates (instead of full re-uploads)
ownership of data/uploads

duncanpeacock commented 3 years ago

Tyler and I had a meeting on Thursday on this. Attached is an analysis with a solution outline and some estimates: https://docs.google.com/document/d/1T5RV4TzzwShdR5wNMXe6Nx2gdS-HaVLE4EXxCufnKzw/edit?usp=sharing

phraenquex commented 3 years ago

For "deleting" structures, 3 main actions:

The metadata file needs a new column that communicates either deletedness or supersededness.
The API needs to discover all snapshots and discourse posts with the culprit, and add a "contains dodgy structures" flag.
The frontend needs a "terminal" / "messaging" functionality where it can prominently display the existence of culprits. (Different messages for "deleted" and "superseded", but it always needs to be communicated.)
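
A minimal sketch of the flagging step above, assuming the new metadata column holds one of three status strings and that each snapshot lists the protein codes it references. The status values, the `Snapshot` class and `flag_snapshots` are hypothetical names for illustration, not part of the existing API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical values for the new metadata column.
ACTIVE, DELETED, SUPERSEDED = "active", "deleted", "superseded"


@dataclass
class Snapshot:
    name: str
    structures: List[str]
    warnings: List[str] = field(default_factory=list)


def flag_snapshots(status_by_structure: Dict[str, str],
                   snapshots: List[Snapshot]) -> List[Snapshot]:
    """Attach a warning to each snapshot that references a deleted or
    superseded structure, and return only the flagged snapshots."""
    for snap in snapshots:
        for code in snap.structures:
            # Structures absent from the metadata are treated as active.
            status = status_by_structure.get(code, ACTIVE)
            if status != ACTIVE:
                snap.warnings.append(f"{code}: {status}")
    return [s for s in snapshots if s.warnings]
```

The frontend could then pick a different message depending on whether a warning ends in "deleted" or "superseded".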

duncanpeacock commented 2 years ago

The solution design document has been updated with the delete processing: https://docs.google.com/document/d/1T5RV4TzzwShdR5wNMXe6Nx2gdS-HaVLE4EXxCufnKzw/edit?usp=sharing

phraenquex commented 2 years ago

@duncanpeacock - if you haven't yet, also spec the mechanism for communicating errors back to the uploader.

A specific error: dataset ID not unique. (That's the "X0001" or "P0001" number.)
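
As an illustration of that specific check, a small sketch that finds repeated dataset IDs in an upload and turns them into messages for the uploader. The function names and the message wording are assumptions; how the messages get back to the uploader is still to be specified.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_ids(dataset_ids: Iterable[str]) -> List[str]:
    """Return dataset IDs that occur more than once, in first-seen order."""
    counts = Counter(dataset_ids)
    return [dataset_id for dataset_id in counts if counts[dataset_id] > 1]


def upload_errors(dataset_ids: Iterable[str]) -> List[str]:
    """Build one human-readable error per non-unique dataset ID."""
    return [f"dataset ID not unique: {d}" for d in find_duplicate_ids(dataset_ids)]
```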

duncanpeacock commented 2 years ago

From #673

Include Crystallographic files

Currently the upload process only stores files from the aligned directory in the database. The download process as currently designed only picks files from these fields, following the design decision to keep the process as simple as possible.

This will have to be modified to properly store the files from the crystallographic folder in the database as part of the target upload process. These could then be accessed as part of the download in a similar way to the current aligned files. Unlike the aligned folder, we want to make this flexible so that new file types can be uploaded in the target loader without code changes.

Crystallographic mapping:

Aligned - Crystallographic
Mpro-x0072_0A - Mpro-x0072
Mpro-x0072_1A - Mpro-x0072
Mpro-x0104_0A - Mpro-x0104

So the files in the Crystallographic folder can be accessed using the base Crystal name (stem of the protein code without the _0A, _1A etc)
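
The mapping above could be sketched as follows. This assumes the aligned suffix is always an underscore, one or more digits and a single letter (as in `_0A`, `_1A`); the regular expression and the `crystal_name` helper are illustrative only.

```python
import re

# Assumed suffix shape: "_<digits><letter>" at the end of the protein code.
_ALIGNED_SUFFIX = re.compile(r"_\d+[A-Za-z]$")


def crystal_name(protein_code: str) -> str:
    """Strip the aligned suffix to get the base crystal name.

    Codes without a matching suffix are returned unchanged.
    """
    return _ALIGNED_SUFFIX.sub("", protein_code)
```

With this, `Mpro-x0072_0A` and `Mpro-x0072_1A` both resolve to `Mpro-x0072`.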

We would add a new Crystallographic table with an array of links to an associated files table that contains a mapping to indicate the file template to identify the file in the crystallographic folder.

Name of Crystal, Many2Many (id, Target, File (FileField), FileTypeMapping)

File-FileTypeMapping Mpro-x0072, Mpro-x0072.pdb, PDB

When the target loader runs, it will load the Crystallographic and files tables. If a protein code is marked as changed in the metadata, then both the associated aligned AND crystallographic files are updated.

In the download structures window, when the crystallographic structures are selected, the API will extract the crystal name from the protein codes provided and supply the desired files from the database - adding them to the crystallographic folder in a sub-directory labelled with the source protein (e.g. Mpro-x0072).

The new fields will be:

PDB - format: {source protein}.pdb
MTZ - format: {source protein}.TBC
Event MTZ - format: {source protein}.TBC
Raw CCP4 map files - format: {source protein}.TBC

And one other point:

Raise errors in the error file to list proteins where the requested file does not exist, e.g. if we request event MTZ files but one is missing for a protein, then the errors-csv file will have that code added to it.

If so I can add this to the design document.
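
To make the file-template and missing-file ideas above concrete, here is a rough sketch. Only the PDB template is filled in, since the other extensions are still TBC. The `FILE_TEMPLATES` dict, the function names and the CSV column names are assumptions for illustration, not the actual loader or download code.

```python
import csv
import io
from typing import Dict, Iterable, List, Set, Tuple

# File type -> file name template. Adding a new type should only need a new
# entry here (or a row in a mapping table), not a code change.
FILE_TEMPLATES: Dict[str, str] = {"PDB": "{crystal}.pdb"}


def resolve_files(crystals: Iterable[str], file_type: str,
                  available: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (found, missing): file names that exist per crystal, and the
    crystals whose requested file does not exist."""
    template = FILE_TEMPLATES[file_type]
    found: Dict[str, str] = {}
    missing: List[str] = []
    for crystal in crystals:
        name = template.format(crystal=crystal)
        if name in available:
            found[crystal] = name
        else:
            missing.append(crystal)
    return found, missing


def errors_csv(missing: Iterable[str], file_type: str) -> str:
    """Render the missing files as CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["protein", "file_type", "error"])
    for crystal in missing:
        writer.writerow([crystal, file_type, "file not found"])
    return buf.getvalue()
```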

phraenquex commented 2 years ago

This epic should include versioning, and that's part of the schema update/redesign.

phraenquex commented 1 year ago

For versioning - brainstorm by @phraenquex, @tdudgeon, Daren

All uploaded data and annotations need a datestamp
The upload event needs a datestamp and owner
Only project editors (authorised by target_access_string) can upload or release
- Anyone that can view, can also edit. No distinct edit/view roles currently envisaged - out of scope
Release will be a separate event, datestamp and releaser to be recorded
- (requires new API)
all project viewers/editors to be alerted by email for both upload and release
- for Diamond, email should be available from UAS
Damage limitation for mis-released data:
- Mis-release will be knowable through the email alert
- offending data can be labelled as obsolete by (re)upload of the metadata
- this will cascade into jobs, snapshots, downloads being flagged by F/E (ticket #540)
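
A rough sketch of the upload/release bookkeeping brainstormed above. It assumes that a user is authorised when the target's `target_access_string` is among the proposals they belong to; that rule, the `Upload` class and the function names are assumptions, and the email alert is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class Upload:
    target: str
    owner: str                      # who uploaded
    uploaded_at: datetime           # upload datestamp
    released_by: Optional[str] = None
    released_at: Optional[datetime] = None


def can_edit(user_proposals: Set[str], target_access_string: str) -> bool:
    """Assumed rule: viewers and editors are the same set of people."""
    return target_access_string in user_proposals


def release(upload: Upload, user: str, user_proposals: Set[str],
            target_access_string: str,
            now: Optional[datetime] = None) -> Upload:
    """Record a release as a separate event with its own datestamp."""
    if not can_edit(user_proposals, target_access_string):
        raise PermissionError(f"{user} may not release {upload.target}")
    upload.released_by = user
    upload.released_at = now or datetime.now(timezone.utc)
    return upload
```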

tdudgeon commented 1 year ago

Initial high level spec for the new loaders: https://docs.google.com/document/d/1osK1mbaO5TrNRY8-0P5piiYodaEA_z_sgU_7SmjfzHA/edit#

phraenquex commented 1 year ago

Work remaining for epic:

1. Complete XChemAlign #999 (May 20?)
2. Complete database schema (Django) #1008 (Jun 2nd)
3. Rewrite (implement new) target loader #1055
4. Finalise B/E higher-level APIs #1056
5. Rewrite F/E #540 (???)
6. Testing
7. Download (scope out)

Things @tdudgeon has worried about:

Historical data - parallel behaviour? update existing? (100+ projects)
Non-Diamond data - density for PDB files - download maps from PDB (envisioned XCAlign v2)
Curating biomol assemblies etc. - generate easy-to-load-and-view scenes (pymol, coot, etc.)
~Visits handled differently in XCA & Fragalysis~ Not an issue - uploader decides on ONE visit at upload time.
CIF to MOL etc. Which tool? Discuss with Conor, can ask CCP4bb as well

phraenquex commented 1 year ago

Crystallographic files

Place-holders are already in the database; it should hold and serve the files, but needn't digest them. Files should be in the media dir, not in the database (that would be a long-term risk).

Historic data

Two options:

Keep separate Fragalysis instance for historic data - redirect from landing page
Fix data-finding heuristics so they gracefully report to backend (and thus frontend via API). @alanbchristie to assess.

How to handle re-alignments of existing data

The parser of the Align output should assess whether the new alignments are meaningfully different from the old ones. If not, toss the new one. It's Tim's side of the code that must do this.
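
One possible way to frame that check, purely as an illustration: the thread does not define "meaningfully different", so the RMSD metric, the 0.1 threshold and both function names below are assumptions rather than the agreed approach.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two equally ordered atom lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = sum((a - b) ** 2
                for point_a, point_b in zip(coords_a, coords_b)
                for a, b in zip(point_a, point_b))
    return math.sqrt(total / len(coords_a))


def is_meaningfully_different(old: Sequence[Coord], new: Sequence[Coord],
                              tolerance: float = 0.1) -> bool:
    """Keep the new alignment only if it moved by more than the tolerance."""
    return rmsd(old, new) > tolerance
```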

Old-style data along-side new-style data

We won't try and have them co-exist; we'll have to re-upload existing data.
We'll need to think of a mechanism to transfer tags so they stay attached to the same compounds.

non-XChem PDB files

Do align them onto each site, whether or not they have ligands bound.

This may turn out to be a mistake, but let's go with it for now.
Will certainly need front-end to handle the likely too-many-models problem - though #540 ought to do it correctly
Should we do this for all sites of every XChem model too...? The structural variation is valuable info... @phraenquex to assess.

Where do soaked compounds come from

@tdudgeon and Daren to settle on the convention.
Might be available in SoakDB already - @phraenquex had previously discussed with Daren adding an extra column

Are uploads recorded as events

No - FE/BE API will use upload datestamps to allow FE to present it properly.
