m2ms / fragalysis-frontend

The React, Redux frontend built by webpack

Data integrity/consistency checks and Addition of crystallographic data to download. #673

Open duncanpeacock opened 2 years ago

duncanpeacock commented 2 years ago

Moved from issue #670

Download Integrity Checks:

There have been some changes to the formats of the target zip files over the history of Fragalysis (for example, the addition of new file types). To show that the new data download service works as desired, we also need to check that the new code works with these historical target sets.

Discussion:

A unit test has already been included in the package to test the upload and check that the number of loaded proteins is as expected. What we will effectively be doing here is not only proving that the new API works on existing data, but also checking that existing target data is consistent between the database and the files in the media directory.

As this test will not be destructive and could be used in future on production, it would be sensible to make it re-runnable and to build it in a way that could potentially become a periodic integrity check, so that new data is also verified as consistent. The objective of these data integrity tests is different from a unit test: they should be coded to continue through to the end rather than failing at the first problem (as a unit test would).

It should also be noted that the test might find data issues (a good thing), but we should be prepared to consider how to fix them!

Proposed solution/Tests:

It is possible in Django to write custom management commands (see the Django documentation on custom commands). This is a simple framework that we can build on for the testing. For the initial run, it will be a case of jumping onto the stack pod and running it manually, but at some future point we could consider expanding the job and/or running it periodically.
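As a rough illustration, a minimal sketch of such a command is below. The app name (`viewer`), command name (`check_target_integrity`) and the `--target` option are hypothetical placeholders, not agreed names:

```python
# viewer/management/commands/check_target_integrity.py  (hypothetical path and name)
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Re-runnable integrity check comparing Target records against downloaded zip contents."

    def add_arguments(self, parser):
        # Optionally restrict the run to a single target while developing or debugging.
        parser.add_argument("--target", default=None, help="Check a single target by name")

    def handle(self, *args, **options):
        # Placeholder: the real command would loop over Target records, call the
        # download-structures API and compare the zip contents with the file fields
        # stored against each protein (see the processing steps below).
        target_name = options["target"]
        self.stdout.write(f"Running integrity checks (target={target_name or 'all'})")
```

It would then be run on the stack pod with `python manage.py check_target_integrity`.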

Processing

Loop through all Targets:

For each target – do three main tests:

  1. Call the (new) download structures API for the target with (almost) all boxes ticked, to get a zip file containing all the proteins for the target:

    'pdb_info', 'bound_info', 'cif_info', 'mtz_info', 'diff_info', 'event_info', 'sigmaa_info', 'trans_matrix_info', 'sdf_info', 'smiles_info'. Note that I assume the metadata file is not really important here.

  2. Loop through all proteins for the target in the database. For each file field (as above), if it is set, check that the file is present in the zip file in the right directory (see the sketch after this list).
    If there is a problem, log any issues in an error file.
    The number of SMILES columns should also match the number of molecules.

  3. (Optional) Call the existing download API to get the full zip file, then count the folders in the Aligned directory. This should match the number of proteins and the number of pdb_info files (and indeed the names should match). This has the advantage of checking that the database hasn't deviated from the zip files.
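A rough sketch of step 2 is shown below. It assumes Django models named Target and Protein carrying the file fields listed above, a `protein_set` reverse relation and `code`/`title` attributes; the actual fragalysis model and field names may differ:

```python
# Sketch of step 2: check every file recorded in the database is present in the zip.
import os
import zipfile

FILE_FIELDS = [
    "pdb_info", "bound_info", "cif_info", "mtz_info", "diff_info",
    "event_info", "sigmaa_info", "trans_matrix_info", "sdf_info", "smiles_info",
]


def check_target_zip(target, zip_path, error_log):
    """Return a list of (protein, field, filename) entries missing from the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        names_in_zip = set(archive.namelist())

    missing = []
    for protein in target.protein_set.all():          # assumed reverse relation name
        for field in FILE_FIELDS:
            file_field = getattr(protein, field, None)
            if not file_field:                        # field not set, nothing to check
                continue
            expected = os.path.basename(file_field.name)
            # The zip stores files under per-protein directories, so match on basename.
            if not any(member.endswith(expected) for member in names_in_zip):
                missing.append((protein.code, field, expected))

    for code, field, expected in missing:             # continue to the end, just record issues
        error_log.write(f"{target.title},{code},{field},{expected},MISSING\n")
    return missing
```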

Reporting:

This will be in the form of a CSV file produced for each target, plus some overall statistics. The extracted downloaded structures directory for each target will also be available.
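For the per-target CSV, something along the lines of the following would do, using Python's standard csv module; the file name and column headings here are illustrative, not a fixed specification:

```python
# Illustrative per-target report writer; column names are assumptions, not agreed values.
import csv
import os


def write_target_report(target_name, results, out_dir):
    """Write one CSV row per checked file and return simple counts for the overall stats."""
    report_path = os.path.join(out_dir, f"{target_name}_integrity.csv")
    with open(report_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["protein", "field", "expected_file", "status"])
        writer.writerows(results)
    failures = sum(1 for row in results if row[-1] != "OK")
    return {"target": target_name, "checked": len(results), "failures": failures}
```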

Estimate:

I think this will take around 3 days (there are some 'known unknowns' here, including unexpected errors caused by production data), not including any data fixes (we won't know the scope of these until we run it).

Additional: send Tyler & Frank a notification email when the test fails (with a summary).
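If we go that route, Django's standard mail helper would be enough; the addresses and subject line below are placeholders only:

```python
# Possible failure notification using Django's send_mail helper;
# recipients and subject are placeholders, not agreed values.
from django.core.mail import send_mail


def notify_on_failure(summary_text):
    send_mail(
        subject="Fragalysis target integrity check: failures detected",
        message=summary_text,
        from_email="noreply@example.org",
        recipient_list=["tyler@example.org", "frank@example.org"],
        fail_silently=False,
    )
```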

duncanpeacock commented 2 years ago

Crystallographic stuff added to #649

phraenquex commented 2 years ago

Also in scope: migrate new download mechanisms to historic targets. (These currently have a work-around that Rachael put in.)

duncanpeacock commented 2 years ago

Moved to #649.

duncanpeacock commented 2 years ago

Crystallographic download added to the data upload epic #649 as agreed in meeting of 26/01/2022.

phraenquex commented 8 months ago

Fixed in V2