Open rsanchezgarc opened 2 years ago
The "SDF of all ligands" option in the download menu only works for the Mpro target.
For example, here https://fragalysis.diamond.ac.uk/viewer/react/preview/target/DCP2B
As you can see from this non-Mpro download snapshot
https://fragalysis.diamond.ac.uk/viewer/react/download/tag/94c837d0-04cb-4fe0-9295-851c54befb5f
there is no
Whereas for Mpro it is available as you can see here
https://fragalysis.diamond.ac.uk/viewer/react/download/tag/3147681f-8f1e-42af-a5e3-5ee67baefe66
Mpro_combined.sdf
Could this have been resolved in the latest (pre-staging) code?
When I try to reproduce this on the latest development stack (https://fragalysis-boris-default.xchem-dev.diamond.ac.uk/), I can download structures for PHIPA, for example, and I get 92 structures in the download and a PHIA_combined.sdf with 92 concatenated entries.
@alanbchristie The problem seems to remain.
I can download the combined sdf for mpro:
https://fragalysis.xchem.diamond.ac.uk/viewer/react/download/tag/c683f7c1-3c52-47f8-87b0-4daae793034b Mpro_combined.sdf
But not for any of the other targets I tried, for example https://fragalysis.xchem.diamond.ac.uk/viewer/react/download/tag/3f3ddb2a-3a2e-4514-a458-185c51cb1398
I notice that the Mpro SD files are located in two directories in the staging stack, sdfs and targets:
# find . -name "Mpro-x3351_0A*.sdf"
./sdfs/Mpro-x3351_0A_rtEVbqf.sdf
./targets/Mpro/aligned/Mpro-x3351_0A/Mpro-x3351_0A.sdf
The PGN files, on the other hand, only exist in one place (targets)...
# find . -name "PGN_RS02895PGA-x0120_0A*.sdf"
./targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf
Is it possible that the combined file only works for new uploads? The PGN file is from March 2021 whereas the Mpro file is from March 2022. If so, and if all new uploads work, maybe the best approach is to run a task to copy all the SD files from the targets directory to sdfs. That's a lot simpler than writing convoluted code to search over several places for the files.
# stat ./targets/Mpro/aligned/Mpro-x3351_0A/Mpro-x3351_0A.sdf
File: ./targets/Mpro/aligned/Mpro-x3351_0A/Mpro-x3351_0A.sdf
Size: 1169 Blocks: 8 IO Block: 1048576 regular file
Device: 74h/116d Inode: 12062919 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-04-13 13:08:45.051601321 +0000
Modify: 2022-03-01 12:10:14.980550013 +0000
Change: 2022-04-13 13:08:45.075601601 +0000
# stat ./targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf
File: ./targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf
Size: 1584 Blocks: 8 IO Block: 1048576 regular file
Device: 74h/116d Inode: 11817859 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-03-11 05:08:38.608664133 +0000
Modify: 2021-03-10 10:48:15.278230332 +0000
Change: 2021-03-11 05:08:38.614664198 +0000
So - if new uploads contain a combined file when re-downloaded then my suggestion is to run a cp command on the server and copy all the SD files from targets to sdfs.
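A rough sketch of what such a copy task could look like in Python, rather than a plain cp. The function name copy_target_sdfs is made up for illustration, and the directory layout (targets/<target>/aligned/<code>/<code>.sdf next to an sdfs directory) is assumed from the find output above. It only touches the filesystem; any database field that points at the files is left alone.

```python
import shutil
from pathlib import Path


def copy_target_sdfs(media_root: Path) -> list:
    """Copy targets/<target>/aligned/<code>/<code>.sdf files into sdfs/.

    Hypothetical helper: the layout is assumed from the find output in
    this thread. Files already present in sdfs/ under the same name are
    left untouched. Returns the destination paths that were written.
    """
    sdfs_dir = media_root / "sdfs"
    sdfs_dir.mkdir(exist_ok=True)
    copied = []
    for src in sorted(media_root.glob("targets/*/aligned/*/*.sdf")):
        # Only take the file named after its own directory (<code>/<code>.sdf).
        if src.stem != src.parent.name:
            continue
        dest = sdfs_dir / src.name
        if dest.exists():
            continue  # never overwrite an existing file
        shutil.copy2(src, dest)  # copy2 also preserves the modification time
        copied.append(dest)
    return copied
```

Running it a second time copies nothing, so it should be safe to repeat after later uploads.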
Do new uploads work?
The combined file is controlled by a sdf_file column in the Molecule table of the database. This column appears to have been added in October 2021 (2021-10). It's likely that molecule records that predate this migration have sdf_file set to Null (the column can be Null according to the Django table declaration).
When you download the structures, the download logic inspects the molecule record's sdf_file column for a filename. If there is no filename, no file is added to the combined file, and if there are no files in any molecules in the download, there will be no combined file.
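To make the behaviour described above concrete, here is a minimal sketch in plain Python. It is not the actual Fragalysis code: MoleculeRecord is a hypothetical stand-in for a row of the Molecule table, and build_combined_sdf is an invented name. It only illustrates why records with a Null sdf_file end up producing no combined file.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class MoleculeRecord:
    """Hypothetical stand-in for a row of the Molecule table."""
    protein_code: str
    sdf_file: Optional[str]  # path relative to the media root, or None (Null)


def build_combined_sdf(
    molecules: Iterable[MoleculeRecord], media_root: Path, out_path: Path
) -> Optional[Path]:
    """Concatenate the SD files named by sdf_file; return None if there are none."""
    parts = []
    for mol in molecules:
        if not mol.sdf_file:
            continue  # Null or empty sdf_file: this molecule contributes nothing
        src = media_root / mol.sdf_file
        if not src.is_file():
            continue  # filename recorded but the file is gone
        text = src.read_text()
        # Make sure each entry ends with a newline before the next one starts.
        parts.append(text if text.endswith("\n") else text + "\n")
    if not parts:
        return None  # no molecule had a usable file, so no combined file
    out_path.write_text("".join(parts))
    return out_path
```

With this shape, a download made up entirely of pre-migration records returns None, which matches the missing combined file seen for the non-Mpro targets.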
Ans.: I don't know, but my guess is that the Molecule record existed prior to the 2021-10 database migration or was part of an upload where there was no SD file?
Ans.: Sometimes, yes.
In some cases a file does exist in the filesystem (see PGN_RS02895PGA-x0120_0A from the example in the prior comment, where the file appears to be from March 2021). In these cases the download logic could be adapted so that when sdf_file is Null it searches for the file on the media volume (in this case it can be found at targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf).
Ans.: Don't know.
Ans.: Because in some cases (e.g. CAMK1DA-x0321_1) there is no SD file in the file-system. There just isn't an SD file that can be found.
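The fallback discussed in the answers above (look on the media volume when sdf_file is Null) might be sketched as follows. The function name find_sdf_on_media is hypothetical, and the targets/<target>/aligned/<code>/<code>.sdf layout is assumed from the paths quoted in this thread. It returns the first match in sorted target order, or None for cases like CAMK1DA-x0321_1 where nothing exists.

```python
from pathlib import Path
from typing import Optional


def find_sdf_on_media(media_root: Path, protein_code: str) -> Optional[Path]:
    """Return the first targets/<target>/aligned/<code>/<code>.sdf, else None.

    Hypothetical helper: the directory layout is an assumption taken from
    the example paths in this thread.
    """
    targets_dir = media_root / "targets"
    if not targets_dir.is_dir():
        return None
    # Sorted so that "first match" is deterministic across runs.
    for target_dir in sorted(targets_dir.iterdir()):
        candidate = target_dir / "aligned" / protein_code / f"{protein_code}.sdf"
        if candidate.is_file():
            return candidate
    return None
```

Because the search is repeated at download time, nothing has to be written back to the database.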
Questions...
As we cannot offer a "universal" fix (i.e. there are some molecules where SD files do not exist) should we be filling in the gaps for those that we can find? Wouldn't it simply be better to issue a "download error" - i.e. "you asked for a combined SD file but I couldn't create one"? [preferred]
If "filling in the gaps" (where we can) is considered the best solution to this problem (i.e. rather than uploading new data - if that's possible) then is the first file that can be found that matches the protein code (i.e. the first file that matches the pattern <protein-code>.sdf) safe?
Frank's policy is to make the system as robust as we can, and a partial solution is better than nothing. Nevertheless, it would be great if we could tell the data keepers that we have some inconsistency.
We can consider it safe, although I wouldn't like to patch the database with the file you found. I would prefer to repeat the file search rather than add more problematic data to the database.
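One way the "partial solution plus a report to the data keepers" idea could be sketched: given the result of whatever lookup is used for each protein code, split it into the files that go into the combined SDF and the codes that have none, and log the latter. The name split_found_and_missing and the use of a log warning as the reporting channel are assumptions for illustration only. Nothing is written back to the database.

```python
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)


def split_found_and_missing(
    resolved: Dict[str, Optional[Path]],
) -> Tuple[Dict[str, Path], List[str]]:
    """Split {protein_code: path or None} into found files and missing codes."""
    found = {code: path for code, path in resolved.items() if path is not None}
    missing = sorted(code for code, path in resolved.items() if path is None)
    if missing:
        # Hypothetical reporting channel; a real system might notify differently.
        logger.warning(
            "No SD file found for %d molecule(s): %s",
            len(missing),
            ", ".join(missing),
        )
    return found, missing
```

The missing list could also be returned to the user alongside the partial download.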
Update after discussion at weekly meeting:
OK - annoyingly I'm 98% of the way through the development and testing of the current "fishing" solution (i.e. find the file in the targets directory). It just extends how the download logic was originally working so I am reluctant to delete all the code changes I've made so far.
Our conversation yesterday makes me suspect that whatever solution you code up based on the find-the-files approach has a decent chance of not robustly solving the problem. The problem being, specifically, that the user wants the files for what they see in the UI (rather than what you're trying to extract heuristically (and heroically) from what's on disk).
So... though I understand your reluctance (been there...), be aware that it might still leave things unsolved.
Of course: by all means finish the last 2% for this bit, but then please also investigate the backend logic.
@rsanchezgarc, please comment/advise.
The half-way house might be:
Combined SDF file is only being generated for mpro in the download functionality