Combined SDF file is only being generated for mpro

rsanchezgarc commented 2 years ago

Combined SDF file is only being generated for mpro in the download functionality

rsanchezgarc commented 2 years ago

The option SDF of all ligands in the download menu only works for Mpro target

For example, here https://fragalysis.diamond.ac.uk/viewer/react/preview/target/DCP2B

As you can see from this non-Mpro download snapshot

https://fragalysis.diamond.ac.uk/viewer/react/download/tag/94c837d0-04cb-4fe0-9295-851c54befb5f

there is no _combined.sdf file

Whereas for Mpro it is available as you can see here

https://fragalysis.diamond.ac.uk/viewer/react/download/tag/3147681f-8f1e-42af-a5e3-5ee67baefe66

Mpro_combined.sdf

alanbchristie commented 2 years ago

Could this have been resolved in the latest (pre-staging) code?

When I try to reproduce this on the latest development stack (https://fragalysis-boris-default.xchem-dev.diamond.ac.uk/) I can download structures for PHIPA, for example. I get 92 structures in the download and a PHIA_combined.sdf with 92 concatenated entries.

rsanchezgarc commented 2 years ago

@alanbchristie The problem seems to remain.

I can download the combined sdf for mpro:

https://fragalysis.xchem.diamond.ac.uk/viewer/react/download/tag/c683f7c1-3c52-47f8-87b0-4daae793034b Mpro_combined.sdf

But not for any of the other targets I tried, for example https://fragalysis.xchem.diamond.ac.uk/viewer/react/download/tag/3f3ddb2a-3a2e-4514-a458-185c51cb1398

alanbchristie commented 2 years ago

I notice that the Mpro SD files are located in two directories in the staging stack. In sdfs and targets: -

# find . -name "Mpro-x3351_0A*.sdf"
./sdfs/Mpro-x3351_0A_rtEVbqf.sdf
./targets/Mpro/aligned/Mpro-x3351_0A/Mpro-x3351_0A.sdf

The PGN files on the other hand only exist in one place (targets)...

# find . -name "PGN_RS02895PGA-x0120_0A*.sdf"
./targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf

Is it possible that the combined only works for new uploads?

The PGN is from March 2021 whereas Mpro is from March 2022. If so, and if all new uploads work, maybe the best approach is to run a task to copy all the SD files from the targets directory to sdfs. That's a lot simpler than writing convoluted code to search over several places for the files.

# stat ./targets/Mpro/aligned/Mpro-x3351_0A/Mpro-x3351_0A.sdf
  File: ./targets/Mpro/aligned/Mpro-x3351_0A/Mpro-x3351_0A.sdf
  Size: 1169        Blocks: 8          IO Block: 1048576 regular file
Device: 74h/116d    Inode: 12062919    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-04-13 13:08:45.051601321 +0000
Modify: 2022-03-01 12:10:14.980550013 +0000
Change: 2022-04-13 13:08:45.075601601 +0000

# stat ./targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf
  File: ./targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf
  Size: 1584        Blocks: 8          IO Block: 1048576 regular file
Device: 74h/116d    Inode: 11817859    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-03-11 05:08:38.608664133 +0000
Modify: 2021-03-10 10:48:15.278230332 +0000
Change: 2021-03-11 05:08:38.614664198 +0000

So - if new uploads contain a combined file when re-downloaded then my suggestion is to run a cp command on the server and copy all the SD files from targets to sdfs.
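The copy step suggested here could be sketched roughly as follows. This is only an illustration: the function name `copy_target_sdfs`, the `media_root` argument and the `dry_run` flag are hypothetical, and the glob pattern assumes the `targets/<target>/aligned/<code>/<code>.sdf` layout seen in the `find` output above. Whether the download code would then pick up the copied files is not established by this sketch.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def copy_target_sdfs(media_root: Path, dry_run: bool = True) -> list[Path]:
    """Copy every aligned SD file under targets/ into sdfs/.

    Returns the destination paths that were (or, in a dry run, would be) written.
    """
    sdfs_dir = media_root / "sdfs"
    copied = []
    # Assumed layout: targets/<target>/aligned/<code>/<code>.sdf
    for src in sorted(media_root.glob("targets/*/aligned/*/*.sdf")):
        dst = sdfs_dir / src.name
        if dst.exists():
            continue  # never overwrite a file that is already in sdfs
        if not dry_run:
            sdfs_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(dst)
    return copied
```

Running it first with `dry_run=True` lists what would be copied without touching the volume.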

Do new uploads work?

alanbchristie commented 2 years ago

The combined file is controlled by a sdf_file column in the Molecule table of the database. This column appears to have been added in October 2021 (2021-10). It's likely that molecule records that predate this migration have sdf_file set to Null (the column can be Null according to the django table declaration).

When you download the structures the download logic inspects the molecule record's sdf_file column for a filename. If there is no filename no file is added to the combined file, and, if there are no files in any molecules in the download, there will be no combined file.
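To make the behaviour described above concrete, here is a minimal sketch of that logic. It is not the actual Fragalysis code: the `Molecule` dataclass is a stand-in for the Django model, and `build_combined_sdf`, `media_root` and `out_path` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class Molecule:
    """Stand-in for the Django model; only the fields used here."""
    protein_code: str
    sdf_file: Optional[str]  # nullable; assumed relative to the media root


def build_combined_sdf(
    molecules: Iterable[Molecule], media_root: Path, out_path: Path
) -> Optional[Path]:
    """Concatenate the SD files named by sdf_file; return None if there are none."""
    blocks = []
    for mol in molecules:
        if not mol.sdf_file:
            continue  # Null/empty column: this molecule contributes nothing
        text = (media_root / mol.sdf_file).read_text()
        # Make sure each file ends with a newline before the next one starts.
        blocks.append(text if text.endswith("\n") else text + "\n")
    if not blocks:
        return None  # no molecule had a file, so no combined file is written
    out_path.write_text("".join(blocks))
    return out_path
```

With this shape, a target whose molecules all have a Null `sdf_file` silently produces no combined file, which matches the symptom reported for the non-Mpro targets.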

Why is the sdf column empty?

Ans.: I don't know, but my guess is that the Molecule record existed prior to the 2021-10 database migration or was part of an upload where there was no SD file?

Can we "fill-in the gaps"?

Ans.: Sometimes, yes.

In some cases a file does exist in the filesystem (see PGN_RS02895PGA-x0120_0A from the example in the prior comment, where the file appears to be from March 2021). In these cases the download logic could be adapted so that when sdf_file is Null it searches for the file on the media volume (in this case it can be found at targets/PGN_RS02895PGA/aligned/PGN_RS02895PGA-x0120_0A/PGN_RS02895PGA-x0120_0A.sdf).

Is "filling in the gaps" safe?

Ans.: Don't know.

Why only sometimes?

Ans.: Because in some cases (e.g. CAMK1DA-x0321_1) there is no SD file in the filesystem. There just isn't an SD file that can be found.
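The "fill in the gaps" idea from this comment could look something like the sketch below. It is an assumption-laden illustration rather than the real download code: `find_sdf` is a hypothetical helper, it assumes `sdf_file` is stored relative to the media root, and it takes the first file under `targets/` named `<protein-code>.sdf` without checking that it is the right one.

```python
from pathlib import Path
from typing import Optional


def find_sdf(
    media_root: Path, protein_code: str, sdf_file: Optional[str]
) -> Optional[Path]:
    """Return the SD file for a molecule, searching targets/ if sdf_file is unset."""
    if sdf_file:
        return media_root / sdf_file  # the database already names the file
    targets = media_root / "targets"
    if not targets.is_dir():
        return None
    # Fallback: first file anywhere under targets/ called <protein_code>.sdf
    matches = sorted(targets.rglob(f"{protein_code}.sdf"))
    return matches[0] if matches else None
```

A `None` result corresponds to the "only sometimes" case (e.g. CAMK1DA-x0321_1), where the caller still has to decide between skipping the molecule and reporting an error.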

alanbchristie commented 2 years ago

Questions...

As we cannot offer a "universal" fix (i.e. there are some molecules where SD files do not exist) should we be filling in the gaps for those that we can find? Wouldn't it simply be better to issue a "download error" - i.e. "you asked for a combined SD file but I couldn't create one". [preferred]
If "filling in the gaps" (where we can) is considered the best solution to this problem (i.e. rather than uploading new data - if that's possible) then is the first file that can be found that matches the protein code (i.e. the first file that matches the pattern <protein-code>.sdf) safe?

rsanchezgarc commented 2 years ago

Frank's policy is to make the system as robust as we can, and a partial solution is better than nothing. Nevertheless, it would be great if we could tell the data keepers that we have some inconsistency.
We can consider it safe, although I wouldn't like to patch the database with the file you found. I would prefer to repeat the file search rather than add more problematic data to the database.

phraenquex commented 2 years ago

Update after discussion at weekly meeting:

Look at the backend logic for when it serves the SDF files to the front-end; this is the logic that the download code should use as well. (Boris confirms there is no error-catching logic for handling SDFinfo, so we conclude that it must exist in the backend.)
@rsanchezgarc says that the CAMK1DA-x0321_1 example does correctly render in the frontend, even though it's missing from the data directory - so good example of the problem and suggested solution.

alanbchristie commented 2 years ago

OK - annoyingly I'm 98% of the way through the development and testing of the current "fishing" solution (i.e. find the file in the targets directory). It just extends how the download logic was originally working, so I am reluctant to delete all the code changes I've made so far.

phraenquex commented 2 years ago

Our conversation yesterday makes me suspect that whatever solution you code up based on the find-the-files approach has a decent chance of not robustly solving the problem. The problem being, specifically, that the user wants the files for what they see in the UI (rather than what you're trying to extract heuristically (and heroically) from what's on disk).

So... though I understand your reluctance (been there...), be aware that it might still leave things unsolved.

Of course: by all means finish the last 2% for this bit, but then please do investigate too the backend-logic.

@rsanchezgarc, please comment/advise.

The half-way house might be:

finish @alanbchristie 's find-the-files approach and release
make (and prioritise) a new ticket to investigate, and possibly implement, the do-what-the-backend-does approach

m2ms / fragalysis-frontend