labdao / plex

Platform for running comp bio applications on distributed compute and storage infrastructure
https://lab.bio
MIT License

Long jobs on PLEX complete, but do not download #153

Closed by NiklasTR 1 year ago

NiklasTR commented 1 year ago

Running diffdock on 0.3.0 as instructed.

I have seen this both when building from source on Ubuntu and when installing the binaries on macOS.

(base) rindtorff@niklas plex % ./plex -app diffdock -input-dir testdata/binding/abl -gpu=true -network=true
BACALHAU_API_HOST not set, using default host
## User input ##
Provided application name: diffdock
Provided directory path: testdata/binding/abl
Using GPU: true
Using Network: true
## Default parameters ##
Using app configs: config/app.jsonl
Setting layers to: 2
## Validating ##
App found: diffdock
## Searching input files ##
Found 3 matching files
testdata/binding/abl/7n9g.pdb
testdata/binding/abl/ZINC000003986735.sdf
testdata/binding/abl/ZINC000019632618.sdf
Created job directory /Users/rindtorff/plex/c14089f7-aa0e-43ca-948c-f157cd8929ca
added QmWmSf3hu78iVaWmDt1EVMGzxMfD6uPq9iPTbca7NVz4T6## Creating Bacalhau Job ##
Bacalhau Job Id: 2d6ffaaf-8132-4df5-acac-e00d8df50b93
Job running...
Your job results have been downloaded to /Users/rindtorff/plex/c14089f7-aa0e-43ca-948c-f157cd8929ca
(base) rindtorff@niklas plex % less /Users/rindtorff/plex/c14089f7-aa0e-43ca-948c-f157cd8929ca
(base) rindtorff@niklas plex % 
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/c14089f7-aa0e-43ca-948c-f157cd8929ca
7n9g.pdb                ZINC000003986735.sdf    ZINC000019632618.sdf    index.csv               index.jsonl

Manually pulling the Bacalhau result works as expected after setting BACALHAU_API_HOST:

(base) rindtorff@niklas plex % export BACALHAU_API_HOST=54.210.19.52                 
(base) rindtorff@niklas plex % bacalhau get 2d6ffaaf-8132-4df5-acac-e00d8df50b93     
Fetching results of job '2d6ffaaf-8132-4df5-acac-e00d8df50b93'...

Computing default go-libp2p Resource Manager limits based on:
    - 'Swarm.ResourceMgr.MaxMemory': "8.6 GB"
    - 'Swarm.ResourceMgr.MaxFileDescriptors': 30720

Applying any user-supplied overrides on top.
Run 'ipfs swarm limit all' to see the resulting limits.

Results for job '2d6ffaaf-8132-4df5-acac-e00d8df50b93' have been written to...
/Users/rindtorff/plex/job-2d6ffaaf
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/job-2d6ffaaf
combined_results        per_shard               raw
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/job-2d6ffaaf/combined_results 
outputs stderr  stdout
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/job-2d6ffaaf/combined_results/outputs 
complex_names.npy                                               index1_..-inputs-7n9g.pdb____..-inputs-ZINC000019632618.sdf
confidences.npy                                                 min_self_distances.npy
esm2_output                                                     prepared_for_esm.fasta
index0_..-inputs-7n9g.pdb____..-inputs-ZINC000003986735.sdf     run_times.npy
(base) rindtorff@niklas plex % 
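
The manual workaround above could be wrapped in a small helper. This is a minimal sketch, not part of plex: `has_results` and `fetch_fallback` are hypothetical names, and it assumes that a successful plex download leaves a non-empty combined_results/outputs folder inside the job directory, the same layout `bacalhau get` produces in its own target directory.

```shell
#!/usr/bin/env bash
# Sketch only: has_results and fetch_fallback are hypothetical helper names.

# Succeeds if the job directory contains a non-empty
# combined_results/outputs folder (assumed layout of a good download).
has_results() {
  local job_dir=$1
  [ -d "$job_dir/combined_results/outputs" ] &&
    [ -n "$(ls -A "$job_dir/combined_results/outputs" 2>/dev/null)" ]
}

# If plex left the job directory without outputs, pull them with the
# Bacalhau CLI instead. BACALHAU_API_HOST must already be exported.
fetch_fallback() {
  local job_dir=$1 job_id=$2
  if has_results "$job_dir"; then
    echo "results present in $job_dir"
  else
    echo "no results in $job_dir, running: bacalhau get $job_id"
    bacalhau get "$job_id"
  fi
}
```

With the values from the run above, this would be called as `fetch_fallback /Users/rindtorff/plex/c14089f7-aa0e-43ca-948c-f157cd8929ca 2d6ffaaf-8132-4df5-acac-e00d8df50b93` after exporting BACALHAU_API_HOST. Note that `bacalhau get` writes into a job-<id> folder in the current directory, not into the plex job directory.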
thetechnocrat-dev commented 1 year ago

Can't reproduce the error locally when building from source. Will try with the install script next.

NiklasTR commented 1 year ago

Thank you - running it again here, too

NiklasTR commented 1 year ago

Ran it again and could not reproduce the error. I will put this on hold for now and try to reproduce it tomorrow.

thetechnocrat-dev commented 1 year ago

It worked for me too on the binary version. Any chance your input files are empty in the place where it is not working?

I've noticed that curl will create the file but put a 404 error inside the file when it is not found.
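
A quick way to rule this out is to scan the input directory before submitting. The sketch below is only a heuristic: `find_suspect_inputs` is a hypothetical helper, and the patterns it uses to recognize a saved error page are guesses, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Sketch only: find_suspect_inputs is a hypothetical helper, and the
# "looks like an error page" patterns are a rough heuristic.

# Print every regular file in a directory that is empty or whose first
# line looks like a saved HTTP error body instead of real input data.
find_suspect_inputs() {
  local dir=$1 f first
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if [ ! -s "$f" ]; then
      echo "$f: empty"
      continue
    fi
    first=$(head -n 1 "$f")
    case "$first" in
      404*|*"Not Found"*|*"<html"*|*"<!DOCTYPE"*)
        echo "$f: looks like an HTTP error page" ;;
    esac
  done
}

# When fetching inputs, curl's --fail flag makes it exit non-zero on an
# HTTP error instead of saving the error body (the URL is a placeholder):
#   curl --fail -o 7n9g.pdb "$url" || echo "download failed"
```

For example, `find_suspect_inputs testdata/binding/abl` prints nothing when all three input files have plausible content.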

NiklasTR commented 1 year ago

Currently seeing the issue again.

Note that this job includes 2 complexes (one protein, two ligands), so make sure to run it with the same inputs:

(base) rindtorff@niklas plex % ./plex -app diffdock -input-dir testdata/binding/abl -gpu=true -network=true          
BACALHAU_API_HOST not set, using default host
## User input ##
Provided application name: diffdock
Provided directory path: testdata/binding/abl
Using GPU: true
Using Network: true
## Default parameters ##
Using app configs: config/app.jsonl
Setting layers to: 2
## Validating ##
App found: diffdock
## Searching input files ##
Found 3 matching files
testdata/binding/abl/7n9g.pdb
testdata/binding/abl/ZINC000003986735.sdf
testdata/binding/abl/ZINC000019632618.sdf
Created job directory /Users/rindtorff/plex/879801a8-08b6-4927-96dc-3f8f5702129c
added QmWmSf3hu78iVaWmDt1EVMGzxMfD6uPq9iPTbca7NVz4T6## Creating Bacalhau Job ##
Bacalhau Job Id: 60055584-eeaf-4d41-8124-df8874038174
Job running...
Your job results have been downloaded to /Users/rindtorff/plex/879801a8-08b6-4927-96dc-3f8f5702129c
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/879801a8-08b6-4927-96dc-3f8f5702129c/
(base) rindtorff@niklas plex % bacalhau describe 60055584-eeaf-4d41-8124-df8874038174
Job not found. ID: 60055584-eeaf-4d41-8124-df8874038174
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/879801a8-08b6-4927-96dc-3f8f5702129c/
7n9g.pdb                ZINC000003986735.sdf    ZINC000019632618.sdf    index.csv               index.jsonl

Job description:

            HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
            Reading molecules and generating local structures with RDKit (unless --keep_local_structures is turned on).
            Reading language model embeddings.
            Generating graphs for ligands and proteins
            loading data from memory:  data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings63314871/heterographs.pkl
            Number of complexes:  2
            radius protein: mean 49.7668571472168, std 0.0, max 49.7668571472168
            radius molecule: mean 9.761337280273438, std 0.4370088577270508, max 10.198346138000488
            distance protein-mol: mean 40.52650833129883, std 0.1205596923828125, max 40.64706802368164
            rmsd matching: mean 0.0, std 0.0, max 0
            common t schedule [1.   0.95 0.9  0.85 0.8  0.75 0.7  0.65 0.6  0.55 0.5  0.45 0.4  0.35
             0.3  0.25 0.2  0.15 0.1  0.05]
            Size of test dataset:  2
            Failed for 0 complexes
            Skipped 0 complexes
            Results are in ../outputs
          stdouttruncated: false
        ShardIndex: 0
        State: Completed
        UpdateTime: "2023-03-09T10:28:10.901179061Z"
        VerificationResult:
          Complete: true
          Result: true
        Version: 6
      JobID: 60055584-eeaf-4d41-8124-df8874038174
      ShardIndex: 0
      State: Completed
      UpdateTime: "2023-03-09T10:28:11.368770561Z"
      Version: 2
  State: Completed
  TimeoutAt: "0001-01-01T00:00:00Z"
  UpdateTime: "2023-03-09T10:28:11.368773261Z"
  Version: 2

And the results downloaded manually:

(base) rindtorff@niklas plex % bacalhau get 60055584-eeaf-4d41-8124-df8874038174
Fetching results of job '60055584-eeaf-4d41-8124-df8874038174'...

Computing default go-libp2p Resource Manager limits based on:
    - 'Swarm.ResourceMgr.MaxMemory': "8.6 GB"
    - 'Swarm.ResourceMgr.MaxFileDescriptors': 30720

Applying any user-supplied overrides on top.
Run 'ipfs swarm limit all' to see the resulting limits.

Results for job '60055584-eeaf-4d41-8124-df8874038174' have been written to...
/Users/rindtorff/plex/job-60055584
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/job-60055584/combined_results 
outputs stderr  stdout
(base) rindtorff@niklas plex % ls /Users/rindtorff/plex/job-60055584/combined_results/outputs 
complex_names.npy                                               index1_..-inputs-7n9g.pdb____..-inputs-ZINC000019632618.sdf
confidences.npy                                                 min_self_distances.npy
esm2_output                                                     prepared_for_esm.fasta
index0_..-inputs-7n9g.pdb____..-inputs-ZINC000003986735.sdf     run_times.npy
(base) rindtorff@niklas plex % 
NiklasTR commented 1 year ago

Found additional issue: Equibind does not process .mol2 files #158

Discovered another issue while debugging

Running another test on an Ubuntu instance (Jupyter Lab) with PLEX installed from source.

This time I am looping through a set of requests. I am seeing a failure rate of roughly 25-30% when it comes to downloading the results (3 of the 12 runs below).

ubuntu@ip-172-31-90-44:~/plex$ for dir in 6o9b 4ayt 5jh6 1p2a 3e73 4fz6 5kr2 4oz3 4ucd 2hz0 1dkd 3lxg; do echo "$dir,$(./plex -app equibind -input-dir "/home/ubuntu/PDBBind_processed/$dir" -gpu=false -network=false | grep "Your job results have been downloaded to" | awk '{print $NF}')"; done > job_results.csv

2023/03/09 10:41:50 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:43:04 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:43:17 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:44:25 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:44:36 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:45:45 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:45:56 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:46:14 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:47:22 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:48:31 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:48:39 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
2023/03/09 10:49:48 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
ubuntu@ip-172-31-90-44:~/plex$ 
ubuntu@ip-172-31-90-44:~/plex$ less job_results.csv
ubuntu@ip-172-31-90-44:~/plex$ while IFS=',' read -r dir job_dir; do [[ -n $(find "$job_dir/combined_results/outputs" -name "*.sdf" -print -quit) ]] && echo "$dir,TRUE" || echo "$dir,FALSE"; done < job_results.csv
6o9b,TRUE
4ayt,TRUE
5jh6,TRUE
1p2a,FALSE
3e73,TRUE
4fz6,TRUE
5kr2,TRUE
4oz3,FALSE
4ucd,TRUE
2hz0,TRUE
1dkd,FALSE
3lxg,TRUE
ubuntu@ip-172-31-90-44:~/plex$ 

(The contents of job_results.csv, for reference:)

6o9b,/home/ubuntu/plex/011418a4-84d6-499e-8db0-36365c3bf69e
4ayt,/home/ubuntu/plex/d15d12b8-729a-4b00-8fa5-87771c7d5a8d
5jh6,/home/ubuntu/plex/528d12a8-2b49-4c04-990c-cd761de7f60e
1p2a,/home/ubuntu/plex/26001ccc-1a9f-450d-bd2f-0df93de010aa
3e73,/home/ubuntu/plex/07687f73-c5ce-4745-8967-4f6abe9c3896
4fz6,/home/ubuntu/plex/4207fbfc-b11d-457f-b167-601ee3784293
5kr2,/home/ubuntu/plex/d7c1581f-862a-45e2-b515-09b32235d32d
4oz3,/home/ubuntu/plex/021b3fc0-5971-40ef-82e3-3cecf63511f5
4ucd,/home/ubuntu/plex/30628d20-2042-434f-a25b-48092743e493
2hz0,/home/ubuntu/plex/008f40f0-fc61-4305-a17f-7820ca7560a5
1dkd,/home/ubuntu/plex/70a99d67-70f3-4b1f-b522-6d6b3ca5f5af
3lxg,/home/ubuntu/plex/01885263-8b9e-41e3-a582-f9592f83d9fe

I am now checking whether the data can be downloaded via bacalhau.

NiklasTR commented 1 year ago

I reran the same command another time on the same machine. I am getting the same pattern of missing files.

At this point it seems the issue is not related to dropped downloads, but to errors in the ligand files and in equibind. Digging deeper shows that the jobs behind the empty directories had no successful runs, which is why their output directories are empty. The main reason is that equibind expects all ligand files to end with .sdf and does not read .mol2 in our current configuration.

NiklasTR commented 1 year ago

We will need to ship a change to the equibind container or a QC checker for sdf files in order to run equibind more reliably. For the demo, we will drop the 3 complexes with dysfunctional sdf files from the analysis and continue working with 9 complexes.
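
One possible shape for such a QC checker is sketched below. It is a structural check only: `sdf_ok` and `qc_ligands` are hypothetical names, and the only thing it relies on is that every record in an SDF file ends with a line of four dollar signs. It would not catch files that are well-formed but still fail to parse inside the container.

```shell
#!/usr/bin/env bash
# Sketch only: sdf_ok and qc_ligands are hypothetical helper names.

# Succeeds if the file is non-empty and contains the "$$$$" line that
# terminates each record in the SDF format. This is a structural check
# only and says nothing about whether the molecule itself is sane.
sdf_ok() {
  [ -s "$1" ] && grep -q '^\$\$\$\$' "$1"
}

# Classify one input directory: OK if it holds at least one usable .sdf,
# MOL2_ONLY if it only has non-empty .mol2 ligands, NO_LIGAND otherwise.
qc_ligands() {
  local dir=$1 f
  for f in "$dir"/*.sdf; do
    if [ -f "$f" ] && sdf_ok "$f"; then
      echo "OK"
      return 0
    fi
  done
  for f in "$dir"/*.mol2; do
    if [ -f "$f" ] && [ -s "$f" ]; then
      echo "MOL2_ONLY"
      return 1
    fi
  done
  echo "NO_LIGAND"
  return 1
}
```

Run over the same directories as the loop above, for example `for dir in 1p2a 4oz3 1dkd; do echo "$dir,$(qc_ligands "/home/ubuntu/PDBBind_processed/$dir")"; done`, this would flag inputs to skip before submitting a job.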

NiklasTR commented 1 year ago

Back to the problem with diffdock seen on long-running jobs.

Currently running diffdock with the following loop:

for dir in 6o9b 4ayt 5jh6 3e73 4fz6 5kr2 4ucd 2hz0 3lxg; do echo "$dir,$(./plex -app diffdock -input-dir "/home/ubuntu/PDBBind_processed/$dir" -gpu=true -network=true | awk '/Created job directory/ {gsub(/\/$/, "", $NF); printf("%s,",$NF)} /Bacalhau Job Id/ {print $NF}')"; done > "job_results_$now.csv"

Results so far:

No job directory has output data, while all manual bacalhau pulls have data.

Example below:

ubuntu@ip-172-31-90-44:~/plex$ ls /home/ubuntu/plex/07d532d6-3856-4679-a5ea-20ce5ff3e98b/
4ayt_ligand.mol2            4ayt_protein_processed.pdb  index.jsonl                 
4ayt_ligand.sdf             index.csv    
ubuntu@ip-172-31-90-44:~/plex$ ls /home/ubuntu/plex/job-21f5fc6c/combined_results/outputs/
complex_names.npy  index0_..-inputs-4ayt_protein_processed.pdb____..-inputs-4ayt_ligand.sdf  run_times.npy
confidences.npy    min_self_distances.npy
esm2_output        prepared_for_esm.fasta
NiklasTR commented 1 year ago

testdata/binding/abl

I checked that the directory exists and validated its contents. From my perspective, empty input files should not be the issue.

NiklasTR commented 1 year ago

Waiting for #115 for more efficient debugging.

NiklasTR commented 1 year ago

Closing this, as Equibind now also handles .mol2 files.