plinder-org / plinder

Protein Ligand INteraction Dataset and Evaluation Resource
https://plinder.sh
Apache License 2.0

Plinder system performance improvements #39

Closed tjduigna closed 2 months ago

tjduigna commented 2 months ago

Iterating over plinder systems exposes some inefficiencies that are not obvious when working with only a few samples. This PR accomplishes the following:

With all of these changes, iterating over PlinderSystems and using the PlinderDataset becomes reasonable in terms of runtime performance. Setting the PLINDER_OFFLINE=true environment variable speeds things up further.

github-actions[bot] commented 2 months ago

Coverage report

| File | Lines missing |
|---|---|
| src/plinder/core/index/utils.py | 181, 232, 240, 255, 269, 271 |
| src/plinder/core/loader/loader.py | 11, 53-55, 78-84, 95-96 |
| src/plinder/core/split/utils.py | |
| src/plinder/core/system/system.py | 98, 242-247 |
| src/plinder/core/system/utils.py | 24 |
| src/plinder/core/utils/config.py | |
| src/plinder/core/utils/cpl.py | 74, 101, 155, 170-171 |
| src/plinder/core/utils/unpack.py | 131-133, 143 |
| src/plinder/eval/docking/utils.py | |
| src/plinder/eval/docking/write_scores.py | |
| Project Total | |

This report was generated by python-coverage-comment-action

tjduigna commented 2 months ago

Some timing results:

please run:

pip install plinder[loader]

to enable the data loader

2024-08-30 18:41:03,267 | plinder.core.index.utils:190 | INFO : Syncing gs://plinder/2024-06/v2 -> ~/.local/share/plinder/2024-06/v2. If this is the first time you are running this command, it will take a while!

The estimated time on the progress bar may vary wildly based on varied file sizes. If you need to cancel this and come back to it, it will pick up where it left off.

2024-08-30 18:41:04,090 | plinder.core.index.utils:228 | INFO : Syncing clusters
100%|██████████| 665/665 [00:16<00:00, 40.17it/s]
2024-08-30 18:41:20,851 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 16.63s
2024-08-30 18:41:20,886 | plinder.core.index.utils:228 | INFO : Syncing entries
100%|██████████| 1060/1060 [00:22<00:00, 47.36it/s]
2024-08-30 18:41:43,602 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 22.53s
2024-08-30 18:41:43,635 | plinder.core.index.utils:228 | INFO : Syncing fingerprints
100%|██████████| 3/3 [00:00<00:00, 4.24it/s]
2024-08-30 18:41:44,425 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.71s
2024-08-30 18:41:44,472 | plinder.core.index.utils:228 | INFO : Syncing index
100%|██████████| 2/2 [00:07<00:00, 3.53s/it]
2024-08-30 18:41:51,597 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 7.06s
2024-08-30 18:41:51,646 | plinder.core.index.utils:228 | INFO : Syncing ligand_scores
100%|██████████| 513/513 [00:09<00:00, 56.18it/s]
2024-08-30 18:42:01,070 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 9.32s
2024-08-30 18:42:01,109 | plinder.core.index.utils:228 | INFO : Syncing ligands
100%|██████████| 2191/2191 [00:18<00:00, 121.04it/s]
2024-08-30 18:42:19,751 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 18.30s
2024-08-30 18:42:19,789 | plinder.core.index.utils:228 | INFO : Syncing links
100%|██████████| 2/2 [00:00<00:00, 3.87it/s]
2024-08-30 18:42:20,360 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.52s
2024-08-30 18:42:20,393 | plinder.core.index.utils:228 | INFO : Syncing linked_structures, this may take a while!
100%|██████████| 1060/1060 [06:07<00:00, 2.88it/s]
2024-08-30 18:48:28,266 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 367.68s
2024-08-30 18:48:28,307 | plinder.core.index.utils:236 | INFO : extracting linked_structures archives, you may want to stretch your legs.
2024-08-30 18:48:30,552 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-08-30 18:48:30,552 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
100%|██████████| 1060/1060 [06:13<00:00, 2.84it/s]
2024-08-30 18:54:44,417 | plinder.core.index.utils:228 | INFO : Syncing mmp
100%|██████████| 2/2 [00:00<00:00, 2.21it/s]
2024-08-30 18:54:45,417 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.91s
2024-08-30 18:54:45,455 | plinder.core.index.utils:209 | INFO : Syncing scores/search_db=apo
100%|██████████| 1/1 [00:02<00:00, 2.53s/it]
2024-08-30 18:54:48,052 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 2.53s
2024-08-30 18:54:48,100 | plinder.core.index.utils:209 | INFO : Syncing scores/search_db=pred
100%|██████████| 1/1 [00:00<00:00, 2.23it/s]
2024-08-30 18:54:48,612 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.45s
2024-08-30 18:54:48,670 | plinder.core.index.utils:209 | INFO : Syncing scores/search_db=holo, this may take a while!
2024-08-30 18:54:48,670 | plinder.core.index.utils:211 | INFO : the tqdm progress bar for holo is not very useful, please be patient!
100%|██████████| 36/36 [09:34<00:00, 15.95s/it]
2024-08-30 19:04:23,129 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 574.38s
2024-08-30 19:04:23,174 | plinder.core.index.utils:228 | INFO : Syncing splits
100%|██████████| 2/2 [00:00<00:00, 5.48it/s]
2024-08-30 19:04:23,610 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.37s
2024-08-30 19:04:23,659 | plinder.core.index.utils:228 | INFO : Syncing systems, this may take a while!
100%|██████████| 1060/1060 [19:05<00:00, 1.08s/it]
2024-08-30 19:23:29,434 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 1145.58s
2024-08-30 19:23:29,482 | plinder.core.index.utils:236 | INFO : extracting systems archives, you may want to stretch your legs.
2024-08-30 19:23:41,984 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-08-30 19:23:41,984 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
100%|██████████| 1060/1060 [23:46<00:00, 1.35s/it]
2024-08-30 19:23:42,001 | plinder.core.index.utils:242 | INFO : Sync complete in 36.12m!
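
To see which sync stage dominates the wall time, the `Syncing <stage>` and `runtime succeeded` messages in the log above can be paired up. The following is only a rough sketch: `stage_runtimes` is a hypothetical helper, not part of plinder, and it assumes one log message per line in the format shown above.

```python
import re

# Patterns modeled on the log messages shown above.
STAGE_RE = re.compile(r"INFO : Syncing ([\w/=]+)")
RUNTIME_RE = re.compile(r"runtime succeeded: (\d+(?:\.\d+)?)s")


def stage_runtimes(log_lines):
    """Pair each 'Syncing <stage>' message with the first runtime reported after it.

    Later runtimes for the same stage (for example the 0.00s entries logged
    during archive extraction) are ignored.
    """
    runtimes = {}
    current = None
    for line in log_lines:
        stage = STAGE_RE.search(line)
        if stage:
            current = stage.group(1)
            continue
        runtime = RUNTIME_RE.search(line)
        if runtime and current is not None and current not in runtimes:
            runtimes[current] = float(runtime.group(1))
    return runtimes


# A few lines copied from the log above.
sample = [
    "2024-08-30 18:41:04,090 | plinder.core.index.utils:228 | INFO : Syncing clusters",
    "2024-08-30 18:41:20,851 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 16.63s",
    "2024-08-30 18:41:20,886 | plinder.core.index.utils:228 | INFO : Syncing entries",
    "2024-08-30 18:41:43,602 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 22.53s",
    "2024-08-30 18:54:45,455 | plinder.core.index.utils:209 | INFO : Syncing scores/search_db=apo",
    "2024-08-30 18:54:48,052 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 2.53s",
]

runtimes = stage_runtimes(sample)
slowest = max(runtimes, key=runtimes.get)
print(runtimes)
print("slowest stage in sample:", slowest)
```

Run against the full log, the same pairing would surface `systems`, `scores/search_db=holo`, and `linked_structures` as the stages with the largest reported runtimes.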

If you downloaded all of the data, you can run:

export PLINDER_OFFLINE=true

This will avoid checking that files are still in sync when using plinder.core. If you didn't download all of the data, plinder.core will download it lazily when it's needed. By default, plinder.core will check that files are still in sync in case any of the files for an existing release need to be patched.
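
The same flag can also be set from Python rather than the shell. This is a minimal sketch, assuming that plinder reads `PLINDER_OFFLINE` from the process environment when its configuration is first loaded, so the variable has to be set before `plinder.core` is imported. The `is_offline` helper is hypothetical and only illustrates how such a flag is typically interpreted.

```python
import os


def is_offline(environ=os.environ) -> bool:
    """Hypothetical helper: treat PLINDER_OFFLINE=true as 'skip sync checks'."""
    return environ.get("PLINDER_OFFLINE", "").strip().lower() == "true"


# Equivalent of `export PLINDER_OFFLINE=true` for the current process.
# Assumption: this must run before plinder.core is imported, since the
# setting is presumably read when the configuration is loaded.
os.environ["PLINDER_OFFLINE"] = "true"

print("offline mode requested:", is_offline())

# import plinder.core  # imported only after the flag is set
```

Only use this once the data has been fully downloaded. Per the message above, offline mode skips the in-sync check, so patched files for an existing release will not be picked up.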

- Subsequent re-run after the dataset was downloaded:
```console
[tjd plinder]$ plinder_download --yes
...
Sync complete in 41.47s!
...
```
tjduigna commented 2 months ago

Timing results for PlinderDataset with the following code snippet:

from time import time

from plinder.core import get_split, PlinderDataset

split = get_split()
dataset = PlinderDataset(df=split, load_alternative_structures=True)
for i in range(dataset._num_examples):
    t0 = time()
    dataset[i]
    t1 = time()
    print(f"time for index {i}: {t1 - t0:.2f}s")

time for index 0: 0.40s
time for index 1: 0.19s
time for index 2: 0.18s
time for index 3: 0.19s
time for index 4: 0.21s
time for index 5: 0.23s
time for index 6: 0.19s
time for index 7: 0.22s
time for index 8: 0.19s
time for index 9: 0.23s
time for index 10: 0.23s
...
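
To compare runs more easily, the loop above can be wrapped in a helper that collects the per-index timings and summarizes them. This is a sketch, not code from the PR: `time_indexing` and `summarize` are hypothetical names, and it assumes the dataset supports `len()` and integer indexing (the original snippet uses `dataset._num_examples` instead). A plain list stands in for `PlinderDataset` so the example runs without the data.

```python
from statistics import median
from time import perf_counter


def time_indexing(dataset, limit=None):
    """Return the seconds spent on dataset[i] for each of the first `limit` indices."""
    n = len(dataset) if limit is None else min(limit, len(dataset))
    timings = []
    for i in range(n):
        t0 = perf_counter()
        dataset[i]  # the access being measured
        timings.append(perf_counter() - t0)
    return timings


def summarize(timings):
    """Separate the first (warm-up) access from the steady-state accesses."""
    rest = timings[1:] or timings
    return {
        "first": timings[0],
        "median_rest": median(rest),
        "total": sum(timings),
    }


# Stand-in dataset; with plinder installed this would be the PlinderDataset above.
stand_in = list(range(100))
print(summarize(time_indexing(stand_in, limit=10)))

# Summary of the timings reported above for indices 0-10.
reported = [0.40, 0.19, 0.18, 0.19, 0.21, 0.23, 0.19, 0.22, 0.19, 0.23, 0.23]
print(summarize(reported))
```

Reporting the first access separately keeps one-time costs, such as loading the index, from skewing the steady-state figure.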

- on PR branch with `PLINDER_OFFLINE=true`
```console
time for index 0: 0.24s
time for index 1: 0.03s
time for index 2: 0.03s
time for index 3: 0.03s
time for index 4: 0.03s
time for index 5: 0.03s
time for index 6: 0.03s
time for index 7: 0.08s
time for index 8: 0.03s
time for index 9: 0.08s
time for index 10: 0.07s
...