Closed ejmeitz closed 4 months ago
The 47,717 corresponds to the total count of distinct PDB structures assigned as an apo monomer. There can be repeats across different pinder systems, as well as within a single system when it is, e.g., a homodimer.
The apo structures are surfaced in the index via 4 different columns:
- apo_R_pdb: the "canonical" apo monomer selected for the receptor chain of the dimer
- apo_L_pdb: the "canonical" apo monomer selected for the ligand chain of the dimer
- apo_R_pdbs: all apo structures assigned to the receptor chain of the dimer, joined into a single string separated by a semicolon
- apo_L_pdbs: all apo structures assigned to the ligand chain of the dimer, joined into a single string separated by a semicolon

For AF2 structures, the situation is similar: one AF2 monomer can be repeated within a dimer (when the receptor and ligand share a UniProt accession) and repeated across dimers (potentially as receptor in one dimer and ligand in another).
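As a toy illustration of that repetition (all IDs below are hypothetical, not real pinder entries), the same apo monomer file can occur several times across the semicolon-joined columns while counting only once as a distinct structure:

```python
# Toy illustration (hypothetical IDs): the same apo monomer can appear in the
# apo_R_pdbs / apo_L_pdbs columns of several systems, and twice within one
# homodimer, so raw occurrences exceed the number of distinct PDB files.
rows = [
    # Homodimer: receptor and ligand share the same apo monomer
    {"apo_R_pdbs": "1abc__A1_P11111.pdb", "apo_L_pdbs": "1abc__A1_P11111.pdb"},
    # A different system reusing that monomer, plus a second candidate
    {"apo_R_pdbs": "1abc__A1_P11111.pdb;2xyz__B1_P11111.pdb", "apo_L_pdbs": ""},
]

occurrences = []
unique = set()
for row in rows:
    for col in ("apo_R_pdbs", "apo_L_pdbs"):
        pdbs = [p for p in row[col].split(";") if p]  # drop empty "no apo" entries
        occurrences.extend(pdbs)
        unique.update(pdbs)

print(len(occurrences), len(unique))  # 4 occurrences, but only 2 distinct files
```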
To reproduce the counts in the paper, you can do something like this:
from tqdm import tqdm
from pinder.core import get_index

index = get_index()
apo_pdbs = set()
for pdb_str in tqdm(set(index.apo_R_pdbs).union(set(index.apo_L_pdbs))):
    # Split the full set of apo monomer PDBs in the string (separated by semicolons)
    apo_pdbs = apo_pdbs.union(set(pdb_str.split(";")))
# Remove the empty (no apo) string from the counts
apo_pdbs = apo_pdbs - {""}
# Up to one AF monomer per chain (unlike apo, where there can be multiple)
af_pdbs = set(index.predicted_R_pdb).union(set(index.predicted_L_pdb)) - {""}
print(len(apo_pdbs), len(af_pdbs))
>>> (47717, 42827)
The PinderSystem object provides a mechanism to load alternate apo structures other than the canonical one, if desired:
https://github.com/pinder-org/pinder/blob/8ad1ead7a174736635c13fa7266d9ca54cf9f44e/tests/core/loader/test_base.py#L104-L108
When the PDB codes are not provided to the PinderSystem constructor, the default canonical apo PDBs are used.
Thank you
Actually, related to this: is there no quick way to load these structures into an AtomArray? I want to run some filters over them. You can't make them into a PinderSystem, since it's just the one structure. You can load them as an AtomArray with PinderSystem.load_structure, but that requires you to know the path.
The names returned by index.apo_R_pdbs aren't the actual file names as far as I can tell; I think they're just missing the R or the L suffix. I could try all name pairings and see which file exists, but I'm wondering if there's a more elegant way to do this?
Edit: Nvm, that question isn't actually a problem. Although I cannot find some of the PDBs generated by the above code. For example, 3t83__A1_P00918 does not appear in the pdbs or test_set_pdbs folders. I searched for and loaded a system that had this as its apo, and it looks like it downloaded it from the cloud? Are all the PDBs not in the dataset download, or does it just mean my download was messed up somehow?
> Although I cannot find some of the pdbs generated by the above code. For example, 3t83__A1_P00918 does not appear in the pdbs or test_set_pdbs folders. I searched and loaded a system that had this as its apo and it looks like it downloaded it from the cloud? Are all the PDBs not in the dataset download?? or does it just mean my download was messed up somehow?
@ejmeitz valid question and apologies for the confusion, will add this to the documentation.
Currently, the archive containing PDB files (pdbs.zip) only contains the PDB files associated with the "main" splits (train, val, test). Systems in the invalid split are not included, but can be downloaded ad hoc, either via PinderSystem or by downloading the full pdbs/ bucket path. This was done to reduce the required disk space for the most common use case, where the invalid split is less likely to be needed.
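One way to anticipate which systems will need an ad-hoc download is to pre-filter on the index's split column before touching any files. The sketch below uses a toy list of dicts as a stand-in for the real index dataframe (which comes from pinder's get_index()):

```python
# Sketch under the assumption described above: only systems in the "main"
# splits (train / val / test) ship in pdbs.zip; anything else needs an
# ad-hoc download. A toy list of dicts stands in for the real index.
toy_index = [
    {"id": "sysA", "split": "train"},
    {"id": "sysB", "split": "invalid"},
    {"id": "sysC", "split": "test"},
]

MAIN_SPLITS = {"train", "val", "test"}

# Systems whose PDB files are already in the extracted pdbs.zip archive
bundled = [r["id"] for r in toy_index if r["split"] in MAIN_SPLITS]
# Systems whose files would have to be fetched ad hoc (e.g. via PinderSystem)
ad_hoc = [r["id"] for r in toy_index if r["split"] not in MAIN_SPLITS]

print(bundled, ad_hoc)  # ['sysA', 'sysC'] ['sysB']
```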
As you discovered, when the files associated with a PinderSystem are not found in the local filesystem, they are automatically downloaded by this method:
https://github.com/pinder-org/pinder/blob/8ad1ead7a174736635c13fa7266d9ca54cf9f44e/src/pinder-core/pinder/core/index/system.py#L512-L543
So you could get all of the apo PDBs loaded by creating a PinderSystem object for each of them, or you could replicate what that download_entry method is doing, but only for the apo PDBs of interest.
So, extending the previous example I shared, it would look something like this:
from pathlib import Path

from tqdm import tqdm

from pinder.core import get_pinder_location, get_index, get_pinder_bucket_root
from pinder.core.loader.structure import Structure
from pinder.core.utils.cloud import Gsutil

index = get_index()
apo_pdbs = set()
for pdb_str in tqdm(set(index.apo_R_pdbs).union(set(index.apo_L_pdbs))):
    # Split the full set of apo monomer PDBs in the string (separated by semicolons)
    apo_pdbs = apo_pdbs.union(set(pdb_str.split(";")))
# Remove the empty (no apo) string from the counts
apo_pdbs = apo_pdbs - {""}

# Download missing apo PDB files selectively
pdbs_path = get_pinder_location() / "pdbs"
missing_pdbs = [p for p in apo_pdbs if not (pdbs_path / p).is_file()]
gs = Gsutil()
sources = []
for p in missing_pdbs:
    # Bucket path to the missing PDB file
    pdb_source = f"{get_pinder_bucket_root()}/pdbs/{p}"
    # Optional, but contains residue-level information
    map_source = f"{get_pinder_bucket_root()}/mappings/{p.replace('.pdb', '.parquet')}"
    sources.extend([pdb_source, map_source])
if sources:
    anchor = get_pinder_bucket_root()
    gs.cp_paths(sources, get_pinder_location(), anchor)

# Now load any arbitrary apo structure
apo_pdb_paths = [get_pinder_location() / f"pdbs/{p}" for p in missing_pdbs]
struct = Structure(apo_pdb_paths[0])
struct
That's useful, but it does not quite seem true? For example, 7m4x__V1_A0A009I821--7m4x__DA1_B7I7A4 is in the train split, but it still downloaded PDBs when I did PinderSystem("7m4x__V1_A0A009I821--7m4x__DA1_B7I7A4"). For this system I was only interested in the holo structure, but maybe it was downloading the apos since the PinderSystem wants all of it?
I mainly ask because the downloading is really slow and makes it hard to get the subsets of the data I'm interested in.
Also some broken files:
..../pdbs/3hr2__A3_P02454-R.pdb and ..../pdbs/3hr2__B2_P02466-R.pdb are unable to be loaded. The error is: '0 10.0' cannot be parsed into a number
..../pdbs/4lkf__D1_UNDEFINED-L.pdb is basically blank

Hm, have you downloaded the full dataset via pinder_download? On my end, I wasn't able to reproduce any unexpected file downloads after running pinder_download from scratch.
Download:
pinder_download
2024-08-06 11:49:01,217 | pinder.core.index.utils:485 | INFO : Using /home/danielkovtun/.local/share as base directory for Pinder 2024-02
2024-08-06 11:49:01,218 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=3, items=3
Downloading 2024-02/mappings.zip: 100%|…| 6.74G/6.74G [01:10<00:00, 103MB/s]
Downloading 2024-02/pdbs.zip: 100%|…| 157G/157G [19:12<00:00, 147MB/s]
2024-08-06 12:08:13,836 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 1152.62s
Extracting /home/danielkovtun/.local/share/pinder/2024-02/pdbs.zip: 100%|…| 2729359/2729359 [2:25:52<00:00, 311.83it/s]
Extracting /home/danielkovtun/.local/share/pinder/2024-02/test_set_pdbs.zip: 100%|…| 3915/3915 [00:05<00:00, 692.93it/s]
Extracting /home/danielkovtun/.local/share/pinder/2024-02/mappings.zip: 100%|…| 1448951/1448951 [06:57<00:00, 3468.00it/s]
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pinder.core import PinderSystem
>>> ps = PinderSystem("7m4x__V1_A0A009I821--7m4x__DA1_B7I7A4")
>>>
If you did run that command, perhaps it got interrupted before completion?
In any case, the code I linked above can be used to selectively download any individual PDB (whether holo / apo / predicted).
gs://pinder/2024-02/pdbs/{row.id}.pdb
gs://pinder/2024-02/pdbs/{row.holo_R_pdb}
gs://pinder/2024-02/pdbs/{row.holo_L_pdb}
gs://pinder/2024-02/pdbs/{row.predicted_R_pdb}
gs://pinder/2024-02/pdbs/{row.predicted_L_pdb}
gs://pinder/2024-02/pdbs/{row.apo_R_pdb}
gs://pinder/2024-02/pdbs/{row.apo_L_pdb}
[f"gs://pinder/2024-02/pdbs/{apo}" for apo in row.apo_R_pdbs.split(";") if apo != ""]
[f"gs://pinder/2024-02/pdbs/{apo}" for apo in row.apo_L_pdbs.split(";") if apo != ""]
Where above, row is a single row of the index dataframe. You can pre-filter the index to contain your desired subsets of the data and then produce the GCS bucket paths according to the spec above if you only want a subset of them. (Alternatively, if you have the space for it, run pinder_download once and save future downloading time.)
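The path spec above can be collected into a small helper. This is a sketch, assuming row behaves like an object with the listed index columns as attributes; the row values below are hypothetical, made-up entries for illustration, not real pinder IDs:

```python
from types import SimpleNamespace

BUCKET = "gs://pinder/2024-02"


def row_to_gcs_paths(row):
    """Build the GCS paths for one index row, following the spec above."""
    paths = [
        f"{BUCKET}/pdbs/{row.id}.pdb",
        f"{BUCKET}/pdbs/{row.holo_R_pdb}",
        f"{BUCKET}/pdbs/{row.holo_L_pdb}",
        f"{BUCKET}/pdbs/{row.predicted_R_pdb}",
        f"{BUCKET}/pdbs/{row.predicted_L_pdb}",
        f"{BUCKET}/pdbs/{row.apo_R_pdb}",
        f"{BUCKET}/pdbs/{row.apo_L_pdb}",
    ]
    paths += [f"{BUCKET}/pdbs/{apo}" for apo in row.apo_R_pdbs.split(";") if apo]
    paths += [f"{BUCKET}/pdbs/{apo}" for apo in row.apo_L_pdbs.split(";") if apo]
    # Drop entries where the column was empty (no apo/predicted assigned)
    return [p for p in paths if not p.endswith("/pdbs/")]


# Hypothetical row for illustration (not a real pinder entry)
row = SimpleNamespace(
    id="1abc__A1_P11111--1abc__B1_P22222",
    holo_R_pdb="1abc__A1_P11111-R.pdb",
    holo_L_pdb="1abc__B1_P22222-L.pdb",
    predicted_R_pdb="af__P11111.pdb",
    predicted_L_pdb="",
    apo_R_pdb="2xyz__A1_P11111.pdb",
    apo_L_pdb="",
    apo_R_pdbs="2xyz__A1_P11111.pdb;3def__A1_P11111.pdb",
    apo_L_pdbs="",
)
paths = row_to_gcs_paths(row)
```

The returned list can then be handed to Gsutil's cp_paths (or any GCS client) to fetch just those files.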
> Also some broken files:
> ..../pdbs/3hr2__A3_P02454-R.pdb and ..../pdbs/3hr2__B2_P02466-R.pdb are unable to be loaded. The error is: '0 10.0' cannot be parsed into a number
Unfortunately this was a known issue with the original ingest on an older version of biotite where float columns were not bounded and could violate the PDB file format in some edge cases. Specifically, this meant that B-factor could not be parsed for a small subset of PDB files where the columns overflowed. While the B-factor for those is indeed corrupt, the structure itself should be loadable despite the error message.
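The failure mode can be illustrated with a short, self-contained sketch (this is not biotite's actual writer code, just a minimal demonstration of unbounded fixed-width columns):

```python
# Illustrative sketch (not biotite's actual code): PDB records use fixed
# 6-character columns for floats such as occupancy and B-factor. An unbounded
# f-string does NOT truncate an oversized value, so every later column shifts
# right and a fixed-column reader picks up a garbled fragment.
occupancy, bfactor, extra = 1.00, 12345.67, 7.5  # B-factor too wide for %6.2f
fields = f"{occupancy:6.2f}{bfactor:6.2f}{extra:6.2f}"
# The third column should start at index 12, but the overflow shifted it:
fragment = fields[12:18]
try:
    float(fragment)
except ValueError:
    print(f"{fragment!r} cannot be parsed into a number")  # mirrors the error above
```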
>>> index = get_index()
>>> pid = index.query('pdb_id == "3hr2" and holo_R_pdb == "3hr2__A3_P02454-R.pdb"').id.values[0]
>>> ps = PinderSystem(pid)
2024-08-06 15:26:25,696 | pinder.core.structure.atoms:68 | ERROR : Unable to parse /home/danielkovtun/.local/share/pinder/2024-02/pdbs/3hr2__A3_P02454--3hr2__B1_P02466.pdb! '0 10.0' cannot be parsed into a number
2024-08-06 15:26:25,725 | pinder.core.structure.atoms:68 | ERROR : Unable to parse /home/danielkovtun/.local/share/pinder/2024-02/pdbs/3hr2__A3_P02454-R.pdb! '0 10.0' cannot be parsed into a number
2024-08-06 15:26:25,762 | pinder.core.structure.atoms:68 | ERROR : Unable to parse /home/danielkovtun/.local/share/pinder/2024-02/pdbs/3hr2__B1_P02466-L.pdb! '0 10.0' cannot be parsed into a number
>>> ps.holo_receptor.atom_array.shape
(1054,)
>>> ps.holo_ligand.atom_array.shape
(1026,)
> Also some broken files:
> ..../pdbs/4lkf__D1_UNDEFINED-L.pdb is basically blank
Yeah, this one is definitely a strange structure. That being said, the pinder structure for asym_id (chain) D is correct: it's only one non-standard residue from a peptide in the original entry: https://www.rcsb.org/structure/4lkf
Interesting, thanks so much for your help (and the dataset). I think I have all that I need.
Where does the 47,717 unique apo structure number come from in the paper? I am trying to reproduce it, so I can run some metrics on just the apo structures, with the code below, and neither approach matches that 47k number.
The first gives 16,312 (expected 47,717).
The second gives 82,846 (although that query combines the AF2 structures, which had 42,827).