pinder-org / pinder

PINDER: The Protein INteraction Dataset and Evaluation Resource
https://pinder-org.github.io/pinder/
Apache License 2.0

Unique Apo Structures #8

Closed · ejmeitz closed this issue 2 months ago

ejmeitz commented 2 months ago

Where does the 47,717 unique apo structure number come from in the paper? I am trying to reproduce it with the code below so I can run some metrics on just the apo structures, but neither attempt matches the 47k number.

This gives 16,312 (expected 47,717):

from pinder.core import get_index

index = get_index()

all_apo = index.query(
    'apo_R or apo_L'
).reset_index(drop=True).loc[:, ['id', 'pdb_id', 'uniprot_R', 'uniprot_L', 'chain_R', 'chain_L']]

unique_apo = all_apo.drop_duplicates(subset=['uniprot_R', 'uniprot_L'])

This gives 82,846 (although this query also includes the AF2 predicted structures, which were reported as 42,827):

all_apo = index.query(
    'apo_R or apo_L or predicted_R or predicted_L'
).reset_index(drop=True).loc[:, ['id', 'pdb_id', 'uniprot_R', 'uniprot_L', 'chain_R', 'chain_L']]

unique_apo = all_apo.drop_duplicates(subset=['uniprot_R', 'uniprot_L'])
danielkovtun commented 2 months ago

The 47,717 corresponds to the total count of distinct PDB structures assigned as an apo monomer. There can be repeats across different pinder systems, as well as within a system when it is, e.g., a homodimer.

The apo structures are surfaced in the index via 4 different columns: the boolean flags apo_R / apo_L indicate whether an apo monomer exists for the receptor / ligand side, and the semicolon-separated apo_R_pdbs / apo_L_pdbs list every available apo monomer PDB for that side.

For AF2 structures it is similar: you can have one AF2 monomer repeated within a dimer (the uniprot of the receptor and ligand are the same) and repeated across dimers (potentially as the receptor in one dimer and the ligand in another).

To reproduce the counts in the paper, you can do something like this:

from tqdm import tqdm

from pinder.core import get_index

index = get_index()

apo_pdbs = set()
for pdb_str in tqdm(set(index.apo_R_pdbs).union(set(index.apo_L_pdbs))):
    # Split the full set of apo monomer PDBs in the string (separated by semicolons)
    apo_pdbs = apo_pdbs.union(set(pdb_str.split(";")))
# Remove the empty (no apo) string from the counts
apo_pdbs = apo_pdbs - {""}

# Up to one AF monomer per chain (unlike apo, where there can be multiple)
af_pdbs = set(index.predicted_R_pdb).union(set(index.predicted_L_pdb)) - {""}

print(len(apo_pdbs), len(af_pdbs))
# 47717 42827

The PinderSystem object provides a mechanism to load alternate apo structures other than the canonical one if desired: https://github.com/pinder-org/pinder/blob/8ad1ead7a174736635c13fa7266d9ca54cf9f44e/tests/core/loader/test_base.py#L104-L108

When the pdb codes are not provided to the PinderSystem constructor, the default canonical apo PDBs are used.
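For example, a minimal sketch of selecting a non-canonical apo monomer, assuming the constructor accepts the apo_receptor_pdb_code / apo_ligand_pdb_code keyword arguments exercised in the linked test, and that the code is the 4-character PDB ID of the chosen alternate:

from pinder.core import PinderSystem, get_index

index = get_index()
# Illustrative pick: any system with apo monomers on both sides
row = index.query("apo_R and apo_L").iloc[0]

# All available apo monomers for the receptor side, e.g. "3t83__A1_P00918.pdb;..."
print(row.apo_R_pdbs)

# Assumption: the code is the 4-character PDB ID of the chosen alternate monomer
alt_code = row.apo_R_pdbs.split(";")[0].split("__")[0]
ps = PinderSystem(row.id, apo_receptor_pdb_code=alt_code)
# apo_receptor (by analogy with holo_receptor used further down in this thread)
# should now be backed by the selected alternate apo monomer
ps.apo_receptor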

ejmeitz commented 2 months ago

Thank you

ejmeitz commented 2 months ago

Actually, related to this: is there no quick way to load these structures into an AtomArray? I want to run some filters over them. You can't make them into a PinderSystem since it's just the one structure. You can load them as an AtomArray with PinderSystem.load_structure, but that requires you to know the path.

The names returned by index.apo_R_pdbs aren't the actual file names as far as I can tell; I think they're just missing the R or the L suffix. I could try all name pairings and just see which file exists, but I'm wondering if there's a more elegant way to do this?

Edit: Nevermind, that question isn't actually a problem. However, I cannot find some of the pdbs generated by the above code. For example, 3t83__A1_P00918 does not appear in the pdbs or test_set_pdbs folders. I searched and loaded a system that had this as its apo, and it looks like it downloaded it from the cloud? Are all the PDBs not in the dataset download, or does it just mean my download was messed up somehow?

danielkovtun commented 2 months ago

However, I cannot find some of the pdbs generated by the above code. For example, 3t83__A1_P00918 does not appear in the pdbs or test_set_pdbs folders. I searched and loaded a system that had this as its apo, and it looks like it downloaded it from the cloud? Are all the PDBs not in the dataset download, or does it just mean my download was messed up somehow?

@ejmeitz valid question, and apologies for the confusion; I will add this to the documentation.

Currently, the archive containing PDB files (pdbs.zip) only contains PDB files associated with any of the "main" splits (train, val, test). Systems that are in the invalid split are not included, but can be downloaded "ad-hoc" either via PinderSystem or by downloading the full pdbs/ bucket path. This was done to reduce required disk space for the most common use-case where the invalid split is less likely to be needed.
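To check what that covers locally, a minimal sketch, assuming the split assignment lives in the index's split column:

from pinder.core import get_index

index = get_index()
# Only train/val/test PDBs ship in pdbs.zip; systems in the
# "invalid" split are fetched ad hoc on first access.
print(index.split.value_counts())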

As you discovered, when the files associated with a PinderSystem are not discovered in the local filesystem, they are automatically downloaded by this method: https://github.com/pinder-org/pinder/blob/8ad1ead7a174736635c13fa7266d9ca54cf9f44e/src/pinder-core/pinder/core/index/system.py#L512-L543

So you could get all of the apo PDBs loaded by creating a PinderSystem object associated with all of those apo PDBs, or you could just replicate what that download_entry method is doing but only for the apo PDBs of interest.

So, extending the previous example I shared, it would look something like this:

from pathlib import Path
from tqdm import tqdm

from pinder.core import get_pinder_location, get_index, get_pinder_bucket_root
from pinder.core.loader.structure import Structure 
from pinder.core.utils.cloud import Gsutil

index = get_index()

apo_pdbs = set()
for pdb_str in tqdm(set(index.apo_R_pdbs).union(set(index.apo_L_pdbs))):
    # Split full set of apo monomer PDBs in the string (separated by semi-colon)
    apo_pdbs = apo_pdbs.union(set(pdb_str.split(";")))
# remove empty (no apo) string from counts
apo_pdbs = apo_pdbs - {""}

# Download missing apo PDB files selectively
pdbs_path = get_pinder_location() / "pdbs"
missing_pdbs = [p for p in apo_pdbs if not (pdbs_path / p).is_file()]

gs = Gsutil()

sources = []
for p in missing_pdbs:
    # Bucket path to missing PDB file
    pdb_source = f"{get_pinder_bucket_root()}/pdbs/{p}"
    # Optional, but contains residue-level information 
    map_source = f"{get_pinder_bucket_root()}/mappings/{p.replace('.pdb', '.parquet')}"
    sources.extend([pdb_source, map_source])

if sources:
    anchor = get_pinder_bucket_root()
    gs.cp_paths(sources, get_pinder_location(), anchor)

# Now load any arbitrary apo structure (all of them are local after the download)
apo_pdb_paths = [pdbs_path / p for p in sorted(apo_pdbs)]
struct = Structure(apo_pdb_paths[0])
struct
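From there, struct.atom_array should give you the underlying biotite AtomArray to run your filters over (the same property used on the holo structures further down in this thread).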
ejmeitz commented 2 months ago

That's useful, but it doesn't quite seem true? For example, 7m4x__V1_A0A009I821--7m4x__DA1_B7I7A4 is in the train split, but it still downloaded pdbs when I did PinderSystem("7m4x__V1_A0A009I821--7m4x__DA1_B7I7A4"). For this system I was only interested in the holo structure, but maybe it was downloading the apo structures since the PinderSystem wants all of them?

I mainly ask because the downloading is really slow and makes it hard to get subsets of the data I'm interested in.

Also some broken files:

  • ..../pdbs/3hr2__A3_P02454-R.pdb and ..../pdbs/3hr2__B2_P02466-R.pdb are unable to be loaded. The error is: '0 10.0' cannot be parsed into a number
  • ..../pdbs/4lkf__D1_UNDEFINED-L.pdb is basically blank

danielkovtun commented 2 months ago

Hm, have you downloaded the full dataset via pinder_download? On my end, I wasn't able to reproduce any unexpected file downloads after running pinder_download from scratch.

Download:

pinder_download
2024-08-06 11:49:01,217 | pinder.core.index.utils:485 | INFO : Using /home/danielkovtun/.local/share as base directory for Pinder 2024-02
2024-08-06 11:49:01,218 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=3, items=3
Downloading 2024-02/mappings.zip: 100%|████| 6.74G/6.74G [01:10<00:00, 103MB/s]
Downloading 2024-02/pdbs.zip: 100%|████| 157G/157G [19:12<00:00, 147MB/s]
2024-08-06 12:08:13,836 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded:   1152.62s
Extracting /home/danielkovtun/.local/share/pinder/2024-02/pdbs.zip: 100%|████| 2729359/2729359 [2:25:52<00:00, 311.83it/s]
Extracting /home/danielkovtun/.local/share/pinder/2024-02/test_set_pdbs.zip: 100%|████| 3915/3915 [00:05<00:00, 692.93it/s]
Extracting /home/danielkovtun/.local/share/pinder/2024-02/mappings.zip: 100%|████| 1448951/1448951 [06:57<00:00, 3468.00it/s]
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pinder.core import PinderSystem
>>> ps = PinderSystem("7m4x__V1_A0A009I821--7m4x__DA1_B7I7A4")
>>> 

If you did run that command, perhaps it got interrupted before completion?

In any case, the code I linked above can be used to selectively download any individual PDB (whether holo / apo / predicted).

In the download_entry method linked above, row is a single row of the index dataframe. You can pre-filter index to contain your desired subsets of data and then produce the GCS bucket paths according to the above spec if you only want a subset of them. (Alternatively, if you have the space for it, run pinder_download and save future downloading time.)
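For instance, a minimal sketch of that pre-filtering, reusing the names from the example above (holo_L_pdb is assumed by symmetry with the holo_R_pdb column used later in this thread):

from pinder.core import get_index, get_pinder_bucket_root, get_pinder_location
from pinder.core.utils.cloud import Gsutil

index = get_index()
# Illustrative subset: train-split systems from a single PDB entry
subset = index.query('split == "train" and pdb_id == "7m4x"')

# Build bucket paths for just the holo PDBs of that subset
holo_pdbs = set(subset.holo_R_pdb).union(set(subset.holo_L_pdb)) - {""}
sources = [f"{get_pinder_bucket_root()}/pdbs/{p}" for p in holo_pdbs]

# Same cp_paths call as in the earlier example
Gsutil().cp_paths(sources, get_pinder_location(), get_pinder_bucket_root())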

Also some broken files:

  • ..../pdbs/3hr2__A3_P02454-R.pdb and ..../pdbs/3hr2__B2_P02466-R.pdb are unable to be loaded. The error is: '0 10.0' cannot be parsed into a number

Unfortunately, this was a known issue with the original ingest on an older version of biotite, where float columns were not bounded and could violate the PDB file format in some edge cases. Specifically, this meant that the B-factor could not be parsed for a small subset of PDB files where the columns overflowed. While the B-factor for those files is indeed corrupt, the structure itself should be loadable despite the error message.

>>> index = get_index()
>>> pid = index.query('pdb_id == "3hr2" and holo_R_pdb == "3hr2__A3_P02454-R.pdb"').id.values[0]
>>> ps = PinderSystem(pid)
2024-08-06 15:26:25,696 | pinder.core.structure.atoms:68 | ERROR : Unable to parse /home/danielkovtun/.local/share/pinder/2024-02/pdbs/3hr2__A3_P02454--3hr2__B1_P02466.pdb! '0 10.0' cannot be parsed into a number
2024-08-06 15:26:25,725 | pinder.core.structure.atoms:68 | ERROR : Unable to parse /home/danielkovtun/.local/share/pinder/2024-02/pdbs/3hr2__A3_P02454-R.pdb! '0 10.0' cannot be parsed into a number
2024-08-06 15:26:25,762 | pinder.core.structure.atoms:68 | ERROR : Unable to parse /home/danielkovtun/.local/share/pinder/2024-02/pdbs/3hr2__B1_P02466-L.pdb! '0 10.0' cannot be parsed into a number
>>> ps.holo_receptor.atom_array.shape
(1054,)
>>> ps.holo_ligand.atom_array.shape
(1026,)

Also some broken files:

  • ..../pdbs/4lkf__D1_UNDEFINED-L.pdb is basically blank

Yeah, this one is definitely a strange structure. That being said, the pinder structure for asym_id (chain) D is correct; it's only 1 non-standard residue from a peptide in the original entry: https://www.rcsb.org/structure/4lkf

ejmeitz commented 2 months ago

Interesting, thanks so much for your help (and the dataset). I think I have all that I need.