jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

SMILE parsing issue #27

Closed jyaacoub closed 1 year ago

jyaacoub commented 1 year ago

For the Platinum dataset (see #26), parsing the SMILES strings fails while the ligand graphs are being built:

Output:

Processing...
data/plat_mut/processed/XY.csv file found, using it to create the dataset
Number of codes: 1008
Creating protein graphs: 100%|██████████| 943/943 [00:00<00:00, 1987.76it/s]
Creating ligand graphs:   0%|          | 0/197 [00:00<?, ?it/s][22:55:34] SMILES Parse Error: syntax error while parsing: CCCC[C@@H](C(=O)N)NC(=O)[C@H](C)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](Cc1ccccc1)NC[C@H](CC(C)C)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CCCNC(=[
[22:55:34] SMILES Parse Error: Failed parsing SMILES 'CCCC[C@@H](C(=O)N)NC(=O)[C@H](C)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](Cc1ccccc1)NC[C@H](CC(C)C)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CCCNC(=[' for input: 'CCCC[C@@H](C(=O)N)NC(=O)[C@H](C)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](Cc1ccccc1)NC[C@H](CC(C)C)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CCCNC(=['
Creating ligand graphs:   1%|          | 2/197 [00:00<00:00, 388.15it/s]

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/jyaacoub/projects/MutDTA/run.py in line 3
      1 #%%
      2 from src.data_processing.datasets import PlatinumDataset
----> 3 PlatinumDataset('./data/plat', './data/plat')

File ~/projects/MutDTA/src/data_processing/datasets.py:494, in PlatinumDataset.__init__(self, save_root, data_root, aln_dir, cmap_threshold, feature_opt, mutated, *args, **kwargs)
    491 if aln_dir is not None:
    492     print('WARNING: aln_dir is not used for Platinum dataset, no support for MSA alignments.')
--> 494 super().__init__(save_root, data_root, None, cmap_threshold, 
    495                  feature_opt, *args, **kwargs)

File ~/projects/MutDTA/src/data_processing/datasets.py:65, in BaseDataset.__init__(self, save_root, data_root, aln_dir, cmap_threshold, feature_opt, *args, **kwargs)
     62 else:
     63     raise Exception("Invalid feature_opt please pick from nomsa, msa, shannon")
---> 65 super(BaseDataset, self).__init__(save_root, *args, **kwargs)
     66 self.load()

File ~/projects/MutDTA/.venv/lib/python3.10/site-packages/torch_geometric/data/in_memory_dataset.py:57, in InMemoryDataset.__init__(self, root, transform, pre_transform, pre_filter, log)
     49 def __init__(
     50     self,
     51     root: Optional[str] = None,
   (...)
     55     log: bool = True,
...
---> 34 atoms = mol.GetAtoms()
     35 features = np.zeros((len(atoms), 78))
     36 for i, atom in enumerate(atoms):

AttributeError: 'NoneType' object has no attribute 'GetAtoms'

jyaacoub commented 1 year ago

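The problem is that the full ligand was not included in the CSV file! The SMILES string is cut off (it ends in an unclosed "NC(=["), so RDKit cannot parse it and returns None instead of a molecule, which is what triggers the AttributeError on mol.GetAtoms().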

The ligand mentioned above has id 0Q4 and is found at https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/0Q4.

What is included in the csv is only the portion highlighted in the screenshot below:

image
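
To find which other rows in the CSV are affected, a cheap pre-check can flag SMILES strings that look truncated before they reach RDKit. The following is only a heuristic sketch using the standard library: it flags unbalanced parentheses or brackets and a trailing bond symbol, and it does not validate SMILES in general. The function names and the column names "lig_id" and "smiles" are hypothetical, not taken from the repository.

```python
import csv
import io

# Map each closing delimiter to the opener it must match.
_CLOSERS = {")": "(", "]": "["}


def looks_truncated(smiles: str) -> bool:
    """Heuristic check: True if the string is empty, has unbalanced
    parentheses/brackets, or ends on a double/triple bond symbol."""
    if not smiles:
        return True
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in _CLOSERS:
            # A closer with no matching opener also counts as malformed.
            if not stack or stack.pop() != _CLOSERS[ch]:
                return True
    # Leftover openers mean the string stopped before they were closed.
    return bool(stack) or smiles.endswith(("=", "#"))


def find_truncated(csv_text: str, id_col: str, smiles_col: str) -> list:
    """Return the ids of rows whose SMILES column looks truncated.
    Column names are passed in because they are assumptions here."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[id_col] for row in reader if looks_truncated(row[smiles_col])]


if __name__ == "__main__":
    # Hypothetical two-row CSV; the second SMILES is cut off like the one in the log.
    sample = "lig_id,smiles\nA,CC(=O)O\nB,CCCNC(=[\n"
    print(find_truncated(sample, "lig_id", "smiles"))
```

A string that passes this check can still be invalid SMILES, so the result of the RDKit parse should still be checked for None before calling GetAtoms().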

jyaacoub commented 1 year ago

The solution is to download the SDF file for each ligand ID and extract the full SMILES from it using the ProDy tool, instead of relying on the truncated strings in the CSV.

e.g.: https://github.com/MunibaFaiza/cheminformatics/blob/main/pdb_ligand_id-to-smi.ipynb
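
A minimal sketch of that idea is shown below. Note that it swaps in RDKit (which the ligand feature code in the traceback appears to use already) plus urllib rather than ProDy, and it is not the code from the linked notebook. The RCSB download URL pattern is an assumption that should be verified before use, and the function names are hypothetical.

```python
from urllib.request import urlopen

# ASSUMPTION: URL pattern for ideal-coordinate ligand SDF files on RCSB.
# Verify it against the RCSB file download documentation before relying on it.
RCSB_SDF_URL = "https://files.rcsb.org/ligands/download/{lig_id}_ideal.sdf"


def ligand_sdf_url(lig_id: str) -> str:
    """Build the (assumed) download URL for a ligand ID such as '0Q4'."""
    return RCSB_SDF_URL.format(lig_id=lig_id.strip().upper())


def fetch_ligand_smiles(lig_id: str, timeout: float = 30.0) -> str:
    """Download the ligand's SDF file and convert it to a SMILES string."""
    # Imported here so the URL helper works even where RDKit is not installed.
    from rdkit import Chem

    with urlopen(ligand_sdf_url(lig_id), timeout=timeout) as resp:
        sdf_text = resp.read().decode("utf-8")

    # Keep only the first record; "$$$$" separates records in an SDF file.
    mol_block = sdf_text.split("$$$$")[0]
    mol = Chem.MolFromMolBlock(mol_block)
    if mol is None:
        # Fail loudly here instead of hitting AttributeError later.
        raise ValueError(f"RDKit could not parse the SDF for ligand {lig_id!r}")
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    print(ligand_sdf_url("0Q4"))
    # Needs network access and RDKit:
    # print(fetch_ligand_smiles("0Q4"))
```

The SMILES returned this way could then replace the truncated entries in the CSV, keyed by ligand ID.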