Could you provide the data preprocessing code?

HwanheeKim813 commented 1 year ago

I am attempting to utilize your code on a different dataset, and I have encountered some challenges while preprocessing the data on my own. If it is not too much trouble, could you please share the preprocessing code with me?

vtarasv commented 1 year ago

The concrete code for data processing and feature engineering is currently a trade secret used by the RECEPTOR.AI company. Therefore, I cannot directly share it. Nevertheless, I will be happy to help and consult on particular issues you face during the preprocessing. What challenges are you currently facing?

pending1face commented 1 year ago

Would you talk about how to get Residue-level node features, please? I have got PDB file from af2 @vtarasv

vtarasv commented 1 year ago

Sure, here is the general strategy:

Install biopython package (e.g. pip install biopython) and additionally DSSP module (conda install -c salilab dssp).
Iterate through each residue in the PDB file using bipython (see corresponding documentation). To access solvent-accessible surface area for a particular residue see SASA module (installed within biopython package, computed on the residue level, the rest parameters are default). Phi angle, Psi angle, belonging to the secondary structure for each residue are accessible via the abovementioned DSSP (examples are in the documentation). AAPHY7 and BLOSUM62 descriptors for each amino acid can be found in the repository of GraphSol authors: https://github.com/jcchan23/GraphSol.git (Data/BLOSUM62_dim23.txt, Data/aa_phy7). The AAPHY7 and BLOSUM62 descriptors are static and don't depend on the particular PDB structure configuration as opposed to the other mentioned descriptors. Note: if you don't need to highlight a distinction between the phosphorylated/non-phosphorylated or mutated/non-mutated structures, you should not use the 'Phosphorylated' and 'Mutated' node features (from Table 4).

pending1face commented 1 year ago

Sure, here is the general strategy:

Install biopython package (e.g. pip install biopython) and additionally DSSP module (conda install -c salilab dssp).

Iterate through each residue in the PDB file using bipython (see corresponding documentation). To access solvent-accessible surface area for a particular residue see SASA module (installed within biopython package, computed on the residue level, the rest parameters are default). Phi angle, Psi angle, belonging to the secondary structure for each residue are accessible via the abovementioned DSSP (examples are in the documentation). AAPHY7 and BLOSUM62 descriptors for each amino acid can be found in the repository of GraphSol authors: https://github.com/jcchan23/GraphSol.git (Data/BLOSUM62_dim23.txt, Data/aa_phy7). The AAPHY7 and BLOSUM62 descriptors are static and don't depend on the particular PDB structure configuration as opposed to the other mentioned descriptors. Note: if you don't need to highlight a distinction between the phosphorylated/non-phosphorylated or mutated/non-mutated structures, you should not use the 'Phosphorylated' and 'Mutated' node features (from Table 4).

Thank you for replying. Recently, I am trying to processing mocular data with rdkit. But I can't find some node features below: Is atom hydrophobic (1 - yes, 0 - no) 1 Is atom metal (1 - yes, 0 - no) 1 Is atom halogen (1 - yes, 0 - no) 1 Is atom donor (1 - yes, 0 - no) 1 Is atom acceptor (1 - yes, 0 - no) 1 And what means '(scaled by min-max)' .Is min-max in an atom or a mocular or the the entire dataset?

Please rely to me, it's really important.

vtarasv commented 1 year ago

First, you define needed patterns and atomic numbers (the ones below were used in our work, mostly taken from Open Drug Discovery Toolkit)

from rdkit import Chem

acceptor_patterns = Chem.MolFromSmarts('[$([O;H1;v2]),'
                                       '$([O;H0;v2;!$(O=N-*),'
                                       '$([O;-;!$(*-N=O)]),'
                                       '$([o;+0])]),'
                                       '$([n;+0;!X3;!$([n;H1](cc)cc),'
                                       '$([$([N;H0]#[C&v4])]),'
                                       '$([N&v3;H0;$(Nc)])]),'
                                       '$([F;$(F-[#6]);!$(FC[F,Cl,Br,I])])]')
donor_patterns = Chem.MolFromSmarts('[$([N&!H0&v3,N&!H0&+1&v4,n&H1&+0,$([$([Nv3](-C)(-C)-C)]),'
                                    '$([$(n[n;H1]),'
                                    '$(nc[n;H1])])]),'
                                    '$([NX3,NX2]([!O,!S])!@C(!@[NX3,NX2]([!O,!S]))!@[NX3,NX2]([!O,!S])),'
                                    '$([O,S;H1;+0])]')
basic_patterns = Chem.MolFromSmarts('[$([N;H2&+0][$([C,a]);!$([C,a](=O))]),'
                                    '$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),'
                                    '$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))]),'
                                    '$([N,n;X2;+0])]')
acidic_patterns = Chem.MolFromSmarts('[CX3](=O)[OX1H0-,OX2H1]')

metal_anums = [3, 4, 11, 12, 13, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
               30, 31, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
               50, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
               69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
               87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101,
               102, 103]
halogens_anums = [9, 17, 35, 53]

Then, you iterate through each atom in the molecule and check if its number or index match corresponding feature:

mol = Chem.MolFromSmiles("CCCC")
donor_idxs = np.array(mol.GetSubstructMatches(donor_patterns, maxMatches=100_000)).flatten()
acceptor_idxs = np.array(mol.GetSubstructMatches(acceptor_patterns, maxMatches=100_000)).flatten()
basic_idxs = np.array(mol.GetSubstructMatches(basic_patterns, maxMatches=100_000)).flatten()
acidic_idxs = np.array(mol.GetSubstructMatches(acidic_patterns, maxMatches=100_000)).flatten()

for a in mol.GetAtoms():
    idx = a.GetIdx()
    anum = a.GetAtomicNum()
    hydrophobic = (anum == 6 and np.in1d([aneighbor.GetAtomicNum() for aneighbor in a.GetNeighbors()], [6, 1, 0]).all())
    metal = (anum in metal_anums)
    halogen = (anum in halogens_anums)
    donor = (idx in donor_idxs and anum != 6)
    acceptor = (idx in acceptor_idxs and anum != 6)
    plus = (idx in basic_idxs and anum != 6) or a.GetFormalCharge() > 0
    minus = (idx in acidic_idxs and anum != 6) or a.GetFormalCharge() < 0

The min-max scaling is based on the entire dataset. Therefore, you first need to generate unscaled features for all the molecules and then scale them by global min and max values for each feature.

pending1face commented 1 year ago

First, you define needed patterns and atomic numbers (the ones below were used in our work, mostly taken from Open Drug Discovery Toolkit)

from rdkit import Chem

acceptor_patterns = Chem.MolFromSmarts('[$([O;H1;v2]),'
                                       '$([O;H0;v2;!$(O=N-*),'
                                       '$([O;-;!$(*-N=O)]),'
                                       '$([o;+0])]),'
                                       '$([n;+0;!X3;!$([n;H1](cc)cc),'
                                       '$([$([N;H0]#[C&v4])]),'
                                       '$([N&v3;H0;$(Nc)])]),'
                                       '$([F;$(F-[#6]);!$(FC[F,Cl,Br,I])])]')
donor_patterns = Chem.MolFromSmarts('[$([N&!H0&v3,N&!H0&+1&v4,n&H1&+0,$([$([Nv3](-C)(-C)-C)]),'
                                    '$([$(n[n;H1]),'
                                    '$(nc[n;H1])])]),'
                                    '$([NX3,NX2]([!O,!S])!@C(!@[NX3,NX2]([!O,!S]))!@[NX3,NX2]([!O,!S])),'
                                    '$([O,S;H1;+0])]')
basic_patterns = Chem.MolFromSmarts('[$([N;H2&+0][$([C,a]);!$([C,a](=O))]),'
                                    '$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),'
                                    '$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))]),'
                                    '$([N,n;X2;+0])]')
acidic_patterns = Chem.MolFromSmarts('[CX3](=O)[OX1H0-,OX2H1]')

metal_anums = [3, 4, 11, 12, 13, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
               30, 31, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
               50, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
               69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
               87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101,
               102, 103]
halogens_anums = [9, 17, 35, 53]

Then, you iterate through each atom in the molecule and check if its number or index match corresponding feature:

mol = Chem.MolFromSmiles("CCCC")
donor_idxs = np.array(mol.GetSubstructMatches(donor_patterns, maxMatches=100_000)).flatten()
acceptor_idxs = np.array(mol.GetSubstructMatches(acceptor_patterns, maxMatches=100_000)).flatten()
basic_idxs = np.array(mol.GetSubstructMatches(basic_patterns, maxMatches=100_000)).flatten()
acidic_idxs = np.array(mol.GetSubstructMatches(acidic_patterns, maxMatches=100_000)).flatten()

for a in mol.GetAtoms():
    idx = a.GetIdx()
    anum = a.GetAtomicNum()
    hydrophobic = (anum == 6 and np.in1d([aneighbor.GetAtomicNum() for aneighbor in a.GetNeighbors()], [6, 1, 0]).all())
    metal = (anum in metal_anums)
    halogen = (anum in halogens_anums)
    donor = (idx in donor_idxs and anum != 6)
    acceptor = (idx in acceptor_idxs and anum != 6)
    plus = (idx in basic_idxs and anum != 6) or a.GetFormalCharge() > 0
    minus = (idx in acidic_idxs and anum != 6) or a.GetFormalCharge() < 0

The min-max scaling is based on the entire dataset. Therefore, you first need to generate unscaled features for all the molecules and then scale them by global min and max values for each feature.

Thank you very much. It's really helpful!

speakstone commented 1 year ago

数据处理和特征工程的具体代码目前是RECEPTOR.AI公司使用的商业秘密。因此，我无法直接分享。尽管如此，我很乐意为您在预处理过程中遇到的特定问题提供帮助和咨询。您目前面临哪些挑战？

The concrete code for data processing and feature engineering is currently a trade secret used by the RECEPTOR.AI company. Therefore, I cannot directly share it. Nevertheless, I will be happy to help and consult on particular issues you face during the preprocessing. What challenges are you currently facing?

I observed that you extracted the pdb features of prot_3dfor*.tar.gz to protein_to_graph.pkl. For a beginner, I don't understand how to convert the pdb file into this graph structure. Can you provide a small amount of conversion scripts for instruction while protecting your company's privacy? Thank you very much, looking forward to your reply.

vtarasv commented 1 year ago

https://docs.python.org/3/library/pickle.html#examples

The pickle (.pkl) format is just an easy way to save/load Python objects (serializing and de-serializing a Python object structure). See examples in the link above.

speakstone commented 1 year ago

https://docs.python.org/3/library/pickle.html#examples

pickle (.pkl) 格式只是保存/加载 Python 对象（序列化和反序列化 Python 对象结构）的简单方法。请参阅上面链接中的示例。

Sorry, my expression may be wrong, I am referring to how to convert the pdb file of molecular structure into amino acid features and amino acid connection information, so as to facilitate input into the graph convolutional network

shrimonmuke0202 commented 8 months ago

Can you give hint only the graph creation part of the data preprocessing it will help me very much

vtarasv / 3d-prot-dta

Could you provide the data preprocessing code? #2