Closed EasternCaveMan closed 10 months ago
Hi Roman, I hope you are doing well. I first got many of the WARNING messages shown below, and then the run crashed with a KeyError for ID1.
(sail) [vat23@wibi-mickey enzyme_substrate_data]$ datasail --e-type M --e-data data_or_mol.tsv --e-sim ecfp --f-type P --f-data All_sequences_or.fasta --f-sim cdhit --output split_C2 --techniques C2 --splits 0.8 0.2 --names train test --runs 3 --solver SCIP --inter interaction_or.csv
[02:47:48] WARNING: Proton(s) added/removed
. : : : : : :
[02:50:48] WARNING: Proton(s) added/removed
Traceback (most recent call last):
  File "/home/vat23/miniconda3/envs/sail/bin/datasail", line 11, in <module>
    sys.exit(sail())
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/sail.py", line 227, in sail
    datasail_main(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/routine.py", line 28, in datasail_main
    e_dataset, f_dataset, inter = read_data(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read.py", line 36, in read_data
    e_dataset = read_data_type(kwargs[KW_E_TYPE])(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read_molecules.py", line 80, in read_molecule_data
    dataset = remove_molecule_duplicates(dataset)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read_molecules.py", line 107, in remove_molecule_duplicates
    return remove_duplicate_values(dataset, valid_mols)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read_molecules.py", line 131, in remove_duplicate_values
    dataset.weights[dataset.id_map[name]] += dataset.weights[name]
KeyError: 'ID1'
Here is the structure of my input files. For molecules:
   ids  SMILES
0  ID0  CC(C)O
1  ID1  NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...
2  ID2  NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...
For proteins:
>ID0
ALREIRILGSFWGTTNDLDDVLKLVSEGKVKPVVRSAKLKELPEYIEKLRNNAYEGRVVFNP..........
>ID1
FYAQELQRAGAAVVVSLADADASVKVPAEWTTVNIKPKDSVSEVTFAVLSQLSDEGYL.........
>ID2
MDVPLPVEKLSYGSNTEDKTCVVLVATGSFNPPTFMHLRMFELARDELRSKGFHVLGGYMSPVNDAYKKKI........
For the interaction file:
     0    1
0  ID0  ID0
1  ID1  ID1
2  ID2  ID2
3  ID3  ID3
I checked the SMILES string for ID1. It is NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O, which is ATP, the most frequent molecule in my dataset. The same SMILES appears under many different IDs in the molecules file.
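Since the crash happens inside the duplicate-removal step, one possible workaround is to collapse the duplicate SMILES myself before calling DataSAIL and to rewrite the interaction file to match. The following is only a minimal sketch under several assumptions: the function names `collapse_duplicate_smiles` and `remap_interactions` are hypothetical and not part of DataSAIL, duplicates are detected by plain string equality (DataSAIL may compare molecules differently, for example after normalizing them), the first column of the interaction file is taken to be the molecule ID, and the toy data assumes ID2 has exactly the same SMILES as ID1, which the truncated listing above does not confirm.

```python
# Hypothetical pre-processing helpers; none of these names come from DataSAIL.

def collapse_duplicate_smiles(molecules):
    """Keep one ID per distinct SMILES string.

    molecules: list of (mol_id, smiles) pairs.
    Returns (unique, id_map), where unique holds the first pair seen for each
    SMILES and id_map sends every original ID to its representative ID.
    Assumption: exact string equality defines a duplicate.
    """
    first_id_for = {}  # SMILES -> first ID seen with that SMILES
    id_map = {}        # original ID -> representative ID
    unique = []
    for mol_id, smiles in molecules:
        if mol_id in id_map:
            continue  # skip an ID that is listed more than once
        representative = first_id_for.setdefault(smiles, mol_id)
        id_map[mol_id] = representative
        if representative == mol_id:
            unique.append((mol_id, smiles))
    return unique, id_map


def remap_interactions(interactions, id_map):
    """Replace each molecule ID by its representative.

    interactions: list of (mol_id, protein_id) pairs.
    Assumption: the molecule ID is the first element of each pair.
    """
    return [(id_map[mol_id], protein_id) for mol_id, protein_id in interactions]


# Toy data modeled on the sample above.
ATP = ("NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])"
       "OP(=O)([O-])[O-])[C@@H](O)[C@H]1O")
molecules = [("ID0", "CC(C)O"), ("ID1", ATP), ("ID2", ATP)]
interactions = [("ID0", "ID0"), ("ID1", "ID1"), ("ID2", "ID2")]

unique, id_map = collapse_duplicate_smiles(molecules)
remapped = remap_interactions(interactions, id_map)
print(unique)
print(remapped)
```

With this, the molecules file would contain ATP only once, and every interaction that used another ATP ID would point to the kept one. Whether this avoids the KeyError depends on how DataSAIL builds its internal ID map, which I have not verified.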