kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

KeyError #17

Closed EasternCaveMan closed 10 months ago

EasternCaveMan commented 10 months ago

Hi Roman, I hope you are doing well. First I got a lot of these warnings, and then the run crashed with an error related to ID1.

(sail) [vat23@wibi-mickey enzyme_substrate_data]$ datasail --e-type M --e-data data_or_mol.tsv --e-sim ecfp --f-type P --f-data All_sequences_or.fasta --f-sim cdhit --output split_C2 --techniques C2 --splits 0.8 0.2  --names train  test --runs 3 --solver SCIP --inter interaction_or.csv
[02:47:48] WARNING: Proton(s) added/removed
...

[02:50:48] WARNING: Proton(s) added/removed
Traceback (most recent call last):
  File "/home/vat23/miniconda3/envs/sail/bin/datasail", line 11, in <module>
    sys.exit(sail())
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/sail.py", line 227, in sail
    datasail_main(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/routine.py", line 28, in datasail_main
    e_dataset, f_dataset, inter = read_data(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read.py", line 36, in read_data
    e_dataset = read_data_type(kwargs[KW_E_TYPE])(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read_molecules.py", line 80, in read_molecule_data
    dataset = remove_molecule_duplicates(dataset)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read_molecules.py", line 107, in remove_molecule_duplicates
    return remove_duplicate_values(dataset, valid_mols)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/reader/read_molecules.py", line 131, in remove_duplicate_values
    dataset.weights[dataset.id_map[name]] += dataset.weights[name]
KeyError: 'ID1'

Here is the input structure. For molecules:

   ids                                             SMILES
0  ID0                                             CC(C)O
1  ID1  NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...
2  ID2  NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...

For proteins:

>ID0
ALREIRILGSFWGTTNDLDDVLKLVSEGKVKPVVRSAKLKELPEYIEKLRNNAYEGRVVFNP..........
>ID1
FYAQELQRAGAAVVVSLADADASVKVPAEWTTVNIKPKDSVSEVTFAVLSQLSDEGYL.........
>ID2
MDVPLPVEKLSYGSNTEDKTCVVLVATGSFNPPTFMHLRMFELARDELRSKGFHVLGGYMSPVNDAYKKKI........

For the interaction file:

    0    1
0  ID0  ID0
1  ID1  ID1
2  ID2  ID2
3  ID3  ID3

I checked the SMILES string for ID1. It is NC1NCNC2C1NCN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O, which is ATP, the most frequent molecule in my dataset. It appears under many different IDs in the molecules file.
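
To see which IDs share one molecule before running the split, a check like the following may help. It is a minimal sketch that assumes the molecules file is tab-separated with a header row naming the columns `ids` and `SMILES`, as in the printout above. The helper names `group_ids_by_smiles` and `duplicated` are hypothetical and not part of DataSAIL. The comparison is on the exact SMILES text, so two different spellings of the same molecule would not be grouped together.

```python
import csv
import io
from collections import defaultdict


def group_ids_by_smiles(tsv_text, id_col="ids", smiles_col="SMILES"):
    """Map each SMILES string to the list of IDs that carry it.

    Column names are assumptions taken from the printed table, not from DataSAIL.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[smiles_col]].append(row[id_col])
    return dict(groups)


def duplicated(groups):
    """Keep only the SMILES strings that are shared by more than one ID."""
    return {smiles: ids for smiles, ids in groups.items() if len(ids) > 1}


# Toy data for illustration only, not the contents of data_or_mol.tsv.
sample = "ids\tSMILES\nID0\tCC(C)O\nID1\tCCO\nID2\tCCO\n"
dups = duplicated(group_ids_by_smiles(sample))
print(dups)
```

On a real file, the text could be read with `open("data_or_mol.tsv").read()` and passed to `group_ids_by_smiles`. A long list of IDs under the ATP string would confirm that this molecule is heavily duplicated in the input.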