kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

TypeError: 'NoneType' object is not iterable for split method Identity-based double-cold split (I2) #15

Closed EasternCaveMan closed 4 months ago

EasternCaveMan commented 5 months ago

Hi Roman, I tried to split my data using the identity-based double-cold split (I2) method, but I got this error:

(sail) [vat23@wibi-mickey enzyme_substrate_data]$ ls
All_sequences.fasta                        molecule_data.tsv 
split_C2                                            split_R
(sail) [vat23@wibi-mickey enzyme_substrate_data]$ datasail --e-type M --e-data molecule_data.tsv --e-sim ecfp --f-type P --f-data All_sequences.fasta --f-sim cdhit --output split_I2 --techniques I2 --splits 0.8 0.2  --names train  test --runs 3 --solver SCIP
[23:51:30] SMILES Parse Error: syntax error while parsing: ID63558
[23:51:30] SMILES Parse Error: Failed parsing SMILES 'ID63558' for input: 'ID63558'
[23:51:30] SMILES Parse Error: syntax error while parsing: ID63559
[23:51:30] SMILES Parse Error: Failed parsing SMILES 'ID63559' for input: 'ID63559'
[23:51:30] SMILES Parse Error: syntax error while parsing: ID63560
[23:51:30] SMILES Parse Error: Failed parsing SMILES 'ID63560' for input: 'ID63560'
Traceback (most recent call last):
  File "/home/vat23/miniconda3/envs/sail/bin/datasail", line 11, in <module>
    sys.exit(sail())
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/sail.py", line 227, in sail
    datasail_main(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/routine.py", line 58, in datasail_main
    inter_split_map, e_name_split_map, f_name_split_map, e_cluster_split_map, f_cluster_split_map = run_solver(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/solver/solve.py", line 138, in run_solver
    inter=set(inter),
TypeError: 'NoneType' object is not iterable

Input structure of All_sequences.fasta:

>ID0
FFEGKNIFVTGGTGLLGKVLVEKILRSTPIGKIYVLVKADDQEAAVDRITKELINSELFRCLKEKHGKYYQAYIRETLIPIVGNICEPNLGMDSDSAHAIMEDVNVIIESAAITTLNERYDVSLEANVNSPQQLMRFAKTCKN
>ID1
MDPHNKGVAEAEFFTEYGEASRYEIQEVIGKGSYGIVGSVIDTHTGERVAIKKINDVFEHVSDATRILREIKKADP

Input structure of molecule_data.tsv:

   ids                                             SMILES
0  ID0  NC(=O)C1=CN(C=CC1)[C@@H]1O[C@H](COP(O)(=O)OP(O...
1  ID1  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
2  ID2  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
3  ID3  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
4  ID4   N[C@@H](CCC(=O)N[C@@H](CSCO)C(=O)NCC(O)=O)C(O)=O

I am looking forward to hearing from you.

Best,
Vahid

Old-Shatterhand commented 5 months ago

That's the same problem as in issue #13.
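
The thread does not spell out the cause, but the `SMILES Parse Error` lines in the log show that strings such as `ID63558` were handed to the SMILES parser. One possible reading is that some rows in `molecule_data.tsv` have a missing SMILES value or carry an ID in the SMILES column. The sketch below is a standard-library check for such rows. It is not part of DataSAIL; the function name `find_suspect_rows`, the two-column tab-separated layout with a header row, and the `ID<number>` naming pattern are all assumptions for illustration.

```python
import csv
import io


def find_suspect_rows(handle, id_prefix="ID"):
    """Return (line_number, row) pairs whose second column is empty,
    missing, or looks like an ID (prefix followed by digits).

    Assumes a tab-separated file with one header row and the
    SMILES string in the second column (hypothetical layout).
    """
    suspects = []
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for line_number, row in enumerate(reader, start=2):
        value = row[1].strip() if len(row) > 1 else ""
        looks_like_id = (
            value.startswith(id_prefix)
            and value[len(id_prefix):].isdigit()
        )
        if not value or looks_like_id:
            suspects.append((line_number, row))
    return suspects


# Small inline sample instead of the real file: row 3 repeats the ID
# in the SMILES column, row 4 has an empty SMILES field.
sample = "ids\tSMILES\nID0\tCCO\nID1\tID1\nID2\t\n"
suspects = find_suspect_rows(io.StringIO(sample))
for line_number, row in suspects:
    print(f"line {line_number}: {row}")
```

To run it against the real file, pass `open("molecule_data.tsv", newline="")` instead of the inline sample. Any rows it reports would be worth inspecting before rerunning `datasail`, though whether this matches the problem in issue #13 is not confirmed here.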