kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

Error with Random split method #13

Status: Closed (EasternCaveMan closed 7 months ago)

EasternCaveMan commented 8 months ago

Hi Roman, there is a problem with the random split technique (R):

(sail) [vat23@wibi-mickey enzyme_substrate_data]$ datasail --e-type M --e-data molecule_data.tsv --e-sim ecfp --f-type P --f-data All_sequences.fasta --f-sim cdhit --output split_R --techniques R --splits 0.8 0.2  --names train  test --runs 3 --solver SCIP
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63554' for input: 'ID63554'
[22:00:59] SMILES Parse Error: syntax error while parsing: ID63555
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63555' for input: 'ID63555'
[22:00:59] SMILES Parse Error: syntax error while parsing: ID63556
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63556' for input: 'ID63556'
[22:00:59] SMILES Parse Error: syntax error while parsing: ID63557
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63557' for input: 'ID63557'
[22:00:59] SMILES Parse Error: syntax error while parsing: ID63558
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63558' for input: 'ID63558'
[22:00:59] SMILES Parse Error: syntax error while parsing: ID63559
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63559' for input: 'ID63559'
[22:00:59] SMILES Parse Error: syntax error while parsing: ID63560
[22:00:59] SMILES Parse Error: Failed parsing SMILES 'ID63560' for input: 'ID63560'
Traceback (most recent call last):
  File "/home/vat23/miniconda3/envs/sail/bin/datasail", line 11, in <module>
    sys.exit(sail())
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/sail.py", line 227, in sail
    datasail_main(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/routine.py", line 58, in datasail_main
    inter_split_map, e_name_split_map, f_name_split_map, e_cluster_split_map, f_cluster_split_map = run_solver(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/solver/solve.py", line 88, in run_solver
    solution = sample_categorical(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/solver/utils.py", line 163, in sample_categorical
    np.random.shuffle(inter)
  File "numpy/random/mtrand.pyx", line 4589, in numpy.random.mtrand.RandomState.shuffle
TypeError: object of type 'NoneType' has no len()

input structure for All_sequences.fasta

>ID0
FFEGKNIFVTGGTGLLGKVLVEKILRSTPIGKIYVLVKADDQEAAVDRITKELINSELFRCLKEKHGKYYQAYIRETLIPIVGNICEPNLGMDSDSAHAIMEDVNVIIESAAITTLNERYDVSLEANVNSPQQLMRFAKTCKN
>ID1
MDPHNKGVAEAEFFTEYGEASRYEIQEVIGKGSYGIVGSVIDTHTGERVAIKKINDVFEHVSDATRILREIKKADP

input structure for molecule_data.tsv

   ids                                             SMILES
0  ID0  NC(=O)C1=CN(C=CC1)[C@@H]1O[C@H](COP(O)(=O)OP(O...
1  ID1  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
2  ID2  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
3  ID3  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
4  ID4   N[C@@H](CCC(=O)N[C@@H](CSCO)C(=O)NCC(O)=O)C(O)=O

Best, Vahid

Old-Shatterhand commented 8 months ago

You have to provide interactions via the --inter flag. Check out this example. Additionally, you can look into the CLI arguments.
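For context, the traceback above ends with np.random.shuffle(inter) failing on None, which is consistent with no interactions having been passed. Below is a minimal sketch of that failure mode with an explicit guard. It is not DataSAIL's code: the function name shuffle_interactions and its seed parameter are hypothetical, and the standard library's random module stands in for NumPy.

```python
import random
from typing import List, Optional, Tuple

Pair = Tuple[str, str]


def shuffle_interactions(inter: Optional[List[Pair]], seed: int = 0) -> List[Pair]:
    """Return a shuffled copy of the interactions, failing early if none were given."""
    if inter is None:
        # Without this check, shuffling None raises
        # "TypeError: object of type 'NoneType' has no len()".
        raise ValueError("no interactions given; pass an interaction file via --inter")
    shuffled = list(inter)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

A guard like this turns the opaque TypeError into a message that names the missing argument.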

EasternCaveMan commented 8 months ago

Hi @Old-Shatterhand, ah, thank you, I missed this part. What should the format of the interaction file be for the interactions produced by inter=[(x[0], x[0]) for x in df[["ids"]].values.tolist()]? TSV? The DataSAIL documentation does not state explicitly that it should be TSV.

Old-Shatterhand commented 8 months ago

CSV it is
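A minimal sketch of writing such an interaction file with Python's csv module follows. The helper names build_pairs and write_interactions are hypothetical, and the layout (comma-separated, one pair per line, no header row) is an assumption; this thread only confirms that the format is CSV, so check the CLI documentation for the exact expectations.

```python
import csv
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def build_pairs(ids: Iterable[str]) -> List[Pair]:
    # Mirrors the list comprehension quoted above: each ID is paired with itself.
    return [(i, i) for i in ids]


def write_interactions(pairs: Iterable[Pair], path: str) -> None:
    # Assumption: comma-separated, one pair per line, no header row.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(pairs)
```

For example, write_interactions(build_pairs(["ID0", "ID1"]), "interactions.csv") produces a file that could then be passed via --inter interactions.csv.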