kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

Fatal Error: not enough memory, please set -M option greater than 3363 #14

Closed EasternCaveMan closed 7 months ago

EasternCaveMan commented 8 months ago

Hi Roman, I tried to split my data with the cluster-based double-cold split (C2), but I got the error below. The output files are placed correctly.

(sail) [vat23@wibi-mickey enzyme_substrate_data]$ ls
All_sequences.fasta                        molecule_data.tsv 
split_C2                                            split_R
(sail) [vat23@wibi-mickey enzyme_substrate_data]$ datasail --e-type M --e-data molecule_data.tsv --e-sim ecfp --f-type P --f-data All_sequences.fasta --f-sim cdhit --output split_C2 --techniques C2 --splits 0.8 0.2  --names train  test --runs 3 --solver SCIP

[23:43:31] SMILES Parse Error: Failed parsing SMILES 'ID63559' for input: 'ID63559'
[23:43:31] SMILES Parse Error: syntax error while parsing: ID63560
[23:43:31] SMILES Parse Error: Failed parsing SMILES 'ID63560' for input: 'ID63560'
rm -rf cdhit_results && mkdir cdhit_results && cd cdhit_results && cd-hit -i ../All_sequences.fasta -o clusters -d 0 -T 128 -c 0.9 -n 5 -l 4  > /scratch/SCRATCH_SAS/vahid/ESP/data/enzyme_substrate_data/split_C2/logs/All_sequences_cdhit_c_0.9_n_5_l_4.log

Fatal Error:
not enough memory, please set -M option greater than 3363

Program halted !!

Traceback (most recent call last):
  File "/home/vat23/miniconda3/envs/sail/bin/datasail", line 11, in <module>
    sys.exit(sail())
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/sail.py", line 227, in sail
    datasail_main(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/routine.py", line 40, in datasail_main
    f_dataset = cluster(f_dataset, **kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/clustering.py", line 40, in cluster
    similarity_clustering(dataset, kwargs[KW_THREADS], kwargs[KW_LOGDIR])
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/clustering.py", line 110, in similarity_clustering
    cluster_names, cluster_map, cluster_sim = run_cdhit(dataset, threads, log_dir)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/cdhit.py", line 40, in run_cdhit
    return cluster_param_binary_search(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/utils.py", line 63, in cluster_param_binary_search
    cluster_names, cluster_map, cluster_sim = trial(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/cdhit.py", line 101, in cdhit_trial
    raise ValueError("Something went wrong with cd-hit. The output file does not exist.")
ValueError: Something went wrong with cd-hit. The output file does not exist.

input structure for All_sequences.fasta

>ID0
FFEGKNIFVTGGTGLLGKVLVEKILRSTPIGKIYVLVKADDQEAAVDRITKELINSELFRCLKEKHGKYYQAYIRETLIPIVGNICEPNLGMDSDSAHAIMEDVNVIIESAAITTLNERYDVSLEANVNSPQQLMRFAKTCKN
>ID1
MDPHNKGVAEAEFFTEYGEASRYEIQEVIGKGSYGIVGSVIDTHTGERVAIKKINDVFEHVSDATRILREIKKADP

input structure for molecule_data.tsv

   ids                                             SMILES
0  ID0  NC(=O)C1=CN(C=CC1)[C@@H]1O[C@H](COP(O)(=O)OP(O...
1  ID1  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
2  ID2  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
3  ID3  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
4  ID4   N[C@@H](CCC(=O)N[C@@H](CSCO)C(=O)NCC(O)=O)C(O)=O

I am looking forward to hearing from you.

Best,
Vahid

Old-Shatterhand commented 8 months ago

You can pass additional arguments to the clustering algorithm with --f-args "-M XYZ" (see here). In your case, XYZ needs to be greater than 3363, as the cd-hit error message says.
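As a rough sketch of that suggestion, the command from the report could be extended as shown here. cd-hit reads -M as a memory limit in MB. The value 4000 is only an assumption chosen to sit above the 3363 named in the error, and the variable names CDHIT_MEM_MB and CMD are made up for this example. The snippet prints the command instead of running it, so it can be checked before use.

```shell
# Memory limit handed to cd-hit via -M (in MB). The error asked for a value
# greater than 3363; 4000 is an illustrative choice, not a recommendation
# from the maintainers.
CDHIT_MEM_MB=4000

# Original command from the report plus --f-args. It is printed rather than
# executed; paste the printed line into the shell to run it.
CMD="datasail --e-type M --e-data molecule_data.tsv --e-sim ecfp \
--f-type P --f-data All_sequences.fasta --f-sim cdhit --f-args \"-M ${CDHIT_MEM_MB}\" \
--output split_C2 --techniques C2 --splits 0.8 0.2 --names train test --runs 3 --solver SCIP"
echo "$CMD"
```

If the machine has less free memory than the chosen limit, cd-hit may still fail, so the value should be picked with the available RAM in mind.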