Closed EasternCaveMan closed 9 months ago
Hi Roman, I tried to split my data with the cluster-based double-cold split (C2), but I got the error below. The output files are placed correctly.
(sail) [vat23@wibi-mickey enzyme_substrate_data]$ ls
All_sequences.fasta molecule_data.tsv split_C2 split_R
(sail) [vat23@wibi-mickey enzyme_substrate_data]$ datasail --e-type M --e-data molecule_data.tsv --e-sim ecfp --f-type P --f-data All_sequences.fasta --f-sim cdhit --output split_C2 --techniques C2 --splits 0.8 0.2 --names train test --runs 3 --solver SCIP
[23:43:31] SMILES Parse Error: Failed parsing SMILES 'ID63559' for input: 'ID63559'
[23:43:31] SMILES Parse Error: syntax error while parsing: ID63560
[23:43:31] SMILES Parse Error: Failed parsing SMILES 'ID63560' for input: 'ID63560'
rm -rf cdhit_results && mkdir cdhit_results && cd cdhit_results && cd-hit -i ../All_sequences.fasta -o clusters -d 0 -T 128 -c 0.9 -n 5 -l 4 > /scratch/SCRATCH_SAS/vahid/ESP/data/enzyme_substrate_data/split_C2/logs/All_sequences_cdhit_c_0.9_n_5_l_4.log
Fatal Error: not enough memory, please set -M option greater than 3363
Program halted !!
Traceback (most recent call last):
  File "/home/vat23/miniconda3/envs/sail/bin/datasail", line 11, in <module>
    sys.exit(sail())
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/sail.py", line 227, in sail
    datasail_main(**kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/routine.py", line 40, in datasail_main
    f_dataset = cluster(f_dataset, **kwargs)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/clustering.py", line 40, in cluster
    similarity_clustering(dataset, kwargs[KW_THREADS], kwargs[KW_LOGDIR])
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/clustering.py", line 110, in similarity_clustering
    cluster_names, cluster_map, cluster_sim = run_cdhit(dataset, threads, log_dir)
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/cdhit.py", line 40, in run_cdhit
    return cluster_param_binary_search(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/utils.py", line 63, in cluster_param_binary_search
    cluster_names, cluster_map, cluster_sim = trial(
  File "/home/vat23/miniconda3/envs/sail/lib/python3.10/site-packages/datasail/cluster/cdhit.py", line 101, in cdhit_trial
    raise ValueError("Something went wrong with cd-hit. The output file does not exist.")
ValueError: Something went wrong with cd-hit. The output file does not exist.
Input structure for All_sequences.fasta:
>ID0
FFEGKNIFVTGGTGLLGKVLVEKILRSTPIGKIYVLVKADDQEAAVDRITKELINSELFRCLKEKHGKYYQAYIRETLIPIVGNICEPNLGMDSDSAHAIMEDVNVIIESAAITTLNERYDVSLEANVNSPQQLMRFAKTCKN
>ID1
MDPHNKGVAEAEFFTEYGEASRYEIQEVIGKGSYGIVGSVIDTHTGERVAIKKINDVFEHVSDATRILREIKKADP
Input structure for molecule_data.tsv:
   ids  SMILES
0  ID0  NC(=O)C1=CN(C=CC1)[C@@H]1O[C@H](COP(O)(=O)OP(O...
1  ID1  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
2  ID2  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
3  ID3  NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...
4  ID4  N[C@@H](CCC(=O)N[C@@H](CSCO)C(=O)NCC(O)=O)C(O)=O
I am looking forward to hearing from you.
Best,
Vahid
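A side note on the SMILES parse errors in the log: they show ID strings such as 'ID63559' being parsed as SMILES. One possible cause, and this is only an assumption since the raw file is not shown, is that molecule_data.tsv was written with an extra leading index column, so the ID column ends up where the SMILES column is expected. The sketch below rewrites such a file to two columns (ID first, SMILES second). The function name, the default column names, and the sample rows are illustrative and not part of DataSAIL.

```python
import csv
import io


def keep_id_and_smiles(tsv_text: str, id_col: str = "ids", smiles_col: str = "SMILES") -> str:
    """Return a TSV that contains only the ID and SMILES columns, in that order.

    Hypothetical helper: it assumes the input has a header row that names
    both columns; any other column (e.g. a saved index) is dropped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_col, smiles_col])
    for row in reader:
        writer.writerow([row[id_col], row[smiles_col]])
    return out.getvalue()


# Placeholder data with an unnamed index column in front (not the real file).
sample = "\tids\tSMILES\n0\tID0\tCCO\n1\tID1\tC\n"
cleaned = keep_id_and_smiles(sample)
print(cleaned)
```

If the file already has exactly two columns, the errors have a different cause and this step changes nothing.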
You can pass additional arguments to the clustering algorithm with --f-args "-M XYZ" (see here).
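As a rough illustration of that suggestion, the sketch below assembles the original command with an added --f-args value. The number 4000 is an assumed placeholder, chosen only because the cd-hit message asks for an -M value greater than 3363; pick a limit that fits your machine. The helper name is hypothetical, and the snippet only prints the command rather than running it.

```python
import shlex


def build_datasail_command(cdhit_memory: int) -> list[str]:
    """Return the argv of the C2 split from the report, plus a cd-hit -M value.

    Hypothetical helper: every flag except --f-args is copied from the
    command in the original post.
    """
    return [
        "datasail",
        "--e-type", "M", "--e-data", "molecule_data.tsv", "--e-sim", "ecfp",
        "--f-type", "P", "--f-data", "All_sequences.fasta", "--f-sim", "cdhit",
        # Forward the memory limit to cd-hit as one quoted argument.
        "--f-args", f"-M {cdhit_memory}",
        "--output", "split_C2", "--techniques", "C2",
        "--splits", "0.8", "0.2", "--names", "train", "test",
        "--runs", "3", "--solver", "SCIP",
    ]


command = build_datasail_command(4000)  # assumed value above 3363
print(shlex.join(command))
```

Keeping "-M 4000" as a single list element matters: it is the equivalent of the quotes in --f-args "-M XYZ", so the value reaches DataSAIL as one string.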