phglab / ALFATClust

Biological sequence clustering tool with dynamic threshold
GNU General Public License v3.0

unknown format fasta #1

Closed: mmpust closed this issue 2 years ago

mmpust commented 2 years ago

I am trying to run ALFATClust with a DNA FASTA input file (FASTA.fa):

>k141_4699_1 # 23 # 301 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.573
CAGGATGAGCGCAGCGGCGAGGTGCTCATGCTCGGCTATGCGAACGAGCAGGCGCTTCAGCTGACGATGG
ATACCGGCACGGCGTGGTTTTTCTCGCGCTCGCGCCAGAAGCTGTGGAACAAGGGCGAGACCTCGGGCAA
TTTCATTTTTGTGAAGAAGATCCTGTCCGACTGTGACGATGATACGCTGATCTATGTCGGCACGCCCAAG
GGTCCGGTCTGCCACACAGGCCACCGCACCTGCTTTTTCACGACGCTGTGGGAAAAAGACGAGAAGTAA
>k141_0_1 # 3 # 371 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.472
GTGTCCCATGACATTAAGACTCCACTGACATCGATCATCAACTATGTGGATCTGCTGGAAAAAGAAGAAC
TGCACAACGAGACAGCGCAGGAGTATTTAGAGGTGTTAGAGCGCCAGTCAAGCCGATTGAAAAAGCTGAT
CGAAGACCTGATCGAGGCTTCCAAGGCGTCCACCGGAAACCTTCCGGTACATTTAGAGCGGTTAGAAGCC
GGGATATTTATGACACAGACGGTCGGGGAATTTGAGGAAAAGACAAAAGAGGCAGGACTTGATCTTGTGA
TCGAAAAGCCGGAGACACCGGTCTATATCATGGCGGACAGCAGACATTTCTGGCGTGTGATCGATAACCT
GATGAATAATATCTGCAAA
>k141_23495_1 # 3 # 242 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.388
TCTCCCAGTTATCTGTCCAGAGTATTCAAACAGAATCTTGGAGTATCCATAAGTGATTACATCCGCGAGA
AAAAAATCGAAAAAGCCACTCACCTGCTCCGATACTCTGATAAAGCAGTAGTTGATATCGCCAATTATCT
GAATTTCTCTTCACAGAGTCATTTTATCCAGATCTTTGAAAACTTCACCGGCCTTACCCCAAAGAAGTAC
CGTGACAAATATTACAAATCAATGTGGTAA

I get the following error:

Submit FASTQ files for pre-processing
Submit individual protein predictions for protein clustering
---------------------------------------------
Estimated similarity range = [0.95, 0.75]
Estimated similarity step size = 0.025
Default DNA k-mer size = 17
Default protein k-mer size = 9
Default DNA sketch size = 2000
Default protein sketch size = 2000
Min. estimated similarity considered = 0.55
No. of threads = 16
---------------------------------------------

Validating input sequence file 'FASTA.fa'...
Pre-clustering sequences into subsets...
32513 individual subsets to be clustered

Process aborted due to error occurred: Unknown format fasta

What is the problem? Thanks, Marie

mmpust commented 2 years ago

Okay, I solved the error. You have to provide the full path, including the file name, for both the -i and -o parameters. Like this:

-i /home/input/path/FASTA.fa

It might help to mention this in the parameter description.
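For anyone launching the tool from a script, a minimal sketch of the same workaround is to resolve both paths to absolute form before building the command line. The helper name `build_alfatclust_command` is hypothetical and not part of ALFATClust; only the `-i` and `-o` flags are taken from this thread.

```python
import os
import sys


def build_alfatclust_command(script, input_fasta, output_file):
    """Return an argument list with absolute -i and -o paths.

    Hypothetical helper: it only prepares the command, it does not run it.
    """
    in_path = os.path.abspath(os.path.expanduser(input_fasta))
    out_path = os.path.abspath(os.path.expanduser(output_file))

    # Fail early if the input file is missing or the output folder does not exist.
    if not os.path.isfile(in_path):
        raise FileNotFoundError(in_path)
    if not os.path.isdir(os.path.dirname(out_path)):
        raise FileNotFoundError(os.path.dirname(out_path))

    return [sys.executable, script, "-i", in_path, "-o", out_path]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids any dependence on the current working directory.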

Thanks for developing the tool!

Marie

mmpust commented 2 years ago

If I now add additional (optional) parameters, the same error message appears:

python /home/programs/ALFATClust/main/alfatclust.py \
 -i /home/seq_file_path/fasta.fa \
 -o /home/output/fasta_clust.fa \
 --seed 1 \
 --evaluate /home/evaluate/evaluate.csv

Error message:

---------------------------------------------
Estimated similarity range = [0.95, 0.75]
Estimated similarity step size = 0.025
Default DNA k-mer size = 17
Default protein k-mer size = 9
Default DNA sketch size = 2000
Default protein sketch size = 2000
Min. estimated similarity considered = 0.55
No. of threads = 16
---------------------------------------------

Validating input sequence file '/home/seq_file_path/fasta.fa' ..
Pre-clustering sequences into subsets...
38526 individual subsets to be clustered

Process aborted due to error occurred: Unknown format fasta

Do you have any idea why this is happening? If I run the same command without the --evaluate parameter, it runs fine and the final output is:

---------------------------------------------
Estimated similarity range = [0.95, 0.75]
Estimated similarity step size = 0.025
Default DNA k-mer size = 17
Default protein k-mer size = 9
Default DNA sketch size = 2000
Default protein sketch size = 2000
Min. estimated similarity considered = 0.55
No. of threads = 16
---------------------------------------------

Validating input sequence file '/home/seq_file_path/fasta.fa' ..
Pre-clustering sequences into subsets...
38539 individual subsets to be clustered
38539 / 38539 subsets processed
Process completed. No. of sequence clusters = 45856

Output file:

#Cluster 1
k141_14861_1 # 2 # 136 # -1 # ID=5400_1;partial=10;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.407
#Cluster 2
k141_33498_1 # 1 # 384 # -1 # ID=5402_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.490
#Cluster 3
k141_33499_1 # 1 # 393 # 1 # ID=5403_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.412

But where are the clustered sequences?
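Judging only from the excerpt above, the output file lists sequence headers under `#Cluster N` lines rather than the sequences themselves. One way to get the sequences per cluster is to join that file back to the input FASTA. This is a rough sketch under that assumption: the function names are hypothetical, and records are matched on the first whitespace-delimited token of each header (e.g. `k141_14861_1`).

```python
def read_fasta(path):
    """Map sequence ID (first token of the header) to its sequence."""
    records, seq_id, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    records[seq_id] = "".join(chunks)
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records


def read_clusters(path):
    """Map cluster number to the list of sequence IDs listed under it."""
    clusters, current = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#Cluster"):
                current = int(line.split()[-1])
                clusters[current] = []
            elif current is not None:
                clusters[current].append(line.split()[0])
    return clusters


def write_clustered_fasta(clusters, records, out_path):
    """Write every clustered sequence with its cluster number in the header."""
    with open(out_path, "w") as out:
        for number, seq_ids in clusters.items():
            for seq_id in seq_ids:
                if seq_id in records:  # skip IDs missing from the FASTA
                    out.write(f">{seq_id} cluster={number}\n{records[seq_id]}\n")
```

Calling `write_clustered_fasta(read_clusters("fasta_clust.fa"), read_fasta("fasta.fa"), "clustered.fa")` would then produce a FASTA file annotated with cluster numbers; the file names here are placeholders.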

jimmykhchiu commented 2 years ago

Hi! Since you run ALFATClust in conda, we have added a conda environment file so that users can easily create their own environment without worrying about package compatibility. We also updated the sequence cluster evaluation module to fix an error that may occur in some scenarios.

We recommend that you create your conda environment using our environment file. Please let us know if the problem persists after the update. Thanks.