[Question] ALFATClust stuck at subsets processed stage for multiple days

YX-Xiang commented 1 year ago

Hi, thanks as always for the nice tool!

I have clustered a real sequence set using ALFATClust. But now, when I run ./alfatclust.py -i ${fasta_file} -o ${alfat_dir}/ALFATclu_$T.fasta -p -l $T -t 1 > log , the program stays stuck at the first "subsets processed" stage for multiple days. When I use -t 32 instead, the program processes some subsets at the beginning but still stalls partway through.

So far I have only encountered this problem with the following two small datasets. Here are the download links: https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/swissprot.tar.gz https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/pdbaa.gz To speed up the run, I removed every sequence longer than 200 from each set. After this filtering, the swissprot dataset contains 64264 sequences and the pdbaa dataset contains 138654 sequences.
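The length filter described above could be sketched as follows. This is only an illustration, not the script that was actually used: the function names `read_fasta` and `filter_by_length` are hypothetical, and it assumes that "longer than 200" refers to the number of residues in a sequence.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     max_len: int = 200) -> List[Tuple[str, str]]:
    """Keep only records whose sequence has at most max_len residues (assumed cutoff)."""
    return [(h, s) for h, s in records if len(s) <= max_len]


example = [">a some description", "MKV", "LLA", ">b", "M" * 201]
kept = filter_by_length(read_fasta(example))
```

With a real file, the same functions can be applied to an open file handle, for example `filter_by_length(read_fasta(open(path)))`.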

Is this behaviour expected, or is it a problem with the datasets?

Thanks in advance.

jimmykhchiu commented 1 year ago

Thanks for reporting the problem you are experiencing, as this helps improve ALFATClust. May I know the value(s) of T ($T) you are using? I am going to examine this issue in detail.

YX-Xiang commented 1 year ago

I need to use the thresholds 0.5, 0.6, 0.7, 0.8 and 0.9 in turn, but the program is currently stalled at 0.5 and the remaining thresholds have not been run yet.

jimmykhchiu commented 1 year ago

I recommend starting from T = 0.9. The reason is that each of these datasets is too large (in terms of space requirements) for a single pairwise sequence similarity matrix M (of dimensions N x N, where N is the number of sequences) to be constructed. ALFATClust therefore first runs MMseqs2 to break (i.e. pre-cluster) the input dataset into smaller subsets, and the cutoff threshold passed to MMseqs2 is T. As a result, the lower the value of T, the larger the subsets tend to be, and some of them may still be too large for ALFATClust to process efficiently. Separately, I am not sure whether T = 0.5 can deliver biologically significant clusters.
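To give a feel for the space requirement mentioned above, here is a rough back-of-the-envelope sketch. It assumes a dense N x N matrix with 4 bytes per entry; the actual storage format used by ALFATClust is not stated in this thread, so the numbers are only indicative. The function name `dense_matrix_gib` is made up for this example.

```python
def dense_matrix_gib(n_sequences: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense n x n matrix (bytes_per_value is an assumption)."""
    return n_sequences ** 2 * bytes_per_value / 2 ** 30


# Sequence counts are the ones reported earlier in this thread.
for name, n in [("swissprot", 64264), ("pdbaa", 138654)]:
    print(f"{name}: about {dense_matrix_gib(n):.1f} GiB")
```

Under these assumptions the two datasets would need roughly 15 GiB and 72 GiB respectively, which illustrates why the input is split into subsets first.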

As you try values of T from 0.9 downwards, you may observe that the running time increases non-linearly with decreasing T. If you know how to run MMseqs2 (available in the Conda or Docker environment for ALFATClust), you can run it standalone using T as the cutoff; the MMseqs2 cluster sizes are simply the sizes of the subsets that ALFATClust processes.
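One possible way to inspect the subset sizes is sketched below. It assumes the standalone clustering result has been exported as a two-column, tab-separated table (representative ID, member ID), which is the layout MMseqs2 uses for its cluster TSV output. The function name `cluster_sizes` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def cluster_sizes(tsv_lines: Iterable[str]) -> Counter:
    """Count members per cluster from 'representative<TAB>member' lines."""
    sizes = Counter()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes[fields[0]] += 1
    return sizes


example = ["r1\tr1", "r1\ts2", "r2\tr2", ""]
sizes = cluster_sizes(example)
largest = sizes.most_common(1)
```

Sorting the counts with `most_common()` shows quickly whether any subset is unusually large at a given T.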

YX-Xiang commented 1 year ago

Thank you for your advice.

I have now tried 0.9 as the threshold first, and the program is still stuck on the first subset. I also ran Linclust on its own and obtained complete results for every threshold from 0.5 to 0.9. Looking at the clusters Linclust produced at the 0.9 threshold, I found no oversized clusters. This struck me as very odd.

jimmykhchiu commented 1 year ago

Okay, then I will check it with the datasets.

jimmykhchiu commented 1 year ago

Hi @CuteYisin,

After testing with both the PDB and swissprot sequences, I believe the main cause of the problem is the multiple whitespace characters in the sequence headers, which can raise an error during sequence similarity estimation. In addition, some sequences cannot be accepted for clustering in their current state. The tool has been enhanced to report all of these issues. More importantly, a sequence pre-processing workflow has been developed to filter out sequences that cannot be clustered and to eliminate the whitespace issue. You may git pull the repository again to update the tools (if you are running within Docker, you also need to rebuild the image and create a new container) and get the latest documentation. In short, you only need to run the following (using PDB as the example):

./filter_seqs.py -i pdbaa -o pdbaa.pass.fa

and then

./replace_seq_header_spaces.py -i pdbaa.pass.fa -o pdbaa.pass.no_whitespace.fa

You can then run the tool on the file pdbaa.pass.no_whitespace.fa. The clustering takes ~15 mins to complete on my 8-core Intel-based MacBook (without generating the cluster report). You may of course find that some sequences are excluded after running the first command, and most of them are difficult to "correct". Nevertheless, this saves you the time of guessing which sequences need to be removed.
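For readers curious about what the header clean-up amounts to, the idea can be sketched as below. This is not the code of replace_seq_header_spaces.py, whose exact behaviour is not shown in this thread. The function name and the choice of "_" as the replacement character are assumptions made purely for illustration.

```python
import re
from typing import Iterable, Iterator


def replace_header_whitespace(lines: Iterable[str],
                              replacement: str = "_") -> Iterator[str]:
    """Collapse each whitespace run in FASTA header lines into `replacement`."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Only headers are rewritten; sequence lines pass through unchanged.
            yield ">" + re.sub(r"\s+", replacement, line[1:].strip())
        else:
            yield line


cleaned = list(replace_header_whitespace([">sp|P1|X  some   protein", "MKV"]))
```

In this sketch a header such as ">sp|P1|X  some   protein" becomes ">sp|P1|X_some_protein", so downstream tools see a single whitespace-free identifier.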

Hope you can proceed with your task after this update.

YX-Xiang commented 1 year ago

I have tested the updated tool and can confirm that it now works as intended. Your guidance on how to use the new version was also very clear and helpful!

phglab / ALFATClust, issue #7