strainscan_build stuck after processing a cluster

enzovalentino commented 2 months ago

Hi, thanks for writing StrainScan, it's very useful!

However, I'm having some problems during the strainscan_build step. I want to build the database from all 358 NCBI RefSeq genomes of Lacticaseibacillus rhamnosus. I'm using Ubuntu 22.04 (running under Windows Subsystem for Linux 2 on a Dell machine with Windows 11), with ~25.6 GB of memory allocated. I installed StrainScan with conda: conda create -n strainscan -c bioconda strainscan=1.0.14 (the latest version).

The command I'm trying to use for the building step is: strainscan_build -i rhamnosus_genomes/ -o strainscan_rhamnosus_dir/ -k 31 -t 6 -u 100000 -e 1

The command starts running well: it detects 92 clusters and then extracts k-mers from most of them. However, after correctly processing one cluster (specifically, the cluster named C5), it gets stuck. It does not start processing the next cluster, and in 'htop' I see that the processes are in state S (sleeping), no longer running. I also left the computer running overnight, but nothing changed: it still had not started processing the next cluster.
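State S in htop means interruptible sleep: the process is waiting for an event rather than using CPU. The following is a minimal sketch, assuming a Linux /proc filesystem (as under WSL 2), that prints the state and resident memory of every process whose command line contains a given fragment. The function names are illustrative and not part of StrainScan.

```python
import os
import sys
from pathlib import Path


def parse_proc_status(text):
    """Extract Name, State and VmRSS from the text of /proc/<pid>/status."""
    wanted = {"Name", "State", "VmRSS"}
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            fields[key] = value.strip()
    return fields


def find_processes(name_fragment):
    """Yield (pid, fields) for processes whose command line contains name_fragment."""
    for entry in Path("/proc").iterdir():
        # Skip non-process entries and this script itself.
        if not entry.name.isdigit() or int(entry.name) == os.getpid():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
            if name_fragment in cmdline:
                status = (entry / "status").read_text()
                yield int(entry.name), parse_proc_status(status)
        except OSError:
            continue  # the process exited or is not readable


if __name__ == "__main__" and len(sys.argv) > 1 and Path("/proc").is_dir():
    for pid, fields in find_processes(sys.argv[1]):
        print(pid, fields.get("State", "?"), fields.get("VmRSS", "n/a"))
```

Saved under a name of your choice, it could be run with strainscan_build as the argument while the build appears stalled. If every process is sleeping and VmRSS no longer changes between runs, the build is waiting on something rather than computing.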

I also tried several times running the same command, but it always gets stuck after processing cluster C5.

Any guesses on how to solve this?

Thank you!!

liaoherui commented 2 months ago

Hi, thanks for using StrainScan!

May I know the size of C5? If it's a large cluster (containing many strains), then this problem can be caused by a lack of memory.

enzovalentino commented 2 months ago

Actually, k-mer extraction from cluster C5 always works well. Checking the folder Kmer_Sets_L2/Kmer_Sets, I see 7 empty folders (I think these are the clusters not yet processed):

C26 (31 genomes),
C30 (15 genomes),
C43 (15 genomes),
C46 (49 genomes),
C55 (11 genomes),
C82 (21 genomes),
C86 (76 genomes).

Other folders with several genomes (such as C73, which has 18 genomes) were processed successfully. I also attach the txt file with the sizes of all the clusters: hclsMap_95_recls.txt
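The check for empty cluster folders described above can be repeated after each attempt with a small script. This is a minimal sketch: the function name is illustrative, and the folder to inspect is passed on the command line because its full path under the output directory is not given in this thread.

```python
import sys
from pathlib import Path


def empty_cluster_dirs(kmer_sets_dir):
    """Return the sorted names of subdirectories that contain no entries."""
    root = Path(kmer_sets_dir)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not any(d.iterdir())
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the Kmer_Sets folder as the first argument.
    for name in empty_cluster_dirs(sys.argv[1]):
        print(name)
```

An empty folder only shows that nothing has been written for that cluster yet; it does not say whether the cluster was skipped, is still pending, or failed.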

However, the strange thing is that I don't get a "Killed" message (as happens in other cases); the process simply stops making progress without giving any warning.
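To see how much memory and swap the Linux instance has left while the build appears stalled, /proc/meminfo can be read directly. This is a minimal sketch, assuming a Linux /proc filesystem; the helper names are illustrative, and low available memory alone would not prove that memory is the cause.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_gib(info):
    """Return MemAvailable plus SwapFree, converted from kB to GiB."""
    kb = info.get("MemAvailable", 0) + info.get("SwapFree", 0)
    return kb / 1024**2


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        meminfo = parse_meminfo(fh.read())
    print(f"MemTotal: {meminfo.get('MemTotal', 0) / 1024**2:.1f} GiB")
    print(f"Available (RAM + free swap): {available_gib(meminfo):.1f} GiB")
```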

What's your idea on how to solve this issue? Thanks

liaoherui commented 2 months ago

It looks like C5 is a cluster containing only 4 strains, so it should not require a lot of memory.

It's a little hard to identify the cause of this problem without the data or a log. Could you please send us the input reference genomes for testing (via email - heruiliao2-c@my.cityu.edu.hk - or any other way you prefer)? Then we can debug the program asap and let you know the possible solution. Thanks!

enzovalentino commented 2 months ago

Thank you so much! I just sent you the genomes via mail

liaoherui commented 2 months ago

Hi, I have built the database successfully on our server (with 100G of memory available).

You can download the database via this link.

As the attached log file shows, the construction process requires about 29G of memory, which is more than the ~25.6 GB available on your PC and is why it cannot finish the construction. strainscan.log.txt
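How the 29G figure in the log was measured is not stated in this thread. One way to get a comparable number on another machine is to run the build through a small wrapper and read the peak resident set size afterwards. This is a minimal sketch, assuming Linux (where ru_maxrss is reported in kilobytes); the function name is illustrative.

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd and return (exit_code, peak_rss_gib) for the waited-for children."""
    completed = subprocess.run(cmd)
    # ru_maxrss is the peak of the largest single child process, in kB on
    # Linux. It is not the sum over parallel workers, so for a multi-threaded
    # or multi-process build it is a lower bound on total memory use.
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak_kb / 1024**2


if __name__ == "__main__" and len(sys.argv) > 1:
    code, peak = run_and_report_peak(sys.argv[1:])
    print(f"exit code {code}, peak RSS of largest child: {peak:.2f} GiB")
```

The wrapper takes the full strainscan_build command line as its arguments. Because the value covers only the largest process, a run with several worker processes may need noticeably more memory in total than the number printed.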

If you have any other problems, please let me know. Thanks!

enzovalentino commented 2 months ago

Yes, now it works perfectly. Many thanks!

liaoherui / StrainScan, issue #22