h836472 / ContScout

ContScout sequence contamination filter tool
GNU General Public License v3.0

memory limit error #7

Closed Sanrrone closed 2 months ago

Sanrrone commented 2 months ago

Dear developers, greetings. I am getting a memory-related error even though I have enough resources.

error log:

Error: Option is not permitted for this workflow: memory-limit
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'Streptococcus_mitis_SK1080_tax_1008453_28Jul_2024_22_56/RefTaxLookup.ABC': No such file or directory
Execution halted

stdout log:

This is ContScout, a contamination remover tool written in R.

Loading R libraries.

Temporary dir set to:/tmp.
Pre-processing NCBI taxon database.
Query taxon lineage:
 family:1300:Streptococcaceae
 order:186826:Lactobacillales
 class:91061:Bacilli
 phylum:1239:Bacillota
 kingdom:vk_1239:NA
 superkingdom:2:Bacteria

Analysis started at 2024-07-28 22:56:32
Command: -u /databases -d uniref100 -i query -q 1008453 -m 250G --pci 50 -c 2 -t /tmp -a diamond
Databases used:
Name: uniref100
 Source: Uniprot
 NumProts: 398132654
 DB_CRC: 210bc2cd
 Tax_CRC: a456430b
 MMSeqs_DB: uniref100/78df74ca/mmseqs/210bc2cd_uniref100_2024_02_tax.taxdb
 Diamond_DB: uniref100/78df74ca/diamond/210bc2cd_uniref100_2024_02_tax.taxdb.dmnd
 Creation_Date: 2024-04-28_23:53:17
Now reading fasta headers file.
Now reading annotation file.
Search command:
diamond blastp -d /databases/uniref100/78df74ca/diamond/210bc2cd_uniref100_2024_02_tax.taxdb.dmnd -q protein_seq/1008453_Streptococcus_mitis.faa -o Streptococcus_mitis_SK1080_tax_1008453_28Jul_2024_22_56/RefTaxLookup.ABC --max-target-seqs 100 -f 6 qseqid sseqid bitscore qlen nident --threads 2  --memory-limit 250

command line:

singularity exec -B $new/software/ContScout:/databases -B $new/tmp:/tmp -B $(pwd)/query:/query $new/software/ContScout/contscout_latest.sif ContScout -u /databases -d uniref100 -i query -q $tid -m 250G --pci 50 -c 2 -t /tmp -a diamond

Is it a memory issue?

Thanks in advance, Sandro

h836472 commented 2 months ago

Dear Sandro,

I will try to reproduce and diagnose the error, but I already suspect that it is the Diamond search step that fails, with the error message "Option is not permitted for this workflow: memory-limit". I need to look into this, but the error suggests that Diamond does not support memory limitation for blastp searches. In the meantime, here are some options you might wish to try.

1. If you really need memory limitation, please try MMSeqs as the search tool (that was the configuration we used on HPC nodes for multiple simultaneous ContScout runs with shared memory).
2. Start ContScout without specifying any memory limit option.

If you plan to run a single instance of ContScout on your server, you can safely omit the memory limit option. This option was included only for the special case where multiple ContScout instances run simultaneously on the same machine, sharing a common large (>786 GB) memory pool. There, by default, each instance would configure itself to use all available RAM, causing clashes among the pipelines. Thus we manually set a memory limit that tells the aligner to proceed with smaller data chunks, ensuring that all instances fit nicely next to each other.
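
For reference, applied to your exact Singularity call, the two options would look roughly like the sketches below (the -a mmseqs value is an assumption on my part; please double-check the accepted aligner names in the ContScout documentation).

Option 1, memory limit kept but the aligner switched to MMSeqs:

singularity exec -B $new/software/ContScout:/databases -B $new/tmp:/tmp -B $(pwd)/query:/query $new/software/ContScout/contscout_latest.sif ContScout -u /databases -d uniref100 -i query -q $tid -m 250G --pci 50 -c 2 -t /tmp -a mmseqs

Option 2, memory limit omitted, Diamond kept:

singularity exec -B $new/software/ContScout:/databases -B $new/tmp:/tmp -B $(pwd)/query:/query $new/software/ContScout/contscout_latest.sif ContScout -u /databases -d uniref100 -i query -q $tid --pci 50 -c 2 -t /tmp -a diamond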

Balazs

h836472 commented 2 months ago

Dear Sandro,

I have managed to reproduce the error you reported. The issue is indeed caused by Diamond, in a rather odd way: some versions support RAM limitation with blastp searches while others raise the error message that you encountered. How nice!

- Version 2.1.8, which is integrated in the current ContScout Docker "latest" image, is affected.
- The current most recent Diamond release, 2.1.9, is affected too.
- The earlier 2.0.4 release runs without any problem.

What can we do now?

1. I shall definitely send a bug report to the Diamond developers, as they should know about the issue.
2. I believe you can safely omit the memory limit option (see my previous comment for details).
3. If needed, you can use the memory limit functionality with MMSeqs in the current ContScout.
4. If you wish to use Diamond with a memory limit, you can force the use of version 2.0.4 by modifying the Docker script line
RUN curl -s https://api.github.com/repos/bbuchfink/diamond/releases/latest | sed -Ene '/^ "tag_name": "v(.+)",$/s//\1/p' >/data/version.txt
to
RUN echo 2.0.4 >/data/version.txt
(see the rebuild sketch below).
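
If you go with option 4, a minimal rebuild-and-check sketch could look like this (the image and file names below are placeholders of mine, and I assume you rebuild from the edited Dockerfile and convert the image from your local Docker daemon to a Singularity image, the same way you obtained contscout_latest.sif):

docker build -t contscout:diamond-2.0.4 .

singularity build contscout_diamond204.sif docker-daemon://contscout:diamond-2.0.4

singularity exec contscout_diamond204.sif diamond version

The last command should report Diamond 2.0.4 inside the container before you start any real run.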

I could also hard-wire the version number in the official Docker script and image, but I would prefer to see the Diamond team actually resolve the issue in an upcoming Diamond release.

Yours,

Balazs

h836472 commented 2 months ago

Dear Sandro,

I have filed an issue ticket for Diamond on GitHub. However, after looking into the command help output of various versions, it looks like support for --memory-limit has been intentionally removed from the diamond blastp workflow, while it is still available for the diamond cluster workflow. I have asked the developers about their future plans: is support for --memory-limit indeed discontinued for blastp searches? We shall wait for their response, but I think I will have to modify ContScout so that it only offers the memory limit feature with MMSeqs.

Balazs

Sanrrone commented 2 months ago

Dear Balazs, thank you for the support; the Diamond "bug" is indeed a surprise. Removing the memory limit flag makes the software work :tada:. The run times are a bit slow, though: I have now set uniref90 and 4 cores, and it takes around 1-2 bacteria per hour using Diamond, and about 1 bacterium every 3 hours using MMSeqs. That's why I picked Diamond over MMSeqs for this task. I have a list of 4000 bacteria, so I might need multiple instances. I will take the Diamond downgrade option for now.

thank you very much!

h836472 commented 2 months ago

Dear Sandro,

What are the specs of the server you are using? If you have 256 GB of RAM, I would expect way more than 4 usable CPUs. Aligners scale up quite nicely, so you can gain extra speed by giving them more CPUs. In our benchmarks, we achieved around 2-3 bacterial genomes per hour with MMSeqs using 24 cores, with RAM limited to 150 GB. The issue with MMSeqs is that as the available RAM goes down, smaller data chunks are used, which translates into an increased number of processing rounds.

Either way, a list of 4000 bacterial genomes sounds like a good reason to request supercomputer access. One example supporting academic HPC access in Europe is EuroHPC. More info: https://eurohpc-ju.europa.eu/access-our-supercomputers/access-policy-and-faq_en

Yours,

Balazs

Sanrrone commented 2 months ago

I have access to an HPC server; there is a node with 1.5 TB of RAM if needed. My only limitation is time (deadlines are coming). Even if I could run 10 instances of MMSeqs, that would take around 5 days (assuming no errors in between), so I think it would be better if I can get Diamond to work.

h836472 commented 2 months ago

Dear Sandro,

If you manage to downgrade Diamond within the ContScout image, you can easily run 5-6 ContScout instances with Diamond on a single node, each with the memory limited to 250 GB (my suggestion: 8 instances, each with 180 GB of RAM). Depending on the CPU configuration, I expect you can assign at least 8 CPUs per instance. Reminder: please do not go beyond the number of physical CPU cores (i.e. forget about the extra "virtual" HyperThreading cores). Let me know if you need help with the Diamond downgrade within ContScout.

On HPC, I usually use a Python MPI task feeder (taken from Luca-s: https://github.com/luca-s/mpi-master-slave/blob/master/examples/example5.py). That way, one can spawn several HPC nodes, each with multiple worker "units". The unit with MPI rank 0 acts as the boss: it reads the "list of tasks" file and assigns tasks to the worker units one by one in a round-robin manner. Sending the tasks and the "I am done" messages via MPI is quick and efficient, so the system scales up really nicely. Modifying Luca's original script is not difficult, but if you need my example version, please let me know. With such a multi-node MPI infrastructure, I speculate one could finish thousands of bacterial genomes in a matter of a few days.
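
As a rough single-node illustration (a sketch, not my exact setup): assuming the downgraded-Diamond image from my earlier comment (here called contscout_diamond204.sif, a placeholder name), a plain-text file taxid_list.txt with one NCBI taxon ID per line, and all query proteomes sitting in the same bound query directory, something like the following would keep 8 instances running side by side, each with 8 CPUs and a 180 GB memory limit:

xargs -a taxid_list.txt -P 8 -I {} singularity exec -B $new/software/ContScout:/databases -B $new/tmp:/tmp -B $(pwd)/query:/query $new/software/ContScout/contscout_diamond204.sif ContScout -u /databases -d uniref100 -i query -q {} -m 180G --pci 50 -c 8 -t /tmp -a diamond

Here xargs substitutes each taxon ID for {} and keeps at most 8 jobs alive at any time; the MPI feeder does the same job, but across multiple nodes.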

Balazs

On Mon, 29 Jul 2024 at 16:21, Sandro Valenzuela @.***> wrote:

I have access to a HPC server, there is a node with 1.5T of ram if needed. My only limitation is the time (deadlines are coming). Even though I could run 10 instances of mmseq, that means it would take around 5 days (assuming no errors in between). Which I think it could be better if I can make diamond works.

— Reply to this email directly, view it on GitHub https://github.com/h836472/ContScout/issues/7#issuecomment-2256080557, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL2BSTGLI5ZCQHZLV34BIKDZOZFWRAVCNFSM6AAAAABLTCR65SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENJWGA4DANJVG4 . You are receiving this because you commented.Message ID: @.***>

h836472 commented 2 months ago

The code has been modified so that ContScout only accepts the -m / --memlimit option when the MMSeqs aligner is selected. The development version has been published in the "develop" branch; after further testing, it will be merged into the "main" branch.
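
With that change, a memory-limited run combines -m with the MMSeqs aligner roughly as follows (a sketch based on the commands earlier in this thread; the -a mmseqs value should be double-checked against the documentation), whereas Diamond runs simply omit -m / --memlimit:

singularity exec -B $new/software/ContScout:/databases -B $new/tmp:/tmp -B $(pwd)/query:/query $new/software/ContScout/contscout_latest.sif ContScout -u /databases -d uniref100 -i query -q $tid -m 250G --pci 50 -c 2 -t /tmp -a mmseqs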