sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Cant open file: _clustered.clstr #539

Open EricDeveaud opened 3 years ago

EricDeveaud commented 3 years ago

Hello, running Roary/1.13.0

We have a problem running Roary: it stops with the following error message:

Iteratively run cd-hit
Cant open file: _clustered.clstrParallel all against all blast

After digging into the code I noticed that (1) Roary.pm does not test the return code of the cd-hit command, and (2) cd-hit fails because the memory requested with -M is too low. See what happens when I run the cd-hit command manually:

[gensoft@cc118a2a4dd9 CS_pour_ED]$ /opt/gensoft/exe/cd-hit/4.6.1/bin/cd-hit -i _combined_files -o _clustered -T 32 -M 2916 -g 1 -s 1 ^C 256 -c 1
[gensoft@cc118a2a4dd9 CS_pour_ED]$ cd test234/
[gensoft@cc118a2a4dd9 test234]$ /opt/gensoft/exe/cd-hit/4.6.8/bin/cd-hit -i _combined_files -o _clustered -T 32 -M 2916 -g 1 -s 1 -d 256 -c 1
================================================================
Program: CD-HIT, V4.7 (+OpenMP), Jul 12 2021, 08:06:52
Command: /opt/gensoft/exe/cd-hit/4.6.8/bin/cd-hit -i
         _combined_files -o _clustered -T 32 -M 2916 -g 1 -s 1
         -d 256 -c 1

Started: Mon Jul 12 11:57:14 2021
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 956806
longest and shortest : 4323 and 29
Total letters: 303089936
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 426M
Buffer          : 32 X 172M = 5516M
Table           : 2 X 80M = 161M
Miscellaneous   : 12M
Total           : 6117M

Fatal Error:
not enough memory, please set -M option greater than 6217

Program halted !!

[gensoft@cc118a2a4dd9 test234]$ echo $?
1

I'm not fluent enough in Perl to dig further. Hope this helps.
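
For illustration, the missing return-code check from point (1) could look roughly like the sketch below. It is written in Python rather than Roary's Perl, `run_checked` is a hypothetical helper and not part of Roary, and the failing command is a stand-in child process rather than a real cd-hit call.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run an external command and raise if it exits with a non-zero status."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure instead of carrying on to the next pipeline step.
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}:\n"
            f"{result.stdout}{result.stderr}"
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for the failing cd-hit call: a child process that exits with 1.
    failing = [
        sys.executable,
        "-c",
        "import sys; print('Program halted !!'); sys.exit(1)",
    ]
    try:
        run_checked(failing)
    except RuntimeError as err:
        print(f"stopped early: {err}")
```

With a check of this kind the run would stop at the cd-hit step and show its "not enough memory" message, instead of failing later on the missing _clustered.clstr file.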

NB: for information, here is the roary -a output:

[gensoft@cc118a2a4dd9 test234]$ roary -a 
Please cite Roary if you use any of the results it produces:
    Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill,
        "Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693
    doi: http://doi.org/10.1093/bioinformatics/btv421
        Pubmed: 26198102

2021/07/12 12:01:39 Looking for 'Rscript' - found /opt/gensoft/exe/R/3.6.2/bin/Rscript
2021/07/12 12:01:39 Determined Rscript version is 3.6
2021/07/12 12:01:39 Looking for 'awk' - found /usr/bin/awk
2021/07/12 12:01:39 Looking for 'bedtools' - found /opt/gensoft/exe/bedtools/2.29.2/bin/bedtools
2021/07/12 12:01:39 Determined bedtools version is 2.29
2021/07/12 12:01:39 Looking for 'blastp' - found /opt/gensoft/exe/blast+/2.10.0/bin/blastp
2021/07/12 12:01:39 Determined blastp version is 2.10.0
2021/07/12 12:01:39 Looking for 'grep' - found /usr/bin/grep
2021/07/12 12:01:39 Optional tool 'kraken' not found in your $PATH
2021/07/12 12:01:39 Optional tool 'kraken-report' not found in your $PATH
2021/07/12 12:01:39 Looking for 'mafft' - found /opt/gensoft/exe/mafft/7.453/bin/mafft
2021/07/12 12:01:40 Determined mafft version is 7.453
2021/07/12 12:01:40 Looking for 'makeblastdb' - found /opt/gensoft/exe/blast+/2.10.0/bin/makeblastdb
2021/07/12 12:01:40 Determined makeblastdb version is 2.10.0
2021/07/12 12:01:40 Looking for 'mcl' - found /opt/gensoft/exe/mcl/14-137/bin/mcl
2021/07/12 12:01:40 Determined mcl version is 14-137
2021/07/12 12:01:40 Looking for 'parallel' - found /opt/gensoft/exe/parallel/20200222/bin/parallel
2021/07/12 12:01:40 Determined parallel version is 20200222
2021/07/12 12:01:40 Looking for 'prank' - found /opt/gensoft/exe/prank/170427/bin/prank
2021/07/12 12:01:40 Determined prank version is 170427
2021/07/12 12:01:40 Looking for 'sed' - found /usr/bin/sed
2021/07/12 12:01:40 Looking for 'cd-hit' - found /opt/gensoft/exe/cd-hit/4.6.8/bin/cd-hit
2021/07/12 12:01:40 Determined cd-hit version is 4.7
2021/07/12 12:01:40 Looking for 'FastTree' - found /opt/gensoft/exe/FastTree/2.1.11/bin/FastTree
2021/07/12 12:01:40 Determined FastTree version is 2.1
2021/07/12 12:01:40 Roary version 1.7.7

regards

Eric

EricDeveaud commented 3 years ago

Back on this topic. The problem is the memory computation performed by Bio::Roary::External::Cdhit, which no longer matches the memory estimation done by newer cd-hit versions. See how cd-hit now performs the memory estimation:

size_t SequenceDB::MinimalMemory( int frag_no, int bsize, int T, const Options & options, size_t extra )
{
        int N = sequences.size();
        int F = frag_no < MAX_TABLE_SEQ ? frag_no : MAX_TABLE_SEQ;
        size_t mem_need = 0;
        size_t mem, mega = 1000000;
        int table = T > 1 ? 2 : 1;

        printf( "\nApproximated minimal memory consumption:\n" );
        mem = N*sizeof(Sequence) + total_desc + N + extra;
        if( options.store_disk == false ) mem += total_letter + N;
        printf( "%-16s: %zuM\n", "Sequence", mem/mega );
        mem_need += mem;

        mem = bsize;
        printf( "%-16s: %i X %zuM = %zuM\n", "Buffer", T, mem/mega, T*mem/mega );
        mem_need += T*mem;

        mem = F*(sizeof(Sequence*) + sizeof(IndexCount)) + NAAN*sizeof(NVector<IndexCount>);
        printf( "%-16s: %i X %zuM = %zuM\n", "Table", table, mem/mega, table*mem/mega );
        mem_need += table*mem;

        mem = sequences.capacity()*sizeof(Sequence*) + N*sizeof(int);
        mem += Comp_AAN_idx.size()*sizeof(int);
        printf( "%-16s: %zuM\n", "Miscellaneous", mem/mega );
        mem_need += mem;

        printf( "%-16s: %zuM\n\n", "Total", mem_need/mega );

        if(options.max_memory and options.max_memory < mem_need + 50*table ){
                char msg[200];
                sprintf( msg, "not enough memory, please set -M option greater than %zu\n",
                                50*table + mem_need/mega );
                bomb_error(msg);
        }
        return mem_need;
}

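So just taking into account the number of characters in the input file (_combined_files), as $memory_required = -s $filename; does, is no longer sufficient. In the run above, the per-thread buffers alone (32 X 172M = 5516M) already exceed the 2916 passed to -M.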
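
To make the arithmetic concrete, here is a rough Python sketch of the threshold computed at the end of MinimalMemory above. The function name `m_option_threshold` and its parameters are made up for illustration; the inputs are the rounded per-component figures from the log earlier in this thread, not values read from cd-hit itself.

```python
MEGA = 1_000_000  # cd-hit reports "M" as 10**6 bytes (mega = 1000000 above)


def m_option_threshold(sequence_bytes, buffer_bytes, threads, table_bytes, misc_bytes):
    """Value printed in 'please set -M option greater than N' (illustrative)."""
    table_copies = 2 if threads > 1 else 1  # mirrors `table = T > 1 ? 2 : 1`
    mem_need = (
        sequence_bytes
        + threads * buffer_bytes        # one buffer per thread
        + table_copies * table_bytes
        + misc_bytes
    )
    return 50 * table_copies + mem_need // MEGA


if __name__ == "__main__":
    # Rounded figures from the log: Sequence 426M, Buffer 172M per thread,
    # Table 80M per copy, Miscellaneous 12M, with -T 32.
    print(m_option_threshold(426 * MEGA, 172 * MEGA, 32, 80 * MEGA, 12 * MEGA))
```

With these rounded inputs the sketch gives 6202, close to the 6217 that cd-hit reported; the gap presumably comes from the log truncating each component to whole megabytes. Either way, the buffer term grows with the number of threads, which the size of the input file alone cannot capture.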

regards

Eric

EricDeveaud commented 3 years ago

I finally hacked lib/Bio/Roary/External/Cdhit.pm to force the memory limit to unlimited (-M 0) ;-) Harsh but functional:

--- lib/Bio/Roary/External/Cdhit.pm.ori 2021-07-16 08:37:29.333069603 +0000
+++ lib/Bio/Roary/External/Cdhit.pm     2021-07-16 08:36:11.646064928 +0000
@@ -58,7 +58,10 @@
 {
   my ($self) = @_;
   my $memory_to_cdhit = int($self->memory_in_mb *0.9);
-  return $memory_to_cdhit;
+#  return $memory_to_cdhit;
+# memory estimation is wrong
+# force -M to 0. fix https://github.com/sanger-pathogens/Roary/issues/539
+  return 0;
 }

 sub clusters_filename
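
A gentler alternative to -M 0, sketched here in Python purely as an illustration, would be to read the value cd-hit asks for in its error message and retry with that. This is not what the patch above does and not something Roary implements; `required_memory_mb`, `next_memory_limit` and the 100 MB headroom are hypothetical choices.

```python
import re

# Matches the message printed by cd-hit in the log earlier in this thread.
_NEEDED = re.compile(r"please set -M option greater than (\d+)")


def required_memory_mb(cdhit_output):
    """Return the -M value named in cd-hit's 'not enough memory' error, or None."""
    match = _NEEDED.search(cdhit_output)
    return int(match.group(1)) if match else None


def next_memory_limit(cdhit_output, headroom_mb=100):
    """Pick a -M value for a retry; headroom_mb is an arbitrary margin."""
    needed = required_memory_mb(cdhit_output)
    return None if needed is None else needed + headroom_mb


if __name__ == "__main__":
    log = (
        "Fatal Error:\n"
        "not enough memory, please set -M option greater than 6217\n"
        "\n"
        "Program halted !!\n"
    )
    print(required_memory_mb(log))
    print(next_memory_limit(log))
```

For the error shown earlier this yields 6217 and, with the assumed headroom, 6317 as the next value to try. It still depends on the machine actually having that much memory available.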

ndusek commented 8 months ago

This thread saved me a ton of time. Thanks, @EricDeveaud!