sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Crash while allocating memory #239

Open menzzana opened 8 years ago

menzzana commented 8 years ago

Hi

We are using the latest version of Ray on nodes with 2 TB of RAM to assemble a snake genome. Ray was compiled with GCC 5.1 using the following make command: make PREFIX=/afs/ MAXKMERLENGTH=128 MPICXX=mpic++ HAVE_LIBZ=y MPI_IO=y

Everything worked fine, except that when running on this large dataset we get the following error::

Critical exception: The system is out of memory, returned NULL. Requested -2147483648 bytes of type RAY_MALLOC_TYPE_GRID_TABLE


Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.



mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[19389,1],0] Exit code: 42


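This seems to be a memory issue, yet we could see that not all of the 2 TB of RAM was in use. The requested size of -2147483648 bytes is the smallest value a signed 32-bit integer can hold, which suggests that the byte count overflowed a 32-bit integer rather than that memory was actually exhausted. We made the following changes to Ray: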

. Compiled with the Intel compiler rather than the GNU compiler:

 i-compilers 15.0.2 and intelmpi 5.0.3

. Compiled the code with the flag -mcmodel=medium. The full command was::

 make PREFIX=/afs/<your_preferred_install_directory> MAXKMERLENGTH=128 MPICXX=mpiicpc \
      HAVE_LIBZ=y MPI_IO=y CXXFLAGS='-O3 -std=c++98 -Wall -g -mcmodel=medium'

. Changed line 571 in RayPlatform/RayPlatform/structures/MyHashTable.h to:

 size_t requiredBytes=sizeof(MyHashTableGroup<KEY,VALUE>)*(size_t)m_numberOfGroups;

. In RayPlatform/RayPlatform/memory/allocator.h, added:

 #include <stddef.h>

. In RayPlatform/RayPlatform/memory/allocator.h, changed line 28 to:

void*__Malloc(size_t c,const char*description,bool show);

. In RayPlatform/RayPlatform/memory/allocator.cpp, changed line 36 to:

void*__Malloc(size_t c,const char*description,bool show){

. In RayPlatform/RayPlatform/memory/allocator.cpp, changed line 56 to:

printf("%s %i\t%s\t%zu bytes, ret\t%p\t%s\n",__FILE__,__LINE__,__func__,c,a,description);
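To illustrate why these changes matter, here is a minimal standalone sketch (not Ray code) of what I believe is happening. The `Group` struct, its 16-byte size, the group count of 134217728 and the helper names are all hypothetical and chosen only so that the numbers reproduce the value in the error message; the sketch also assumes a 64-bit platform and a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: a 64-bit platform, where size_t can hold sizes above 2 GiB.
static_assert(sizeof(std::size_t) >= 8, "this sketch assumes a 64-bit size_t");

// Hypothetical stand-in for MyHashTableGroup<KEY,VALUE>; only its size matters.
struct Group {
    char payload[16];
};

// Reinterprets the low 32 bits of a byte count as a signed 32-bit value,
// which is effectively what happens when the count is passed around as an int.
std::int64_t asSigned32(std::uint64_t bytes) {
    std::uint32_t low = static_cast<std::uint32_t>(bytes);  // modulo 2^32
    if (low >= 0x80000000u) {
        return static_cast<std::int64_t>(low) - 4294967296LL;
    }
    return static_cast<std::int64_t>(low);
}

// Computes the byte count entirely in size_t, in the spirit of the
// modified line in MyHashTable.h.
std::size_t requiredBytes(std::size_t numberOfGroups) {
    return sizeof(Group) * numberOfGroups;
}

// Formats a size_t with %zu, the conversion used in the modified printf call.
std::string formatBytes(std::size_t bytes) {
    char buffer[64];
    int written = std::snprintf(buffer, sizeof(buffer), "%zu bytes", bytes);
    assert(written > 0 && written < static_cast<int>(sizeof(buffer)));
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

With these hypothetical numbers, `requiredBytes(134217728)` is 2147483648 (2 GiB), while `asSigned32` applied to the same count gives -2147483648, the value reported in the exception. The real group size and count in Ray may of course differ.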

For consistency, perhaps we should use uint64_t rather than size_t, since I see that other parts of the source code already use it.
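As a rough sketch of what the uint64_t variant could look like (again not Ray code, and the helper names `describeBytes` and `toUint64` are hypothetical): on a 64-bit platform size_t and uint64_t have the same width, so the main practical difference is the printf conversion, which becomes `PRIu64` from `<cinttypes>` instead of `%zu`. This assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: size_t is not wider than 64 bits, so widening loses nothing.
static_assert(sizeof(std::size_t) <= sizeof(std::uint64_t),
              "size_t must fit in uint64_t for this sketch");

// Hypothetical helper: widens a size_t byte count to uint64_t.
std::uint64_t toUint64(std::size_t bytes) {
    return static_cast<std::uint64_t>(bytes);
}

// Hypothetical helper: formats a uint64_t byte count portably with PRIu64.
std::string describeBytes(std::uint64_t bytes) {
    char buffer[64];
    int written = std::snprintf(buffer, sizeof(buffer), "%" PRIu64 " bytes", bytes);
    assert(written > 0 && written < static_cast<int>(sizeof(buffer)));
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

Note that malloc itself takes a size_t, so a uint64_t parameter would still be converted at the allocation call; on 64-bit nodes that conversion is lossless.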

The assembly has now been running for 18 days and has not produced any errors so far. Do you have any thoughts about this matter?

With kind regards,
Henric Zazzi