uwsampa / grappa

Grappa: scaling irregular applications on commodity clusters
grappa.io
BSD 3-Clause "New" or "Revised" License

Segfault during loading graph datafile on single node #176

Open alexfrolov opened 10 years ago

alexfrolov commented 10 years ago

Hi!

I received a segfault when I tried to run this:

frolo@A11:~/grappa.master/build/Make+Release> /local/scratch/frolo/igor/20755/grappa_srun --nnode 1 --ppn 1 --partition all --no-freeze-on-error -- /local/scratch/frolo/igor/20755/bfs.exe --metrics --vmodule graphlab=1 --max_degree_source --global_memory_use_hugepages=0 --num_starting_workers=512 --loop_threshold=1024 --aggregator_autoflush_ticks=3000000 --aggregator_max_flush=0 --periodic_poll_ticks=200000 --chunk_size=100 --load_balance=steal --flush_on_idle=0 --poll_on_idle=1 --rdma_workers_per_core=16 --target_size=4096 --rdma_buffers_per_core=16 --rdma_threshold=64 --shared_pool_chunk_size=8192 --stack_size=524288 --global_heap_fraction=0.15 --shared_pool_memory_fraction=0.35 --flatten_completions=1 --log2_concurrent_receives=7 --path=graph.kronecker.scale-24.bintsv4 --format=bintsv4 --max_iterations=1024 --trials=1
srun --cpu_bind=rank --label --kill-on-bad-exit --task-prolog=/local/scratch/frolo/igor/20755/srun_prolog.rb --task-epilog=/local/scratch/frolo/igor/20755/srun_epilog.sh --partition=all --nodes=1 --ntasks-per-node=1 --  /local/scratch/frolo/igor/20755/bfs.exe --metrics --vmodule graphlab=1 --max_degree_source --global_memory_use_hugepages=0 --num_starting_workers=512 --loop_threshold=1024 --aggregator_autoflush_ticks=3000000 --aggregator_max_flush=0 --periodic_poll_ticks=200000 --chunk_size=100 --load_balance=steal --flush_on_idle=0 --poll_on_idle=1 --rdma_workers_per_core=16 --target_size=4096 --rdma_buffers_per_core=16 --rdma_threshold=64 --shared_pool_chunk_size=8192 --stack_size=524288 --global_heap_fraction=0.15 --shared_pool_memory_fraction=0.35 --flatten_completions=1 --log2_concurrent_receives=7 --path=graph.kronecker.scale-24.bintsv4 --format=bintsv4 --max_iterations=1024 --trials=1
srun: cluster configuration lacks support for cpu binding
0: I0708 16:18:39.879088  5976 bfs.cpp:96] loading graph.kronecker.scale-24.bintsv4
0: *** Aborted at 1404821919 (unix time) try "date -d @1404821919" if you are using GNU date ***
0: PC: @     0x7f580231cd6d google::DumpStackTrace()
0:     @           0x49b693 Grappa::impl::failure_sighandler()
0:     @     0x7f58020f87c0 (unknown)
0:     @           0x482cdd Allocator::malloc()
0:     @           0x4885bf _ZN6Grappa4impl4callIZN15GlobalAllocator13remote_mallocEmEUlvE_13GlobalAddressIvEEET0_sT_MS7_KFS6_vE
0:     @           0x4c506f Grappa::global_alloc<>()
0:     @           0x4c0923 Grappa::TupleGraph::load_generic()
0:     @           0x4c3911 Grappa::TupleGraph::Load()
0:     @           0x477f28 _ZZ4mainENKUlvE_clEv.isra.345
0:     @           0x47818e _ZN6Grappa4implL18task_functor_proxyIZNS_3runIZ4mainEUlvE_EEvT_EUlvE_EEvmmm
0:     @           0x4bae73 Grappa::impl::workerLoop()
0:     @           0x4b757e Grappa::impl::tramp()
0:     @           0x4bc9cc (unknown)
0: I0708 16:18:39.881263  5976 Grappa.cpp:212] Exiting due to signal 11
srun: error: A11: task 0: Segmentation fault (core dumped)
srun: Terminating job step 179.0

Running on a greater number of nodes (2, 4, 8) is OK.

Best, Alex

bholt commented 10 years ago

Probably running out of memory with only one node. If you enable "--v=1" logging, what does it say you have for free space?

alexfrolov commented 10 years ago

--v=1 results:

0: I0711 13:41:53.870179  9370 Grappa.cpp:504] 
0: -------------------------
0: Shared memory breakdown:
0:   locale shared heap total:     39.3379 GB
0:   locale shared heap per core:  39.3379 GB
0:   communicator per core:        0.125 GB
0:   tasks per core:               0.0161209 GB
0:   global heap per core:         5.90069 GB
0:   aggregator per core:          0.00843048 GB
0:   shared_pool current per core: 4.76837e-07 GB
0:   shared_pool max per core:     13.7683 GB
0:   free per locale:              33.2877 GB
0:   free per core:                33.2877 GB
0: -------------------------
bholt commented 10 years ago

That's more than enough space for a scale-24 Kronecker graph -- ~4 GB for the edges (TupleGraph) and then about the same again required for graph construction.

Could be an issue where we're counting on >1 core. Does it work with 2 cores?

You could also set the environment variable GRAPPA_FREEZE_ON_ERROR=1, then attach to the segfaulting process in gdb:

0: PC: @     0x7ffff6ffd6ed google::DumpStackTrace()
0:     @           0x43e1e3 Grappa::impl::failure_sighandler()
0:     @       0x3c9bc0f710 (unknown)
0:     @           0x432cd8 _ZZN6Grappa8delegate4readILNS_8SyncModeE0EXadL_ZNS_4impl9local_gceEEElEET1_13GlobalAddressIS4_EENKUlvE_clEv
0:     @           0x4370a8 _ZN6Grappa4impl4callIZNS_8delegate4readILNS_8SyncModeE0EXadL_ZNS0_9local_gceEEElEET1_13GlobalAddressIS5_EEUlvE_lEET0_sT_MSA_KFS9_vE
0:     @           0x432864 _ZN6Grappa4implL18task_functor_proxyIZNS_3runIZ4mainEUlvE_EEvT_EUlvE_EEvmmm
0:     @           0x45d953 Grappa::impl::workerLoop()
0:     @           0x45a0be Grappa::impl::tramp()
0:     @           0x45f0dd (unknown)
0: I0711 09:05:59.360354 19184 Grappa.cpp:184] n01:19184 freezing for debugger. Set freeze_flag=false to continue.
# (press control-z)
^Z
[1]+  Stopped                 grappa_run -f -v -n2 -p2 -- applications/demos/hello_world.exe
# get the node and process id from the "n01:19184 freezing for debugger" line.
❯ ssh -t n01 gdb attach 19184

This would let you find the precise line number of the failure, etc.

bholt commented 10 years ago

Unfortunately we don't have a great way to dump stack traces for lambdas... I'd like to link against gdb/lldb and get full, legit backtraces, but I never have the time.

alexfrolov commented 10 years ago

[A little bit off-topic] Which version of Slurm do you use? Mine [slurm 14.11.0-pre1] seems to have problems with the --task-prolog option.

bholt commented 10 years ago

Haha, apparently our 2 main clusters are on 2.3.3 and 2.4.3. Although it looks to me like it's still an option, at least according to the SchedMD website (http://slurm.schedmd.com/prolog_epilog.html). Wonder what's going on.

alexfrolov commented 10 years ago

Yeah, 2.3.3 and 2.4.3 are still widely used. I have found my problem anyway; Slurm is OK.

alexfrolov commented 10 years ago

#0  0x00007f003708dc0d in nanosleep () from /lib64/libc.so.6
#1  0x00007f003708da2c in sleep () from /lib64/libc.so.6
#2  0x000000000049b5ba in Grappa::impl::freeze_for_debugger() () at /home/frolo/grappa.master/system/Grappa.cpp:190
#3  0x000000000049b6e9 in Grappa::impl::failure_sighandler(int) () at /home/frolo/grappa.master/system/Grappa.cpp:210
#4  <signal handler called>
#5  std::__detail::_List_node_base::_M_unhook() () at /home/frolo/gcc-4.8.2/libstdc++-v3/src/c++98/list.cc:139
#6  0x0000000000482a98 in Allocator::remove_from_free_list(std::_Rb_tree_iterator<std::pair<long const, AllocatorChunk> > const&) ()
    at /local/usr/local/include/c++/4.8.2/bits/stl_list.h:1570
#7  0x0000000000482f35 in Allocator::malloc(unsigned long) () at /home/frolo/grappa.master/system/Allocator.hpp:240
#8  0x00000000004885bf in GlobalAddress<void> Grappa::impl::call<GlobalAllocator::remote_malloc(unsigned long)::{lambda()#1}, GlobalAddress<void> >(short, GlobalAllocator::remote_malloc(unsigned long)::{lambda()#1}, GlobalAddress<void> (GlobalAddress<void>::*)() const) () at /home/frolo/grappa.master/system/GlobalAllocator.hpp:48
#9  0x00000000004c518f in GlobalAddress<Grappa::TupleGraph::Edge> Grappa::global_alloc<Grappa::TupleGraph::Edge>(unsigned long) ()
    at /home/frolo/grappa.master/system/DelegateBase.hpp:162
#10 0x00000000004c0a21 in Grappa::TupleGraph::load_generic(std::string, void (*)(char const*, Grappa::TupleGraph::Edge*, Grappa::TupleGraph::Edge*)) ()
    at /home/frolo/grappa.master/system/graph/TupleGraph.hpp:82
#11 0x00000000004c3a31 in Grappa::TupleGraph::Load(std::string, std::string) () at /home/frolo/grappa.master/system/graph/TupleGraph.cpp:689
#12 0x0000000000477f28 in main::{lambda()#1}::operator() () at /home/frolo/grappa.master/applications/graphlab/bfs.cpp:97
#13 0x000000000047818e in void Grappa::impl::task_functor_proxy<void Grappa::run<main::{lambda()#1}>(main::{lambda()#1})::{lambda()#1}>(unsigned long, unsigned long, unsigned long)
    () at /home/frolo/grappa.master/system/Tasking.hpp:201
#14 0x00000000004baee3 in Grappa::impl::workerLoop(Grappa::Worker*, void*) () at /home/frolo/grappa.master/system/tasks/Task.hpp:82
#15 0x00000000004b75ee in Grappa::impl::tramp(Grappa::Worker*, void*) () at /home/frolo/grappa.master/system/Worker.hpp:226
#16 0x00000000004bca3c in _makestack () at /home/frolo/grappa.master/system/stack.S:208
#17 0x0000000000000000 in ?? ()

I will dig deeper into this asap.

alexfrolov commented 10 years ago

Hm... gdb does not see local variables when I switch to the specified frame...

(gdb) frame 6
#6  0x00000000004aecbd in GlobalAllocator::remote_malloc(unsigned long)::{lambda()#1}::operator()() const () at /home/frolo/grappa.master/system/GlobalAllocator.hpp:48
48      intptr_t address = reinterpret_cast< intptr_t >( a_p_->malloc( size ) );
(gdb) list
43    boost::scoped_ptr< Allocator > a_p_;
44  
45    /// allocate some number of bytes from local heap
46    /// (should be called only on node responsible for allocator)
47    GlobalAddress< void > local_malloc( size_t size ) {
48      intptr_t address = reinterpret_cast< intptr_t >( a_p_->malloc( size ) );
49      GlobalAddress< void > ga = GlobalAddress< void >::Raw( address );
50      return ga;
51    }
52  
(gdb) p size
No symbol "size" in current context.
(gdb) 

The same is true for Make+Debug...

bholt commented 10 years ago

huh. that's worked for me in the past.

alexfrolov commented 10 years ago

And it works for me in the present (I was using an old gdb with the newly built gcc) :-)

Returning to the segfault: I have been digging into it for a while and found the following:

  1. The Allocator (locale_shared_memory) allocates just 5.9 GB, which is 0.15 of 39 GB (SHMMAX) according to the global_heap_fraction option.
  2. The Allocator tries to malloc 8 GB and segfaults, because free_lists_.lower_bound( allocation_size ) returns end(), which is then dereferenced, causing the segfault.

I think a check should be added in Allocator::malloc on the return value of free_lists_.lower_bound( allocation_size ), along the lines of the sketch below.
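Roughly what I have in mind, reusing the names from the backtrace (a rough sketch only, not the actual Allocator code):

  auto it = free_lists_.lower_bound( allocation_size );
  CHECK( it != free_lists_.end() )
      << "global heap allocator out of space: no free chunk of at least "
      << allocation_size << " bytes (try a larger --global_heap_fraction)";
  // ... then continue with the existing remove_from_free_list / split path ...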

bholt commented 10 years ago

@nelsonje would know better about this allocator code. I've never understood why it sometimes seems to be out of space with so much remaining. Though usually it doesn't error with this large a fraction free. Are you saying it may just need to check the return value and retry?

This Boost SysV shared-memory allocator has been endlessly frustrating for us; we'd love to get rid of it, but that requires a serious rewrite.

nelsonje commented 10 years ago

@bholt, be careful not to confuse the LocaleSharedMemory allocator with the global heap allocator. The LocaleSharedMemory one is the one that gives us the "allocation of X failed with Y free" messages. The error that's happening here is in the global heap allocator. It still has its packing problems (since it somewhat stupidly requires the allocation of power-of-two sized chunks) but at least they make sense given the data structure layout.

This segfault is just the global heap allocator saying it ran out of space. I'll add a check here that makes that clear.

In the current design, all memory in the global heap is allocated when the job starts and can't be increased as the job runs. So in this case you've only got 5.9 GB of global heap. Since you have a bunch of free space, you can increase --global_heap_fraction to move more of that into the global heap. Try 0.5.
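For example, the same command as before with only that flag changed (a sketch; everything else elided):

  grappa_srun --nnode 1 --ppn 1 -- bfs.exe --global_heap_fraction=0.5 --path=graph.kronecker.scale-24.bintsv4 --format=bintsv4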

alexfrolov commented 10 years ago

@bholt and @nelsonje, thank you!
That check will help avoid some confusion with large graphs.

I have just moved a small step further and am now loading graph.kronecker.scale-26.bintsv4 (~24 GB for the TupleGraph). I reconfigured Grappa to use ~60 GB of memory (out of 80 GB total) and adjusted --global_heap_fraction=0.65 --shared_pool_memory_fraction=0.25.

But after the graph is loaded I get a bad_alloc exception, which is difficult to trace because Grappa won't freeze after it and just core dumps.

0: I0721 20:06:02.951681 16723 Grappa.cpp:262] memory registration disabled
0: I0721 20:06:03.148797 16723 Grappa.cpp:343] Communicator initialized.
0: I0721 20:06:03.148833 16723 Grappa.cpp:350] Aggregator initialized.
0: I0721 20:06:03.148847 16723 Grappa.cpp:407] nnode: 1, ppn: 1, iBs/node: 35.3544, total_iBs: 35.3544
0: I0721 20:06:03.148876 16723 Grappa.cpp:419] global_memory_size_bytes = 43928330240
0: I0721 20:06:03.148882 16723 Grappa.cpp:420] global_bytes_per_core = 43928330240
0: I0721 20:06:03.148887 16723 Grappa.cpp:421] global_bytes_per_locale = 43928330240
0: I0721 20:06:03.148895 16723 Grappa.cpp:439] Initializing tasking layer. num_starting_workers=512
0: I0721 20:06:03.148922 16723 Grappa.cpp:444] Scheduler initialized.
0: I0721 20:06:03.148929 16723 Grappa.cpp:449] RDMA aggregator initialized.
0: I0721 20:06:03.148941 16723 LocaleSharedMemory.cpp:67] Creating LocaleSharedMemory region GrappaLocaleSharedMemory with 67582050304 bytes on 0 of 1
0: I0721 20:06:03.149026 16723 LocaleSharedMemory.cpp:93] Created LocaleSharedMemory region GrappaLocaleSharedMemory with 67582050304 bytes on 0 of 1
0: I0721 20:06:03.155120 16723 Grappa.cpp:510] 
0: -------------------------
0: Shared memory breakdown:
0:   locale shared heap total:     62.9407 GB
0:   locale shared heap per core:  62.9407 GB
0:   communicator per core:        0.125 GB
0:   tasks per core:               0.0161209 GB
0:   global heap per core:         40.9114 GB
0:   aggregator per core:          0.00843048 GB
0:   shared_pool current per core: 4.76837e-07 GB
0:   shared_pool max per core:     15.7352 GB
0:   free per locale:              21.8797 GB
0:   free per core:                21.8797 GB
0: -------------------------
0: I0721 20:06:03.291826 16723 bfs.cpp:96] loading graph.kronecker.scale-26.bintsv4
0: I0721 20:06:03.291875 16723 TupleGraph.cpp:52] LOG(INFO) << load_generic called
0: I0721 20:06:03.291882 16723 TupleGraph.cpp:53] FLAGS_v = 3
0: I0721 20:06:18.277159 16723 bfs.cpp:98] done! loaded 1073741824 edges
0: I0721 20:06:18.277222 16723 Graph.hpp:611] Graph: undirected
0: I0721 20:06:46.470979 16723 Graph.hpp:624] find_nv_time: 28.1937
0: I0721 20:15:49.083591 16723 Graph.hpp:663] count_time: 540.488
0: terminate called after throwing an instance of 'std::bad_alloc'
0:   what():  std::bad_alloc
0: [A11:16723] *** Process received signal ***
0: [A11:16723] Signal: Aborted (6)
0: [A11:16723] Signal code:  (-6)
0: [A11:16723] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fa4221717c0]
0: [A11:16723] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fa421478b55]
0: [A11:16723] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fa42147a131]
0: [A11:16723] [ 3] /local/usr/local/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7fa421caf9a5]
0: [A11:16723] [ 4] /local/usr/local/lib64/libstdc++.so.6(+0x63b16)[0x7fa421cadb16]
0: [A11:16723] [ 5] /local/usr/local/lib64/libstdc++.so.6(+0x63b43)[0x7fa421cadb43]
0: [A11:16723] [ 6] /local/usr/local/lib64/libstdc++.so.6(+0x63d6e)[0x7fa421cadd6e]
0: [A11:16723] [ 7] /local/usr/local/lib64/libstdc++.so.6(_Znwm+0x7d)[0x7fa421cae26d]
0: [A11:16723] [ 8] /local/usr/local/lib64/libstdc++.so.6(_Znam+0x9)[0x7fa421cae309]
0: [A11:16723] [ 9] /home/frolo/grappa.master/build/Make+Release/./applications/graphlab/bfs.exe(_ZN6Grappa4impl18loop_decompositionILNS_8TaskModeE0EXadL_ZNS0_9local_gceEEELl0EZNS0_11forall_hereILS2_0ELNS_8SyncModeE1EXadL_ZNS0_9local_gceEEELl0EZZNS0_6forallILS2_0ELS4_0EXadL_ZNS0_9local_gceEEELl0ENS0_6VertexI13BFSVertexDataNS_5EmptyELb0EEEZNS0_6forallILS2_0ELS4_0EXadL_ZNS0_9local_gceEEELl0ES9_ZNS_5GraphIS7_S8_E6createERKNS_10TupleGraphEbbEUllRS9_E3_EEv13GlobalAddressIT3_ElT4_MSL_KFvlRSJ_EEUlllPS9_E_EEvSK_lSL_MSL_KFvllPSJ_EENKUlSP_mE_clESP_mEUlllE_EEvllSJ_MSJ_KFvllEEUlllE0_EEvllT2_+0x155)[0x491f25]
0: [A11:16723] [10] /home/frolo/grappa.master/build/Make+Release/./applications/graphlab/bfs.exe(_ZN6Grappa4impl18loop_decompositionILNS_8TaskModeE0EXadL_ZNS0_9local_gceEEELl0EZNS0_11forall_hereILS2_0ELNS_8SyncModeE1EXadL_ZNS0_9local_gceEEELl0EZZNS0_6forallILS2_0ELS4_0EXadL_ZNS0_9local_gceEEELl0ENS0_6VertexI13BFSVertexDataNS_5EmptyELb0EEEZNS0_6forallILS2_0ELS4_0EXadL_ZNS0_9local_gceEEELl0ES9_ZNS_5GraphIS7_S8_E6createERKNS_10TupleGraphEbbEUllRS9_E3_EEv13GlobalAddressIT3_ElT4_MSL_KFvlRSJ_EEUlllPS9_E_EEvSK_lSL_MSL_KFvllPSJ_EENKUlSP_mE_clESP_mEUlllE_EEvllSJ_MSJ_KFvllEEUlllE0_EEvllT2_+0xd2)[0x491ea2]
0: [A11:16723] [11] /home/frolo/grappa.master/build/Make+Release/./applications/graphlab/bfs.exe[0x47a02e]
0: [A11:16723] [12] /home/frolo/grappa.master/build/Make+Release/./applications/graphlab/bfs.exe(_ZN6Grappa4impl10workerLoopEPNS_6WorkerEPv+0x43)[0x4bb4a3]
0: [A11:16723] [13] /home/frolo/grappa.master/build/Make+Release/./applications/graphlab/bfs.exe[0x4b7c0e]
0: [A11:16723] [14] /home/frolo/grappa.master/build/Make+Release/./applications/graphlab/bfs.exe[0x4bcffc]
0: [A11:16723] *** End of error message ***
srun: error: A11: task 0: Aborted (core dumped)
srun: Terminating job step 415.0

I tried to get a backtrace from the core dump, but also without much success:

(gdb) bt
#0  0x00007fab11874b55 in raise () from /lib64/libc.so.6
Cannot access memory at address 0x400a4733ae88

What could trigger the bad_alloc? What is the best way to run Grappa under gdb (maybe call failure_function explicitly)?

alexfrolov commented 10 years ago

For ppn=8 the crash looks like this:

0: I0721 21:03:10.570602 20642 Graph.hpp:731] nadj_local = 261098057
3: I0721 21:03:10.570572 20645 Graph.hpp:731] nadj_local = 264137981
5: I0721 21:03:10.570574 20647 Graph.hpp:731] nadj_local = 262176931
6: I0721 21:03:10.570577 20648 Graph.hpp:731] nadj_local = 263015086
1: I0721 21:03:10.570606 20643 Graph.hpp:731] nadj_local = 263483124
2: I0721 21:03:10.570632 20644 Graph.hpp:731] nadj_local = 262566644
4: I0721 21:03:10.570639 20646 Graph.hpp:731] nadj_local = 263582068
7: I0721 21:03:10.570646 20649 Graph.hpp:731] nadj_local = 263782057
0: [A11:20642] *** Process received signal ***
0: [A11:20642] Signal: Bus error (7)
0: [A11:20642] Signal code: Non-existant physical address (2)
0: [A11:20642] Failing at address: 0x400f28944000
1: [A11:20643] *** Process received signal ***
1: [A11:20643] Signal: Bus error (7)
1: [A11:20643] Signal code: Non-existant physical address (2)
1: [A11:20643] Failing at address: 0x400bd25d8000
2: [A11:20644] *** Process received signal ***
2: [A11:20644] Signal: Bus error (7)
2: [A11:20644] Signal code: Non-existant physical address (2)
2: [A11:20644] Failing at address: 0x400d7a2b0000
3: [A11:20645] *** Process received signal ***
3: [A11:20645] Signal: Bus error (7)
3: [A11:20645] Signal code: Non-existant physical address (2)
3: [A11:20645] Failing at address: 0x400b552be000
4: [A11:20646] *** Process received signal ***
4: [A11:20646] Signal: Bus error (7)
4: [A11:20646] Signal code: Non-existant physical address (2)
4: [A11:20646] Failing at address: 0x400ea3f66000
5: [A11:20647] *** Process received signal ***
5: [A11:20647] Signal: Bus error (7)
5: [A11:20647] Signal code: Non-existant physical address (2)
5: [A11:20647] Failing at address: 0x400e1697f000
6: [A11:20648] *** Process received signal ***
6: [A11:20648] Signal: Bus error (7)
6: [A11:20648] Signal code: Non-existant physical address (2)
6: [A11:20648] Failing at address: 0x400cdd2bf000
7: [A11:20649] *** Process received signal ***
7: [A11:20649] Signal: Bus error (7)
7: [A11:20649] Signal code: Non-existant physical address (2)
7: [A11:20649] Failing at address: 0x400c50409000
nelsonje commented 10 years ago

For debugging: @bholt mentioned the GRAPPA_FREEZE_ON_ERROR environment variable; you can also set the environment variable GRAPPA_FREEZE=1 and Grappa will freeze unconditionally before it starts running user code. You can then attach GDB to each process in the job, clear the freeze_flag, and continue.
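For example, a sketch of that workflow (how the environment variable reaches the srun tasks depends on your cluster setup):

  GRAPPA_FREEZE=1 grappa_srun --nnode 1 --ppn 1 -- bfs.exe --path=graph.kronecker.scale-26.bintsv4 --format=bintsv4
  # each process prints a "... freezing for debugger ..." line with its node and pid
  ssh -t <node> gdb attach <pid>
  (gdb) set var freeze_flag = false    # the flag named in the freeze message
  (gdb) continue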

I'm not sure where this bad_alloc is coming from. If you run one of the mangled names in the backtrace through c++filt, does it produce anything more readable?
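For instance (c++filt ships with binutils; this particular symbol is one the gdb backtrace above already shows demangled):

  c++filt _ZN6Grappa4impl10workerLoopEPNS_6WorkerEPv
  # => Grappa::impl::workerLoop(Grappa::Worker*, void*)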

nelsonje commented 10 years ago

Your latest backtrace looks like a GlobalAddress is being dereferenced as a local address. It looks like these failures are all happening in the graph creation step. We'll see if we can recreate the issue, but if you can get a more specific backtrace it'd be super-useful.

alexfrolov commented 10 years ago

Yes, I understood @bholt, but that approach does not work in this case because Grappa is not frozen when the bad_alloc occurs. Demangling does not produce any meaningful names.

nelsonje commented 10 years ago

That comment was intended to point out the GRAPPA_FREEZE flag, which gives you an opportunity to attach GDB before any user code starts running. GDB may be able to give you a backtrace with the bad_alloc.

alexfrolov commented 10 years ago

Here is a backtrace after the bad_alloc:

Program received signal SIGABRT, Aborted.
0x00007f5a866b7b55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f5a866b7b55 in raise () from /lib64/libc.so.6
#1  0x00007f5a866b9131 in abort () from /lib64/libc.so.6
#2  0x00007f5a86eee9a5 in __gnu_cxx::__verbose_terminate_handler ()
    at /home/frolo/gcc-4.8.2/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007f5a86eecb16 in __cxxabiv1::__terminate (handler=<optimized out>)
    at /home/frolo/gcc-4.8.2/libstdc++-v3/libsupc++/eh_terminate.cc:38
#4  0x00007f5a86eecb43 in std::terminate ()
    at /home/frolo/gcc-4.8.2/libstdc++-v3/libsupc++/eh_terminate.cc:48
#5  0x00007f5a86eecd6e in __cxxabiv1::__cxa_throw (obj=0x2556e9bc0, 
    tinfo=<optimized out>, dest=<optimized out>)
    at /home/frolo/gcc-4.8.2/libstdc++-v3/libsupc++/eh_throw.cc:84
#6  0x00007f5a86eed26d in operator new (sz=1512)
    at /home/frolo/gcc-4.8.2/libstdc++-v3/libsupc++/new_op.cc:56
#7  0x00007f5a86eed309 in operator new[] (sz=<optimized out>)
    at /home/frolo/gcc-4.8.2/libstdc++-v3/libsupc++/new_opv.cc:32
#8  0x0000000000491f25 in operator() (v=..., i=<optimized out>, 
    __closure=0xaa1640)
    at /home/frolo/grappa.master/system/graph/Graph.hpp:691
#9  operator() (first=<optimized out>, niters=1024, start=<optimized out>, 
    __closure=0xaa1640)
    at /home/frolo/grappa.master/system/ParallelLoop.hpp:423
#10 operator() (n=1024, s=<optimized out>, __closure=0xaa1640)
---Type <return> to continue, or q <return> to quit--- 
    at /home/frolo/grappa.master/system/ParallelLoop.hpp:399
#11 operator() (n=1024, s=<optimized out>, __closure=<synthetic pointer>)
    at /home/frolo/grappa.master/system/ParallelLoop.hpp:151
#12 _ZN6Grappa4impl18loop_decompositionILNS_8TaskModeE0EXadL_ZNS0_9local_gceEEELl0EZNS0_11forall_hereILS2_0ELNS_8SyncModeE1EXadL_ZNS0_9local_gceEEELl0EZZNS0_6forallILS2_0ELS4_0EXadL_ZNS0_9local_gceEEELl0ENS0_6VertexI13BFSVertexDataNS_5EmptyELb0EEEZNS0_6forallILS2_0ELS4_0EXadL_ZNS0_9local_gceEEELl0ES9_ZNS_5GraphIS7_S8_E6createERKNS_10TupleGraphEbbEUllRS9_E3_EEv13GlobalAddressIT3_ElT4_MSL_KFvlRSJ_EEUlllPS9_E_EEvSK_lSL_MSL_KFvllPSJ_EENKUlSP_mE_clESP_mEUlllE_EEvllSJ_MSJ_KFvllEEUlllE0_EEvllT2_ (start=<optimized out>, iterations=iterations@entry=1024, 
    loop_body=...) at /home/frolo/grappa.master/system/ParallelLoop.hpp:86
#13 0x000000000047a02e in operator() (__closure=0x400a4b7ccfd0)
    at /home/frolo/grappa.master/system/ParallelLoop.hpp:99
#14 Grappa::impl::task_functor_proxy<Grappa::impl::loop_decomposition(int64_t, int64_t, F) [with Grappa::TaskMode B = (Grappa::TaskMode)0; Grappa::GlobalCompletionEvent* C = (& Grappa::impl::local_gce); long int Threshold = 0l; F = Grappa::impl::forall_here(int64_t, int64_t, F, void (F::*)(int64_t, int64_t)const) [with Grappa::TaskMode B = (Grappa::TaskMode)0; Grappa::SyncMode S = (Grappa::SyncMode)1; Grappa::GlobalCompletionEvent* C = (& Grappa::impl::local_gce); long int Threshold = 0l; F = Grappa::impl::forall(GlobalAddress<T>, int64_t, F, void (F::*)(int64_t, int64_t, T*)const) [with Grappa::TaskMode B = (Grappa::TaskMode)0; Grappa::SyncMode S = (Grappa::SyncMode)0; Grappa::GlobalCompletionEvent* G---Type <return> to continue, or q <return> to quit---
CE = (& Grappa::impl::local_gce); long int Threshold = 0l; T = Grappa::impl::Vertex<BFSVertexData, Grappa::Empty, false>; F = Grappa::impl::forall(GlobalAddress<T>, int64_t, F, void (F::*)(int64_t, T&)const) [with Grappa::TaskMode B = (Grappa::TaskMode)0; Grappa::SyncMode S = (Grappa::SyncMode)0; Grappa::GlobalCompletionEvent* GCE = (& Grappa::impl::local_gce); long int Threshold = 0l; T = Grappa::impl::Vertex<BFSVertexData, Grappa::Empty, false>; F = Grappa::Graph<V, E>::create(const Grappa::TupleGraph&, bool, bool) [with V = BFSVertexData; E = Grappa::Empty]::__lambda110; int64_t = long int]::__lambda74; int64_t = long int]::__lambda72::__lambda73; int64_t = long int]::__lambda64; int64_t = long int]::__lambda62>(uint64_t, uint64_t, uint64_t) (a0=<optimized out>, a1=1024, 
    a2=<optimized out>) at /home/frolo/grappa.master/system/Tasking.hpp:79
#15 0x00000000004bb4a3 in execute (this=0x400a4b7cd010)
    at /home/frolo/grappa.master/system/tasks/Task.hpp:82
#16 Grappa::impl::workerLoop (me=me@entry=0xaaf6a0, args=args@entry=0xa81bc0)
    at /home/frolo/grappa.master/system/tasks/TaskingScheduler.cpp:172
#17 0x00000000004b7c0e in Grappa::impl::tramp (me=0xaaf6a0, arg=0xa80680)
    at /home/frolo/grappa.master/system/Worker.hpp:226
#18 0x00000000004bcffc in _makestack ()
    at /home/frolo/grappa.master/system/stack.S:208
#19 0x0000000000000000 in ?? ()

It seems that operator new has thrown the bad_alloc exception, which could be caused by completely exhausting main memory. Please correct me if I am wrong here. However, scale 26 is not such a big graph, taking into account that there is 80 GB of main memory.

So the current shared memory breakdown is as follows:

0: Shared memory breakdown:
0:   locale shared heap total:     62.9407 GB
0:   locale shared heap per core:  62.9407 GB
0:   communicator per core:        0.125 GB
0:   tasks per core:               0.0161209 GB
0:   global heap per core:         40.9114 GB
0:   aggregator per core:          0.00843048 GB
0:   shared_pool current per core: 4.76837e-07 GB
0:   shared_pool max per core:     15.7352 GB
0:   free per locale:              21.8797 GB
0:   free per core:                21.8797 GB

The TupleGraph is 24 GB and it is allocated in the "global heap per core" (afaik). Where are the graph vertices allocated? What is the "free per locale" memory used for?

Another question: does the allocator (Allocator) support coalescing chunks, so that it can satisfy an allocation larger than the largest available chunk?

Another question 2: why did you need to implement your own allocator for the global heap? Is it more efficient than the one boost::interprocess segments use?

Another question 3: what is the principal difference between the "global heap" and the "locale shared heap"?

alexfrolov commented 10 years ago

I have just tried decreasing SHMMAX to 50 GB and got:

Allocation of 16830735584 bytes failed with 10303645536 free and 43383432896 allocated locally

It seems that a graph with scale 26 needs more than 80 GB of memory on a single node :-(

bholt commented 10 years ago

A scale-26 graph should be able to be constructed in that much memory, but it's tricky because there are several different competing memory pools.

Out of the "locale shared heap total", which is set by SHMMAX, we pre-allocate a fraction for the global heap, which we had to do in order to get the linear addressing right on all the nodes (or something). Then we also allocate all the tasks, communicator buffers, and shared_pools from that too, because those are all things that may have to be shared among cores in a node.

In Graph::create() we allocate the adjacency lists out of the locale-shared heap: here and here. It's possible that you're getting bitten by the double allocation we do in order to compact the graph.

alexfrolov commented 10 years ago

I have tried more combinations of SHMMAX and --global_heap_fraction without much success. It seems that SHMMAX cannot be greater than 50 GB because of the bad_alloc exception, while 50 GB is not enough for the global heap allocation requests (Allocator::malloc) plus the locale_alloc requests. Overall it seems that Graph::create is too memory-hungry, and maybe it makes sense to rewrite it so that less memory is duplicated (the adjacency lists).

bholt commented 10 years ago

Are you saying you think there's a hard limit somewhere that breaks with SHMMAX > 50 GB? Does it work better if you divide that among more cores (i.e. more than --ppn=1)?

Also, feel free to modify Graph::create() to suit your purposes. My suggestion would be to add an option to de-allocate the TupleGraph after creating the per-vertex edgelists, then allocate the compacted vertex and edge data.
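Roughly, as a sketch (the flag, the helper names, and the global_free call on tg.edges are all illustrative, not the current Graph/TupleGraph API):

  // inside Graph<V,E>::create(const TupleGraph& tg, ...)
  build_per_vertex_edge_lists(tg);            // locale-shared allocations, as today
  if (FLAGS_destroy_tuple_graph_early) {      // hypothetical new option
    Grappa::global_free(tg.edges);            // return the ~24 GB of edge tuples to the global heap
  }
  allocate_compacted_vertex_and_edge_data();  // then build the compact representation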

alexfrolov commented 10 years ago

I mean that Grappa needs approximately 20 GB besides what is allocated via SHMMAX (50 GB). A total of 70 GB seems to be the limit for virtual memory allocation, and the bad_alloc occurs in one of the operator new calls. My latest try:

0: -------------------------
0: Shared memory breakdown:
0:   locale shared heap total:     55 GB
0:   locale shared heap per core:  55 GB
0:   communicator per core:        0.125 GB
0:   tasks per core:               0.0161209 GB
0:   global heap per core:         36.575 GB
0:   aggregator per core:          0.00843048 GB
0:   shared_pool current per core: 4.76837e-07 GB
0:   shared_pool max per core:     2.75 GB
0:   free per locale:              18.2754 GB
0:   free per core:                18.2754 GB
0: -------------------------

... failed with bad_alloc. My observations: 36.5 GB is the required minimum for the global heap and 18 GB is the minimum for locale_alloc.

Also, feel free to modify Graph::create() to suit your purposes. My suggestion would be to add an option to de-allocate the TupleGraph after creating the per-vertex edgelists, then allocate the compacted vertex and edge data.

This is what I would like to suggest, but it is not enough, because the TupleGraph is allocated in the "global shared heap". When it is destroyed, that memory is still unreachable for locale_alloc (used for the compact vertex and edge data) or for new, so the problem is still there. One needs more flexibility in distributing the shared memory between the allocator and locale_alloc.

nelsonje commented 10 years ago

The current global/local shared memory design is an artifact of a time when we expected to do fine-grained allocations and use RDMA and huge pages extensively. Since these aren't true any more, we're planning to make a few changes in the future:

  • change messaging layer to remove requirement of allocating messages from locale shared heap
  • move to a multithreaded model on nodes to allow sharing without using POSIX shared memory
  • add global array allocation with flexible distributions allocated outside of the current global shared heap.

This should give us a lot more flexibility in memory allocation.

alexfrolov commented 10 years ago

I'm just interested: what's wrong with "RDMA and fine-grained allocation with huge pages"? Is it a question of performance?

add global array allocation with flexible distributions allocated outside of the current global shared heap.

If I understood you right, will there be an additional allocator used for global arrays?

alexfrolov commented 10 years ago

It seems that I have another issue in Graph::create(). When I run the (graphlab) bfs with nnode=2 and ppn=4 on a graph with scale=25, Grappa hangs here.

I suspect that it may be somehow related to these warnings:

5: W0728 17:19:05.567507  8675 ChunkAllocator.cpp:201] Shared pool size 2952792960 exceeded max size 2952790016
3: W0728 17:19:05.921279 12024 ChunkAllocator.cpp:201] Shared pool size 2952792960 exceeded max size 2952790016
2: W0728 17:19:07.170680 12023 ChunkAllocator.cpp:201] Shared pool size 2952792960 exceeded max size 2952790016
6: W0728 17:19:07.604207  8676 ChunkAllocator.cpp:201] Shared pool size 2952792960 exceeded max size 2952790016
7: W0728 17:19:08.730379  8677 ChunkAllocator.cpp:201] Shared pool size 2952792960 exceeded max size 2952790016
0: W0728 17:19:09.364325 12021 ChunkAllocator.cpp:201] Shared pool size 2952792832 exceeded max size 2952790016
4: W0728 17:19:36.774658  8674 ChunkAllocator.cpp:201] Shared pool size 2952792960 exceeded max size 2952790016
bholt commented 10 years ago

As suggested by those messages, that has to do with the amount of memory the shared message pool has been allowed to allocate. Those messages don't actually mean it's failed, just that it's growing the size of the pool used for heap-allocated messages -- we often get those the first time we do a bunch of parallel work.

In the past we've had some problems with the shared pool causing the locale-shared allocator to freak out. If that's the case, flags shared_pool_max_size and shared_pool_memory_fraction can be adjusted to try to fix that.
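For example (illustrative values only; I'm assuming shared_pool_max_size takes a byte count, matching the "max size" number in the warning above):

  bfs.exe ... --shared_pool_memory_fraction=0.3 --shared_pool_max_size=4294967296 ...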