uwsampa / grappa

Grappa: scaling irregular applications on commodity clusters
grappa.io
BSD 3-Clause "New" or "Revised" License

Running BFS with big(ger) graphs #283

Open stnot opened 8 years ago

stnot commented 8 years ago

I tried starting Grappa's BFS implementation on 4 nodes, each equipped with 64 GB of RAM and one Intel Xeon CPU with 6 cores (two threads per core). I used Slurm to start the job and want to load a scale 26 graph generated by the graph500 graph generator (total file size ~16 GB). Running it with "srun --nodes=4 --ntasks-per-node=12 -- ./bfs_beamer.exe --path=/some/path/rmat_26_16" throws the following error: "Out of memory in the global heap: couldn't find a chunk of size 34359738368 to hold an allocation of 34359738368 bytes. Can you increase --global_heap_fraction?" I changed --global_heap_fraction to various values (0.1 to 1.0), but none of them worked. I found your notes on setting the memory region sizes (https://github.com/uwsampa/grappa/wiki/Setting-memory-region-sizes), but they didn't help me figure out a working combination of settings. Are there any other settings I have to change, or have I hit a limit with the given setup? For evaluation purposes, I have to find out the biggest synthetic graph I can load on the given nodes using Grappa.

Full output of the srun command without --global_heap_fraction option set: http://pastebin.com/6GrCbPdN

nelsonje commented 8 years ago

Try setting --locale_shared_fraction to 0.85 or so. This determines how much memory is available for the global heap (and other specialty allocators), and defaults to 0.5. It probably can't be larger than 0.9, since some memory must be reserved for the non-Grappa-controlled heap as well. (Looks like we need to update that wiki page to match the current state of the code.)

I'll bet it will work if you just set that flag; if you still have problems you can increase --global_heap_fraction too. You probably don't want to go above 0.5 for that one.
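
Back-of-the-envelope, for your 64 GB nodes (assuming the global heap is aggregated across all four locales, and ignoring rounding and per-core overheads, so treat these as rough numbers):

64 GB per node × 0.85 (--locale_shared_fraction) ≈ 54 GB locale shared heap per node
54 GB × 0.5 (--global_heap_fraction) ≈ 27 GB global heap per node
27 GB × 4 nodes ≈ 108 GB global heap total

which is comfortably more than the 32 GiB (34359738368-byte) allocation that failed in your log.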

stnot commented 8 years ago

I ran the application with those parameters and was able to execute the BFS implementation on a scale 2^23 graph. However, a scale 2^24 graph seems to crash Grappa: http://pastebin.com/MgkheB8J

Executed with: srun --nodes=4 --ntasks-per-node=12 -- ./bfs_beamer.exe --path=/root/kron_24_16.bedges --locale_shared_fraction 0.85 --global_heap_fraction 0.5

Same setup; the 2^24 graph was generated using graph500's Kronecker generator (latest version from GitHub). I tried different values for --locale_shared_fraction and --global_heap_fraction, but as you already indicated, this causes issues with other things not having enough memory. Is there anything else I have to consider? Any further parameters to tweak?

bholt commented 8 years ago

The strange thing is that for some reason it's not dividing up the memory among the cores on the node — there's 53 GB total for all the cores, but then it's not recognizing that you're running with --ntasks-per-node=12. It should be dividing that memory up among the 12 cores. So my guess would be that as soon as it tries to use some of that extra memory that doesn't exist, it chokes. There's some logic in grappa that should be querying MPI for the cores-per-node and all that, and apparently it's not working correctly in your case. Probably @nelsonje has a better idea what's up, but you could try looking for that code in Communicator.cpp (I think) and see if there's something different about your MPI setup.
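
As a quick sanity check on the Slurm side only (this just confirms where Slurm places tasks, not what MPI reports back to Grappa), something like

srun --nodes=4 --ntasks-per-node=12 hostname | sort | uniq -c

should show 12 tasks on each of your 4 hostnames.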

bmyerz commented 8 years ago

@stnot What is the exact version of MPI you are using? @bholt's comment reminded me that some versions of OpenMPI, for example, had only partial MPI 3 support, which didn't include the API we use for getting cores-per-node.

bmyerz commented 8 years ago

https://github.com/uwsampa/grappa/blob/master/system/Communicator.cpp#L141

This may be the line, and I have a workaround somewhere, although if this is in fact your problem I'd suggest trying a newer MPI first.
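
If you're not sure which MPI you currently have, something like the following should tell you (ompi_info is Open MPI specific; mpirun --version usually works for other MPIs too):

mpirun --version
ompi_info | head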

nelsonje commented 8 years ago

Ah, I didn't notice that in the log. It's more likely to be an srun/MPI mismatch than the problem @bmyerz suggests. You may already use MPI on this machine often enough to know the answer, but if not, try

make demo-hello_world
srun --nodes=2 --ntasks-per-node=3 applications/demos/hello_world.exe

If you see output like

I0601 10:50:47.035892 16751 hello_world.cpp:45] Hello world from locale 1 core 3
I0601 10:50:47.036644 29662 hello_world.cpp:45] Hello world from locale 0 core 0
I0601 10:50:47.035893 16752 hello_world.cpp:45] Hello world from locale 1 core 4
I0601 10:50:47.035895 16753 hello_world.cpp:45] Hello world from locale 1 core 5
I0601 10:50:47.036659 29663 hello_world.cpp:45] Hello world from locale 0 core 1
I0601 10:50:47.036648 29664 hello_world.cpp:45] Hello world from locale 0 core 2

life is good and we'll have to look elsewhere. If you see output like:

I0601 10:50:47.035892 16751 hello_world.cpp:45] Hello world from locale 0 core 0
I0601 10:50:47.036644 29662 hello_world.cpp:45] Hello world from locale 0 core 0
I0601 10:50:47.035893 16752 hello_world.cpp:45] Hello world from locale 0 core 0
I0601 10:50:47.035895 16753 hello_world.cpp:45] Hello world from locale 0 core 0
I0601 10:50:47.036659 29663 hello_world.cpp:45] Hello world from locale 0 core 0
I0601 10:50:47.036648 29664 hello_world.cpp:45] Hello world from locale 0 core 0

then your srun and MPI installation are not configured to communicate correctly, and you'll need to run with a command like salloc --nodes=2 --ntasks-per-node=3 -- mpirun applications/demos/hello_world.exe. (You may need something other than mpirun.)

nelsonje commented 8 years ago

Now that I've properly looked at the full log you provided, it looks very likely that this is the problem. You should only see one of those memory breakdowns per job; the fact that you're seeing multiple ones suggests that each process doesn't know about the rest and is trying to solve the problem independently.

stnot commented 8 years ago

Thank you for the detailed replies and help so far. I am using OpenMPI 1.10.2, the current stable release from their website. Running make demo-hello_world followed by srun --nodes=2 --ntasks-per-node=3 applications/demos/hello_world.exe printed the "bad" output, so I ran it with salloc --nodes=2 --ntasks-per-node=3 -- mpirun applications/demos/hello_world.exe and it looks fine now. I adapted this to bfs_beamer.exe and started it using salloc --nodes=4 --ntasks-per-node=12 -- mpirun --allow-run-as-root ./bfs_beamer.exe --path=/root/kron_24_16.bedges --locale_shared_fraction=0.85 --global_heap_fraction=0.5. The memory errors from before are gone, but the warning "Shared pool size 1196790400 exceeded max size 1196783752" is printed multiple times, and the application neither starts the actual BFS algorithm nor terminates. I left it running for approx. 30 minutes, and I don't think it should take that long, given that the scale 2^23 graph works fine and finishes after a few minutes. Here is the full output up to manual termination: http://pastebin.com/K5PCxEgA

bmyerz commented 8 years ago

The warning you are seeing indicates that the messaging system is running out of space for pending messages.

You can provide the option --shared_pool_memory_fraction=<fraction> to change the relative size of this pool of memory for messages.

If you see the warning repeated by many cores during a run (as in your output), I would suggest killing the job, because it will probably go far too slowly.
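
For example, appending the flag to the launch line you used earlier would look roughly like this (0.5 here is just a starting point to experiment with, not a tuned value):

salloc --nodes=4 --ntasks-per-node=12 -- mpirun --allow-run-as-root ./bfs_beamer.exe --path=/root/kron_24_16.bedges --locale_shared_fraction=0.85 --global_heap_fraction=0.5 --shared_pool_memory_fraction=0.5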

stnot commented 8 years ago

I added --shared_pool_memory_fraction=0.5 and the scale 2^24 graph is working fine now. However, increasing the graph size further throws new errors at scale 2^26:

[node65:02847] *** Process received signal ***
[node65:02847] Signal: Bus error (7)
[node65:02847] Signal code: Non-existant physical address (2)
[node65:02847] Failing at address: 0x400955793c3f

Again, I tried different combinations of the memory parameters, but I always get this when trying a scale 2^26 graph. Running with salloc --nodes=4 --ntasks-per-node=12 -- mpirun --allow-run-as-root ./bfs_beamer.exe --path=/root/rmat_26_16_eq.bedges --locale_shared_fraction=0.85 --shared_pool_memory_fraction=0.5, here is the full output: http://pastebin.com/wwBBRMWr

bmyerz commented 8 years ago

I think a backtrace is needed for more information. Run your program with the environment variable GRAPPA_FREEZE_ON_ERROR=1. If it freezes successfully on the bus error, you can then attach gdb to one of the Grappa processes that crashed.

https://github.com/uwsampa/grappa/blob/master/doc/debugging.md
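
Roughly, the workflow is something like the following (the PID is a placeholder, and the -x flag is an assumption on my part: with Open MPI, mpirun -x forwards an environment variable to the launched processes; see the debugging doc above for the authoritative steps):

export GRAPPA_FREEZE_ON_ERROR=1
salloc --nodes=4 --ntasks-per-node=12 -- mpirun -x GRAPPA_FREEZE_ON_ERROR --allow-run-as-root ./bfs_beamer.exe --path=/root/rmat_26_16_eq.bedges --locale_shared_fraction=0.85 --shared_pool_memory_fraction=0.5

Then, on the node that reported the bus error, attach to the frozen process and grab a backtrace:

gdb -p <pid of the frozen bfs_beamer.exe process>
(gdb) bt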

stnot commented 8 years ago

I just tried this (export GRAPPA_FREEZE_ON_ERROR=1 before execution) with both the release and the debug build, but Grappa exits right after throwing the bus error. After that, I set the freeze_flag to true in Grappa.cpp (right after the two if blocks with the environment variable checks), but that doesn't work either.

nelsonje commented 8 years ago

Just setting the environment variable should be enough. It didn't work in this case because you're getting a bus error, and the code in master doesn't catch that signal yet. You can pull the appropriate commit in from a dev branch with this command:

git cherry-pick c97d32f28d8d96036cea7c9185a424b0ee2a27e9

and try again.

(It's probably going to show that we're running out of memory in one of the memory pools.)

stnot commented 8 years ago

That's working, thanks. Here is a backtrace from one of the Grappa processes in the frozen state: http://pastebin.com/q5tEfCQ4

guowentian commented 7 years ago

Hi, I came across a similar issue when running PageRank on a large graph (508k edges and 75k nodes, which is actually not very big). I don't understand the info printed in the output:

Shared memory breakdown:
  node total: 31.3127 GB
  locale shared heap total: 15.6564 GB
  locale shared heap per core: 15.6564 GB
  communicator per core: 0.125 GB
  tasks per core: 0.0156631 GB
  global heap per core: 9.39381 GB
  aggregator per core: 0.565323 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core: 3.91409 GB
  free per locale: 5.55655 GB
  free per core: 5.55655 GB

Can you explain what each of these means? For example, no matter how many nodes I use, the node total is still 31 GB (note that each of my machines has 32 GB of memory); that is the parameter I am most confused about. And I still don't understand the remaining parameters after reading https://github.com/uwsampa/grappa/wiki/Setting-memory-region-sizes .

shinyehtsai commented 7 years ago

I also ran into the same issue when I ran PageRank:

Allocator.hpp:226] Out of memory in the global heap: couldn't find a chunk of size 68719476736 to hold an allocation of 36401294848 bytes. Can you increase --global_heap_fraction?

My command is ./pagerank.exe --path

My graph only contains #vertices: 81306, #edges: 1768135.

Shared memory breakdown:
  node total: 125.689 GB
  locale shared heap total: 62.8444 GB
  locale shared heap per core: 62.8444 GB
  communicator per core: 0.125 GB
  tasks per core: 0.0156631 GB
  global heap per core: 15.7111 GB
  aggregator per core: 0.00247955 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core: 6.28444 GB
  free per locale: 46.9901 GB
  free per core: 46.9901 GB

I did try to set

  1. --shared_pool_memory_fraction to 0.5
  2. --global_heap_fraction to 0.85
  3. --locale_shared_fraction to 0.85

but none of them helped. Do you have any ideas for fixing this?

shinyehtsai commented 7 years ago

I successfully ran the dataset (#vertices: 81306, #edges: 1768135) with --global_heap_fraction 0.85 --locale_shared_fraction 0.85 (setting both instead of one at a time). Hope this helps others. (Update 01/05/17) With these two parameters, I could complete an 11 GB dataset.
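
For reference, a full launch line in the style used earlier in this thread would look roughly like the following (node counts, task counts, and the graph path are placeholders for your own setup):

salloc --nodes=<N> --ntasks-per-node=<cores per node> -- mpirun ./pagerank.exe --path=<graph file> --global_heap_fraction=0.85 --locale_shared_fraction=0.85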