thomasrolinger / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Running Experiments on zaratan #46

Closed thomasrolinger closed 1 year ago

thomasrolinger commented 1 year ago

We have access to a set of nodes, each with 128 cores and 512 GB of memory, connected via HDR-100 InfiniBand.

There are two issues I am having right now:

Poor performance with all 128 cores

When running NAS-CG and BFS with all 128 cores on a node, performance is much worse than with fewer cores. I'm not sure whether that is due to an underlying Chapel issue or to my particular applications doing lots of fine-grained communication. Performance seems best with 16 or 20 cores, which feels like a waste given all the cores we could be using. So I can't run any large experiments until I figure out whether we can use all the cores effectively, or, if not, how many cores to use instead.
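As a sanity check while sweeping core counts, the parallelism the Chapel runtime actually sees on each node can be printed directly; a minimal sketch:

```chapel
// Print how many parallel tasks Chapel will use on each locale.
// here.maxTaskPar reflects the core count the runtime detected; it can
// be capped with the CHPL_RT_NUM_THREADS_PER_LOCALE environment
// variable when experimenting with fewer cores per node.
writeln("number of locales: ", numLocales);
for loc in Locales do on loc do
  writeln(loc.name, ": maxTaskPar = ", here.maxTaskPar);
```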

Graph Generator crashes

When trying to run applications that generate graphs, we seg fault when using more than one locale. I can't pinpoint what's going on, but it is preventing us from testing more applications (for example, seeing whether the issue mentioned above also occurs for BFS). I can make the crash go away by adding print statements or by changing a coforall to just a for, so there is definitely some memory corruption/bad things happening. There are two workarounds: one that should definitely work but is a pain, and one that may work and would be better.
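For reference, the serialize-the-loop debugging trick looks like this (the loop body here is made up; the real generator is far more involved):

```chapel
// If the parallel version seg faults but the serialized one doesn't,
// a data race or memory corruption in the loop body is the likely cause.
var edges: [0..<8] (int, int);

// coforall i in 0..<8 do    // parallel version: crashes on >1 locale
for i in 0..<8 do            // serialized debugging workaround
  edges[i] = (i, i + 1);

writeln(edges);
```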

One thing we can do is generate the graphs on a machine that doesn't crash and write them out to files, much like we do for PageRank. We then transfer those files to zaratan and read them into any application that needs them. The drawback is that reading in a file will most likely take longer than just generating the graph. Since we have a limited time allocation on zaratan, this is disappointing. But it should work.
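A minimal sketch of that first workaround, assuming a plain-text edge-list format (writeEdges, readEdges, and the file name are hypothetical names for illustration):

```chapel
use IO;

// Write an edge list out as "numEdges" followed by one "src dst" pair
// per line, so it can be loaded on zaratan without regenerating it.
proc writeEdges(path: string, edges: [] (int, int)) throws {
  var w = open(path, ioMode.cw).writer();
  w.writeln(edges.size);
  for (src, dst) in edges do w.writeln(src, " ", dst);
}

// Read the same format back in.
proc readEdges(path: string) throws {
  var r = open(path, ioMode.r).reader();
  const n = r.read(int);
  var edges: [0..<n] (int, int);
  for e in edges do e = (r.read(int), r.read(int));
  return edges;
}

writeEdges("graph.txt", [(0, 1), (1, 2), (2, 0)]);
writeln(readEdges("graph.txt"));
```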

The other approach is to modify the graph generator so that less of it is in C, as I think that is the issue. Ideally, the C code would only be used to generate the edge tuples, and everything else would be in Chapel. This may or may not work, since I haven't thought about it much. But if it does, it avoids the file I/O of the first approach.
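The split might look something like this (graph_gen.h and genEdge are hypothetical names; the real generator's interface will differ):

```chapel
use CTypes;

// Keep only per-edge generation in C; array allocation, parallelism,
// and distribution all stay in Chapel.
require "graph_gen.h";
extern proc genEdge(seed: c_int, i: c_int, ref src: c_int, ref dst: c_int);

proc generateEdges(numEdges: int, seed: int) {
  var edges: [0..<numEdges] (int, int);
  forall i in 0..<numEdges {
    var s, d: c_int;
    genEdge(seed: c_int, i: c_int, s, d);
    edges[i] = (s: int, d: int);
  }
  return edges;
}
```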

thomasrolinger commented 1 year ago

Fixed the issue with the graph generator: ported more code to Chapel and fixed an off by one error.

thomasrolinger commented 1 year ago

The poor performance with 128 cores is actually expected: under GASNet on InfiniBand, communication injection is basically serialized:

“Communication injection is serialized by GASNet on InfiniBand. Think of communication initiation as being wrapped with a global lock. This kills fine-grain performance on InfiniBand, so more cores can actually degrade performance. This is a known issue, but hasn't been a priority since large messages aren't really impacted, since the initiation is a small part of the total comm time (this was also the initial motivation behind adding aggregation, though that's also helpful when the network itself doesn't have fast fine-grained comm).”
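Since the quote mentions aggregation, here is a sketch of what aggregated fine-grained writes look like (an assumption: a recent Chapel with the CopyAggregation package module and blockDist; the shifted index just forces remote traffic):

```chapel
use BlockDist, CopyAggregation;

// Each task buffers its remote stores in a DstAggregator and flushes
// them in batches, so far fewer operations are injected per core;
// exactly the case the serialized-injection bottleneck punishes.
const D = blockDist.createDomain(0..<1_000_000);
var dst: [D] int;
const src: [D] int = 1;

forall i in D with (var agg = new DstAggregator(int)) do
  agg.copy(dst[(i + D.size/2) % D.size], src[i]);
```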

Knowing this, we will use only 32 cores per node.

thomasrolinger commented 1 year ago

Closing this since we are currently running the experiments: 32 cores per node and 128 GB of memory per node.