npadmana opened this issue 5 years ago
I think these graphs from #27 are worth celebrating!
Agreed, I'm really happy with these results, especially given how clean the code looks. I'll do a multi-trial run and commit the results/graphs on Monday.
And I think I'm really, really done with this for now.
:) I keep telling myself that too
Next issue - get the full simulation running on 18000 cores (if you'll lend me a few!).
Yeah, of course, let me know when.
As far as I know, the only other important thing on my plate for this is to talk to the perf team about getting a version of NPB-FT into our perf suite (https://github.com/npadmana/DistributedFFT/issues/42), and then optimizing so we don't need the usePrimitiveComm stuff. Let me know if there's anything else I'm forgetting.
I think those are the big things. I'll note that I don't really think the copy step is so far away from user code -- I think the user will often know more about the data than the distribution/compiler, and so, we want to allow optimizations like that one. The only ugly bit was the call into a non-user-facing primitive, which I hope will be exposed soon. But yes, the more you close the gap, the better it'll be for everyone.
Some other little gnats still left -- some might be easy....
I think those are the big things. I'll note that I don't really think the copy step is so far away from user code -- I think the user will often know more about the data than the distribution/compiler, and so, we want to allow optimizations like that one. The only ugly bit was the call into a non-user-facing primitive, which I hope will be exposed soon. But yes, the more you close the gap, the better it'll be for everyone.
Yeah, that's a good point. So maybe the 2 steps there are for us to add the copy wrappers (https://github.com/chapel-lang/chapel/issues/13052) and to optimize the array-slice versions as much as we can to close the gap (with the understanding that copy may still perform better for cases like this).
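To make the contrast concrete, here's a minimal sketch, not code from this repo: the array names, sizes, and the older `Block`/`dmapped` spelling (current Chapel uses `blockDist`) are just illustrative of the two styles being compared.

```chapel
// Illustrative only: a Block-distributed 2D array and a slice-to-slice
// assignment -- the concise style whose speed hinges on the compiler/runtime
// turning it into bulk transfers rather than element-wise communication.
use BlockDist;

config const n = 8;

const Space = {0..n-1, 0..n-1};
const D = Space dmapped Block(boundingBox=Space);
var A, B: [D] real;

A = 1.0;

// Copy rows n/2..n-1 of A into rows 0..n/2-1 of B (conforming shapes).
B[0..n/2-1, ..] = A[n/2..n-1, ..];

// The alternative discussed above: the user knows these slices are
// contiguous pencils on known locales, so a user-facing bulk-copy wrapper
// could move each one with a single put/get. Today that means calling a
// non-user-facing primitive (the usePrimitiveComm path), which is the
// "ugly bit" the copy wrappers would clean up.
```

Closing the gap means making the first form perform like the explicit copy without the user having to drop down a level.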
Close out #41 just to make sure I don't have a bug.
Responded on that thread.
Why my swan runs never seemed to catch up with your crystal runs (#39).
Yeah, will look at that.
(#25) Why were we getting odd behavior with GCC? Maybe we want to run timings with gcc and with LLVM too.
Added some more information to that thread.
At an earlier point, we had explored the idea of temporarily disabling affinity to interact with OpenMP. That's a Chapel issue, but I don't know if we ever captured it.
https://github.com/chapel-lang/chapel/issues/9882
I haven't run this on our local cluster yet (and I don't have enough nodes to do a real stress test), but it would be interesting to see how the code does under GASNet (for a Cray, but also for InfiniBand).
Forked off to https://github.com/npadmana/DistributedFFT/issues/49
I am tempted to add the UPC benchmark for the final plots.
Yeah, agreed. Could you add repro instructions to https://github.com/npadmana/DistributedFFT/issues/33, and I'll gather UPC and MPI reference timings in my next full run.
FYI I opened https://github.com/npadmana/DistributedFFT/issues/50 to organize these TODOs
Updated performance numbers from https://github.com/npadmana/DistributedFFT/pull/57 are even better, so it's gotta be 2 beers at this point :)
More seriously, I think we can close this. Full SIM numbers have been gathered and other side-topics have their own issues.
Tagging @ronawho...
I think these graphs from #27 are worth celebrating!
Next CHIUW, unless you want to visit New Haven.