npadmana / DistributedFFT


UPC code added in for reference #55

Closed npadmana closed 4 years ago

npadmana commented 4 years ago

@ronawho -- here is the UPC code.

Some notes are in notes.md with the code, plus #33 has a few useful notes. Let me know if you have any questions.

ronawho commented 4 years ago

Do you remember how you ran on swan?

I was able to run on swan and crystal, but I see verification failures. Here's an example run:

```bash
export CORES_PER_NODE=44 && export NODES=16 && qsub -V -l walltime=01:00:00 -l place=scatter,select=$NODES -I

module unload $(module list -t 2>&1 | grep PrgEnv-)
module load PrgEnv-cray
module load cray-fftw

git clone git@github.com:npadmana/DistributedFFT.git --branch=upc
cd DistributedFFT/runs/upc/
# add -O3 to Makefile CFLAGS/UPCFLAGS
make upc-bench CLASS=DD

export XT_SYMMETRIC_HEAP_SIZE=512M
aprun -n $(($CORES_PER_NODE * $NODES)) $PWD/ft-2d-upc.fftw3.DD $CORES_PER_NODE $NODES
```

```
 0>     Result verification failed: CHECKSUMS DIDN'T MATCH
Total running time is 25.549740 s
```

npadmana commented 4 years ago

Yep

```bash
export XT_SYMMETRIC_HEAP_SIZE=512M
aprun -n 512 -N 32 ./ft-2d-upc.fftw3.DD 16 32
```

In general:

```bash
aprun -n <nx*ny> -N <jobs per node> ./ft-2d-upc.fftw3.<CLASS> <nx> <ny>
```

I think you need to keep nx and ny powers of two (and I keep the jobs per node at a power of two as well). I also try to keep nx and ny as close to equal as possible (i.e., within a factor of 2).

I just checked and this ran just fine.
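
To make those constraints concrete, here is a small, hypothetical pre-flight check in plain C. It is not part of the benchmark, and every name in it is mine; it just flags the kind of decomposition that trips the verification:

```c
/* Hypothetical pre-flight check for the decomposition rules described above.
 * Not part of the benchmark; all names here are illustrative. */
#include <stdio.h>
#include <stdlib.h>

static int is_pow2(long v) { return v > 0 && (v & (v - 1)) == 0; }

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <nx> <ny> <total_ranks>\n", argv[0]);
        return 1;
    }
    long nx = atol(argv[1]), ny = atol(argv[2]), ranks = atol(argv[3]);

    if (!is_pow2(nx) || !is_pow2(ny))
        fprintf(stderr, "warning: nx and ny should both be powers of two\n");
    if (nx * ny != ranks)
        fprintf(stderr, "warning: aprun -n should be nx*ny = %ld\n", nx * ny);
    if (nx > 2 * ny || ny > 2 * nx)
        fprintf(stderr, "warning: keep nx and ny within a factor of two\n");

    /* echo the launch line in the template form used above */
    printf("aprun -n %ld -N <jobs per node> ./ft-2d-upc.fftw3.<CLASS> %ld %ld\n",
           nx * ny, nx, ny);
    return 0;
}
```

For instance, the failing run earlier in the thread passed 44 (CORES_PER_NODE on swan) as nx, which is not a power of two, so a check like this would have flagged it.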

ronawho commented 4 years ago

Ah, of course -- powers of 2 again.

ronawho commented 4 years ago

Ok, so CCE classic gave me the best performance. Something like:

```bash
export CORES_PER_NODE=32 && export NODES=16 && qsub -V -l walltime=01:00:00 -l place=scatter,select=$NODES -I

module unload $(module list -t 2>&1 | grep PrgEnv-)
module load PrgEnv-cray
module swap cce cce/9.0.2-classic
module load cray-fftw

git clone git@github.com:npadmana/DistributedFFT.git --branch=upc
cd DistributedFFT/runs/upc/
# Apply diff below to fix NB ops and timer bug
make upc-bench CLASS=DD

export XT_SYMMETRIC_HEAP_SIZE=1024M
aprun -n $(($CORES_PER_NODE * $NODES)) -N $CORES_PER_NODE $PWD/ft-2d-upc.fftw3.DD $CORES_PER_NODE $NODES
```

Diff to fix NB ops and timer bug:

```diff
diff --git a/runs/upc/fft3d.uph b/runs/upc/fft3d.uph
index d3515ef..d7c9d63 100644
--- a/runs/upc/fft3d.uph
+++ b/runs/upc/fft3d.uph
@@ -9,10 +9,10 @@
 /* The nonblocking functions are defined in upc_nb.h in spec 3.1 */
 /* If your implementation does not agree with spec 3.1, Here is the place you need to modify */
-/* #include */
+#include
 /* For example, following two lines are needed to run Cray UPC on hopper */
-#include
-#define upc_sync upc_sync_nb
+//#include
+//#define upc_sync upc_sync_nb
 
 #define DO_PUT(HANDLE, DST_PTR, SRC_PTR, NBYTES) do { \
     (HANDLE)->comm_handles[(HANDLE)->comm_handle_idx] = upc_memput_nb(DST_PTR, SRC_PTR, NBYTES); \
diff --git a/runs/upc/timers.upc b/runs/upc/timers.upc
index 4bff6a5..693ca65 100644
--- a/runs/upc/timers.upc
+++ b/runs/upc/timers.upc
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 
 static char *FFTimers_descr[T_NUMTIMERS] = {TIMER_STR_NAMES};
 typedef uint64_t ft_timer_t;
@@ -56,7 +57,7 @@ void timer_clear()
 uint64_t timer_val(int tid)
 {
 #if defined(__UPC_TICK__)
-  return (uint64_t) upc_ticks_to_ns(FTTimers_total[tid]) * 1000;
+  return (uint64_t) upc_ticks_to_ns(FTTimers_total[tid]) / 1000;
 #else
   return FTTimers_total[tid];
 #endif
```
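
For what it's worth, here is a minimal standalone illustration of the timer part of that diff. upc_ticks_to_ns() reports nanoseconds, and my reading of the surrounding code is that timer_val() is expected to return microseconds, so the conversion should divide by 1000 rather than multiply (the original was off by a factor of 10^6). The snippet below is plain C with a made-up elapsed time, not code from the benchmark:

```c
/* Illustration of the ns -> us conversion fixed in timers.upc above.
 * The elapsed value is invented; nothing here comes from the benchmark. */
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t elapsed_ns = 25549740000ULL;    /* roughly 25.5 seconds, in nanoseconds */

    uint64_t buggy_us = elapsed_ns * 1000;   /* old code: inflates the value by 1e6 */
    uint64_t fixed_us = elapsed_ns / 1000;   /* new code: nanoseconds -> microseconds */

    printf("buggy: %" PRIu64 " \"us\"\n", buggy_us);
    printf("fixed: %" PRIu64 " us\n", fixed_us);
    return 0;
}
```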


At scale, timings are competitive with our optimized code. This is actually reassuring since I think it backs our understanding that overlapping comm/compute is what allowed us to beat the MPI version. I'm doing full timings and will add those soon. I will note that the UPC version starts to lose at high scales for size D.
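
For readers following along, the overlap in question is what the benchmark's DO_PUT macro (visible in the diff above) enables: it issues upc_memput_nb() and completes the handle later, so communication for one pencil can proceed while the next one is being transformed. The sketch below is my simplified, single-outstanding-put version of that idea, assuming the UPC 1.3 non-blocking library (upc_nb.h); compute_pencil() and the buffers are placeholders, not the benchmark's actual routines:

```c
/* Simplified sketch of comm/compute overlap with UPC non-blocking puts.
 * Assumes the UPC 1.3 <upc_nb.h> interface; all other names are placeholders. */
#include <stddef.h>
#include <upc.h>
#include <upc_nb.h>

void compute_pencil(int i);             /* placeholder: local FFT work on pencil i */
extern shared [] char *remote_buf;      /* placeholder: destination on a remote thread */
extern char *local_buf;                 /* placeholder: locally computed pencils */

void exchange_with_overlap(int npencils, size_t nbytes)
{
    compute_pencil(0);                  /* nothing to overlap the first pencil with */
    for (int i = 0; i < npencils; i++) {
        /* start sending pencil i without waiting for it to complete */
        upc_handle_t h = upc_memput_nb(remote_buf + (size_t)i * nbytes,
                                       local_buf  + (size_t)i * nbytes, nbytes);
        /* overlap: transform the next pencil while the put is in flight */
        if (i + 1 < npencils)
            compute_pencil(i + 1);
        upc_sync(h);                    /* complete before pencil i's buffer is reused */
    }
}
```

The benchmark itself keeps an array of outstanding handles rather than syncing one put at a time; this sketch only shows the basic pattern.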

Could you provide a link to where you got this version? (The NERSC site appears to be missing the tar, and I couldn't easily find it anywhere else.)

npadmana commented 4 years ago

Ah, yes -- I got it from the wayback machine, spelunking into NERSC's history!

ronawho commented 4 years ago

Ok, cool. I think the version we have is from the Hopper era (two machines ago). It'd be nice if we had something from the Edison/Cori timeframe, but I don't think that's possible without pinging somebody at NERSC.

npadmana commented 4 years ago

So, the actual page existed until about Jan 2019: https://web.archive.org/web/20190125060816/http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/npb-upc-ft/

The tarball is from 2013: https://web.archive.org/web/20130306033949/http://www.nersc.gov/assets/Trinity--NERSC-8-RFP/Benchmarks/Jan9/UPC-FT.tar

I believe this was done as part of the Cori procurement. I'm sure if you asked around in the depths of Cray, they might even have timing information (although I'm sure you can't show it to me :-).

ronawho commented 4 years ago

512-node results:

| Size | Chapel | UPC | MPI |
| ---- | ------ | --- | --- |
| D | 1.0 s | 3.4 s | 1.3 s |
| E | 8.4 s | 9.2 s | 12.3 s |
| F | 70.0 s | 72.0 s | 132.2 s |

ronawho commented 4 years ago

> I believe this was done as part of the Cori procurement. I'm sure if you asked around in the depths of Cray, they might even have timing information (although I'm sure you can't show it to me :-).

I'm happy with just gathering timings for the version you found for now. I'm interested in asking around internally at some point to find the most optimized MPI and UPC implementations to see if there are any more tricks we can learn, but I think using publicly available benchmarks for our comparisons is fair and not misleading.

ronawho commented 4 years ago

FYI it looks like http://www.nersc.gov/assets/Trinity--NERSC-8-RFP/Benchmarks/Jan9/UPC-FT.tar is still active