Profiling results on CPU

fredRos commented 7 years ago

Not really a problem, but posting the results here for reference.

Test problem

    nsamples = 512
    real_type = 'float64'
    complex_type = 'complex128'

    wle_m = par1.get('wle_m', 0.003)
    x0_arcsec = par1.get('x0_arcsec', 0.4)
    y0_arcsec = par1.get('y0_arcsec', 10.)

    # generate the samples
    maxuv_generator = 3.e3
    udat, vdat = create_sampling_points(nsamples, maxuv_generator, dtype=real_type)
    x, _, w = generate_random_vis(nsamples, real_type)

    # compute the matrix size and maxuv
    size, minuv, maxuv = matrix_size(udat, vdat, force_nx=4096)
    print("size:{0}, minuv:{1}, maxuv:{2}".format(size, minuv, maxuv))
    uv = pixel_coordinates(maxuv, size).astype(real_type)

    # create model complex image (it happens to have 0 imaginary part)
    reference_image = create_reference_image(size=size, kernel='gaussian', dtype=complex_type)
    ref_complex = reference_image.astype(complex_type)

    chi2_cuda = g_double.chi2(ref_complex, x0_arcsec, y0_arcsec,
                             maxuv/size/wle_m, udat/wle_m, vdat/wle_m, x.real.copy(), x.imag.copy(), w)

Tools

Use Intel amplifier_xe to identify basic hotspots on

    OMP_NUM_THREADS=k python/py.test.sh -s python_package/tests/test_galario.py -k profile

serial

[Figure: 4k-serial-callees]

So the major hotspot is applying the phase. Understandable because FFT scales O(n log(n)) and phase is O(n^2). The default version works in the cartesian plane because that's what FFTW expects. It seemed like applying a phase in polar coordinates would be easier, but it turns out to take longer due to the conversion, so we should stick with what we have.


    // cartesian
    dcomplex const phase = dcomplex{dreal(cos(angle)), dreal(sin(angle))};
    data[idx] = CMPLXMUL(data[idx], phase);

[Figure: cartesian]

    // polar
    dreal const magn = std::abs(data[idx]);
    dreal const oldangle = std::arg(data[idx]);
    data[idx] = std::polar(magn, oldangle + angle);

[Figure: polar]

Comparison: polar takes longer!

[Figure: serial-vs-polar]
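For reference, here is a minimal standalone sketch of the two variants as free functions, so they can be checked against each other outside the library. The `dreal` and `dcomplex` aliases are stand-ins assumed for galario's types, the function names are hypothetical, and plain `operator*` is used in place of `CMPLXMUL`.

```cpp
#include <cmath>
#include <complex>

// Stand-ins for galario's types (assumption: double precision).
using dreal = double;
using dcomplex = std::complex<dreal>;

// Cartesian: build the unit phasor and do one complex multiplication.
inline dcomplex apply_phase_cartesian(dcomplex value, dreal angle) {
    dcomplex const phase{std::cos(angle), std::sin(angle)};
    return value * phase;
}

// Polar: convert to (magnitude, argument), shift the argument, convert back.
inline dcomplex apply_phase_polar(dcomplex value, dreal angle) {
    dreal const magn = std::abs(value);
    dreal const oldangle = std::arg(value);
    return std::polar(magn, oldangle + angle);
}
```

Both functions return the same value up to rounding. The polar route pays for `std::abs` and `std::arg` and then still needs a sine and a cosine inside `std::polar`, which is consistent with it showing up as slower in the profile.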

fredRos commented 7 years ago

A useful list of tricks to reduce OS-related noise for accurate timing: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/linuxtips.md

fredRos commented 7 years ago

The timing output by amplifier seems unreliable. With 1 thread, code in libfftw3 takes 1.7 s; with 8 threads, it's 0.001 s. That doesn't make sense!

To create a bar chart for the paper, I should output the time elapsed in individual operations for both the GPU and the CPU. On the CPU, I would use omp_get_wtime(). On the GPU, I would use the CUDA events already in the code, either with macros to avoid copying code as much as possible, or with this C++ struct: https://github.com/aramadia/udacity-cs344/blob/master/Unit2%20Code%20Snippets/gputimer.h
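A rough sketch of what the CPU side could look like: a scoped timer that accumulates wall-clock seconds per named operation, which is the kind of data a bar chart needs. It calls `omp_get_wtime()` when compiled with OpenMP. The `std::chrono` fallback is an addition here so the sketch also builds without OpenMP, and `wall_time`, `SectionTimes` and `ScopedTimer` are hypothetical names, not part of galario. The GPU side with CUDA events is not covered.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock time in seconds. Uses omp_get_wtime() when OpenMP is enabled;
// otherwise falls back to std::chrono::steady_clock (fallback is an assumption).
inline double wall_time() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Hypothetical helper: accumulates elapsed seconds per named operation.
class SectionTimes {
public:
    void add(const std::string& label, double seconds) { totals_[label] += seconds; }
    double total(const std::string& label) const {
        auto it = totals_.find(label);
        return it == totals_.end() ? 0.0 : it->second;
    }
    const std::map<std::string, double>& all() const { return totals_; }
private:
    std::map<std::string, double> totals_;
};

// RAII guard: records the time between its construction and destruction.
class ScopedTimer {
public:
    ScopedTimer(SectionTimes& times, std::string label)
        : times_(times), label_(std::move(label)), start_(wall_time()) {}
    ~ScopedTimer() { times_.add(label_, wall_time() - start_); }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    SectionTimes& times_;
    std::string label_;
    double start_;
};
```

Wrapping each operation in its own block with a `ScopedTimer` keeps the timing code out of the way, and repeated calls with the same label add up, so one entry per operation comes out at the end.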

fredRos commented 7 years ago

We now have fine-grained timing output, enabled by cmake -DGALARIO_TIMING=1. Timing of memory allocations and transfers is still highly variable on the GPU.
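For readers unfamiliar with the pattern, a compile-time timing switch can be sketched roughly as below. How the CMake option reaches the compiler is not shown in this thread, so the sketch uses a hypothetical `DEMO_TIMING` definition. The `TIMING_START`/`TIMING_STOP` macros, the output format and the example function are likewise made up for illustration and are not galario's implementation.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical switch: compile with -DDEMO_TIMING to enable the output.
#ifdef DEMO_TIMING
#define TIMING_START(name) \
    auto const timing_start_##name = std::chrono::steady_clock::now()
#define TIMING_STOP(name) \
    std::printf("[timing] %s: %.6f s\n", #name, \
        std::chrono::duration<double>( \
            std::chrono::steady_clock::now() - timing_start_##name).count())
#else
// Timing disabled: both macros expand to no-ops.
#define TIMING_START(name) ((void)0)
#define TIMING_STOP(name) ((void)0)
#endif

// Example of an instrumented operation (illustrative only).
inline double sum_of_squares(int n) {
    TIMING_START(sum_of_squares);
    double total = 0.0;
    for (int i = 1; i <= n; ++i) {
        total += static_cast<double>(i) * i;
    }
    TIMING_STOP(sum_of_squares);
    return total;
}
```

With the definition absent, the macros compile away, so the instrumented code pays nothing in a normal build.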
