Parallel SeisIO: the shape of things to come

This Issue is a dedicated thread for discussing plans to parallelize parts of SeisIO for multiple CPU workers. GPU computing plans will be discussed in another thread (maybe much) later.

Existing

The SeedLink client already works in parallel (and has since its inception in 2016). It's a backgrounded process that uses @async. If I had any indication that anyone but me was using it, I'd modify it so that one could process channels as new data arrived without processing each channel twice; that would facilitate "streaming" workflows.
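To make the "process each packet once as it arrives" idea concrete, here is a minimal sketch using only Base Julia. It is not SeisIO's SeedLink code: `stream_packets`, `consume`, the channel IDs, and the synthetic packets are all hypothetical stand-ins. A backgrounded `@async` task pushes packets into a `Channel`, and the consumer takes each packet exactly once.

```julia
# Hypothetical stand-in for a streaming client (NOT SeisIO's SeedLink code).
# A background task pushes (channel_id, samples) packets into a Channel.
function stream_packets(n_packets::Int)
    packets = Channel{Tuple{String, Vector{Float32}}}(32)
    @async begin
        for i in 1:n_packets
            id = isodd(i) ? "XX.STA1..BHZ" : "XX.STA2..BHZ"
            put!(packets, (id, fill(Float32(i), 4)))  # synthetic 4-sample packet
        end
        close(packets)  # signals the consumer that the stream has ended
    end
    return packets
end

# Each packet is removed from the Channel when read, so it is processed once.
function consume(packets)
    totals = Dict{String, Float32}()
    for (id, x) in packets
        totals[id] = get(totals, id, 0f0) + sum(x)
    end
    return totals
end

totals = consume(stream_packets(6))
```

Because taking from a `Channel` removes the item, no bookkeeping is needed to avoid processing the same data twice; whether that fits the real client's buffering is an open question.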

Ongoing/Planned

As mentioned in the paper, SeisDownload.jl by @kura-okubo is being incorporated into SeisIO core. This will affect get_data.

Possibilities

Parallel HDF read support: simple in concept. Add a Boolean flag to h5open and fork read to workers; spawn a SeisChannel on each CPU; process data on each CPU separately.
- Requires a Boolean flag or separate read function; not everyone wants or benefits from this workflow.
- Will require added code for parallel processing on multiple CPUs. See below!
- Open question: how to optimize the tradeoff between transfer time among CPUs and processing time? Can we adaptively determine when this is advantageous? @tclements and @kura-okubo are working on similar computing problems already in their respective packages and a general solution for SeisIO would likely resemble theirs.
- HDF5.jl doesn't support parallel write at present. This is no problem for me, but it's why I only mention read operations.
Fourier transforms
- Setting threads or workers for FFTW improves FFT speed for long sequences but slows it for short ones. The crossover is the sequence length at which the transfer time between CPUs equals the time saved relative to a single-CPU FFT (which scales as N log N); the exact value depends on hardware configuration and number of channels.
- Might be advantageous to do as @tclements does in SeisNoise.jl and adaptively break sequences into segments, but doing so in SeisIO core will mean splitting SeisData objects by channel and segment (as now), then further splitting and forking by subsegment. Is this a good idea? I honestly can't tell.
Parallel processing
- Likely a significant speedup to parallelize and concatenate Fourier-based processing operations (bandpass, instrument response, etc.) so that computational costs of FFT and iFFT are only applied once each per segment.
- This would change workflow for users, but might prove less heinous when one needs multiple FFT-based processing operations (e.g. instrument response, then filtering, then correlation).
Parallel read of legacy file formats designed to hold data from multiple channels (SEED, SUDS, UW) might lead to improvement but I haven't thought about this for non-HDF5 file formats. In particular, a parallel SEED reader/converter is something everyone would benefit from, at least in theory. In practice, I don't know if one can be written in an optimized way, because SEED volumes have no byte index to packets (a fundamental oversight on the part of SEED's creators).
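The idea of chaining Fourier-based operations so that the FFT and iFFT are each applied once per segment can be sketched as follows. This is only an illustration under stated assumptions: `apply_chain` and `bandpass_gain` are hypothetical names, the brick-wall band-pass is a deliberately crude operator, and nothing here reflects SeisIO's actual filtering or instrument-response code. It uses `rfft`, `irfft`, and `rfftfreq` from FFTW.jl.

```julia
using FFTW

# Hypothetical brick-wall band-pass gain (illustration only; a real filter
# would use a smoother response).
bandpass_gain(f, f1, f2) = (f1 <= f <= f2) ? 1.0f0 : 0.0f0

# Apply any number of frequency-domain operators with one FFT and one iFFT.
# Each element of `ops` maps a frequency in Hz to a (possibly complex) gain.
function apply_chain(x::Vector{Float32}, fs::Real, ops)
    n = length(x)
    X = rfft(x)           # forward transform, done once
    f = rfftfreq(n, fs)   # frequency of each rfft bin, in Hz
    for op in ops
        X .*= op.(f)      # each operator is a pointwise multiply
    end
    return irfft(X, n)    # inverse transform, done once
end

fs = 100.0
n = 1000
t = (0:n-1) ./ fs
x = Float32.(sin.(2π * 5 .* t) .+ sin.(2π * 30 .* t))

# Two chained operators: keep 1-10 Hz, then double the amplitude.
ops = [f -> bandpass_gain(f, 1.0, 10.0), f -> 2.0f0]
y = apply_chain(x, fs, ops)
```

With this layout, adding a third or fourth operator costs one more pointwise multiply rather than another FFT/iFFT pair, which is the saving described above.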

Of Limited Use

Parallel read of single-channel, one-segment files (PASSCAL, SAC). I've tried this before: look for the batch_read function in my old commits. In Julia 0.5 it was a significant speedup. Once single-CPU file read became well-optimized, the speedup was <20%; but memory overhead can't be reduced below 100%.

Input Requested

Am I missing anything that you think will benefit from parallelization? If so, what? Please tell me.
Is there additional functionality that you'd like to see in SeisIO that you think would work better in parallel?

Hello jpjones76,

Parallelized processing would help a lot with my data processing. One thing I can answer now: in my workflow, transfer time among CPUs is negligible, or even zero, compared to the processing time, because most of the processing is done without communication between processors.

In my case, I allocate one processor per day when processing one year of seismic noise, and there is no communication between days. The 365 daily processes are therefore independent and readily parallelized, so we can maximize parallel efficiency.

Since the processing in my project is relatively simple and my dataset is not very large, I don't face the communication trade-off that comes with parallelizing more complicated processes.
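A minimal sketch of this one-task-per-day pattern, using Julia's Distributed standard library. Everything specific is an assumption made for illustration: `process_day` is a hypothetical placeholder that generates synthetic data rather than reading a real day of noise, and the worker count is arbitrary.

```julia
using Distributed

# Add workers only if none exist yet (the count here is arbitrary).
nprocs() == 1 && addprocs(2)

# Hypothetical per-day job: stands in for "read one day, process it".
# It depends only on its own day, so workers never need to communicate.
@everywhere function process_day(day::Int)
    x = sin.(2f0 * Float32(pi) .* (1:1000) ./ (10 + day))  # synthetic data
    return (day = day, rms = sqrt(sum(abs2, x) / length(x)))
end

# pmap hands days to whichever worker is free and returns results in order.
results = pmap(process_day, 1:365)
```

Only the small result tuple travels back to the main process, which is why transfer time stays negligible in this kind of workflow.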

Cheers,

Opening this discussion back up in terms of SeisIO + GPUs.

The first thing we should discuss is whether we want to do GPU processing within SeisIO or create a separate, GPU-capable SeisIOGPU package.

Next we need to think about the best way to put SeisChannel/SeisData on the GPU. Some of the challenges I see:

- Gaps are a real problem for processing time series on the GPU.
- Non-uniform SeisChannel data lengths prevent using matrix methods for processing on the GPU.
- Reimplementing all existing loop-based methods on the GPU will be non-trivial.

I think the best solution is to allow GPU processing only for well-suited data, i.e., no gaps and all channels the same length, so that we can store the data in a matrix similar to NodalData. This is the way I went with the RawData struct in SeisNoise and it's worked very well.

An example use case where this would be helpful is template detection.
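As a rough sketch of the "well-suited data only" rule, the check and the matrix layout could look like the code below. `stack_channels` is a hypothetical helper operating on plain vectors, not a function from SeisIO or SeisNoise, and it assumes gaps have already been ruled out upstream; it only enforces equal lengths.

```julia
# Hypothetical helper (not part of SeisIO or SeisNoise): stack equal-length
# traces into a matrix with one channel per column. Assumes the traces have
# already been checked for gaps.
function stack_channels(traces::Vector{Vector{Float32}})
    isempty(traces) && throw(ArgumentError("no traces given"))
    n = length(first(traces))
    all(tr -> length(tr) == n, traces) ||
        throw(ArgumentError("all traces must have the same length"))
    return reduce(hcat, traces)
end

traces = [rand(Float32, 1000) for _ in 1:3]
X = stack_channels(traces)        # one column per channel
peaks = maximum(abs, X; dims=1)   # a whole-matrix reduction, one value per channel
```

Once the data are in this form, moving them to the GPU is a single array transfer, and per-channel operations become reductions or matrix operations over `X`.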

I agree that the best way to proceed is to require gapless data on GPU. That implies some design issues, though. Let me walk you through where I am with this.

1. Reading to GPU

We can't use SeisIO.read_data because the output will either be an Array{GphysData, 1} (which seems intractable, for reasons I can explain next video chat), or a preallocated GphysData object. The latter would use user-specified start and end times with fs or δt read from file.

2. Matrix (2d array) vs. vector (1d array)

I know a 2D array is needed for nodal data because it's processed with 2D Fourier transforms. How are 2D arrays an advantage for other time-series data, though?

I admit I've done no benchmarking to test speedup of a 2d array on research code (e.g., template matching); have you? If it's significantly faster, I could be talked into this.

3. Where is the header info?

Where's the header info for each trace: CPU or GPU? I haven't tested any custom structure that mixed CPU scalars with GPU arrays, but you run your research code on hybrid CPU/GPU platforms, right? So, is it faster to keep the basic header info on CPU, and time-series data on GPU? If we try that approach, then how do we ensure that hybrid structures store CPU and GPU data on the same node? (Is the last precaution even necessary?)

Reading on the CPU seems like the easiest option for now. This obviates the need for dealing with gaps on the GPU. One nice addition to CUDA.jl is the ability to do asynchronous GPU compute while doing I/O on the CPU.
Processing data as a 2D array removes kernel launch latencies. Here's an example where I find the maximum of each column of a GPU array, first with a single reduction call and then with a column-by-column for loop:

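using CUDA, BenchmarkTools

# 2^14 samples × 1000 channels of random Float32 data, allocated on the GPU
A = CUDA.rand(Float32, 2^14, 1000)

# One reduction over the whole matrix: a single kernel launch
function fastmax(A::AbstractArray)
    return maximum(A, dims=1)
end

# One reduction per column: a kernel launch (plus a scalar write) per channel
function slowmax(A::AbstractArray)
    out = similar(A, size(A, 2))
    for ii = 1:size(A, 2)
        out[ii] = maximum(A[:, ii])
    end
    return out
end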

julia> @benchmark CUDA.@sync fastmax(A)
BenchmarkTools.Trial: 
  memory estimate:  1.17 KiB
  allocs estimate:  47
  --------------
  minimum time:     2.628 ms (0.00% GC)
  median time:      2.881 ms (0.00% GC)
  mean time:        2.886 ms (0.00% GC)
  maximum time:     3.944 ms (0.00% GC)
  --------------
  samples:          1727
  evals/sample:     1

julia> @benchmark CUDA.@sync slowmax(A)
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:43
BenchmarkTools.Trial: 
  memory estimate:  3.14 MiB
  allocs estimate:  118484
  --------------
  minimum time:     52.364 ms (0.00% GC)
  median time:      58.849 ms (0.00% GC)
  mean time:        65.354 ms (2.62% GC)
  maximum time:     237.738 ms (19.11% GC)
  --------------
  samples:          77
  evals/sample:     1

fastmax only calls a GPU kernel once, whereas slowmax calls the kernel 1000 times. Similarly, anything that can be reformulated as matrix multiplication can call cuBLAS rather than using matrix-vector algorithms.

I only move the time-series data to the GPU. Scalar operations (like the one shown above) are really slow on the GPU, so we tend to avoid them. I use the Adapt.jl package to move arrays within structures to and from the GPU. As for keeping a structure on the same CPU/GPU node: I think Adapt sends the data to the current device, but I haven't seen this as a problem yet.
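For readers unfamiliar with Adapt.jl, here is a hedged sketch of the hybrid layout described above: header scalars stay on the CPU and only the sample array moves. The `Trace` struct and its fields are hypothetical and are not SeisIO's or SeisNoise's types; the sketch assumes a version of Adapt.jl that provides `@adapt_structure`, and the GPU branch runs only when `CUDA.functional()` reports a usable device.

```julia
using Adapt, CUDA

# Hypothetical container (not a SeisIO or SeisNoise type): CPU header fields
# plus a data array whose storage type is a parameter.
struct Trace{T<:AbstractVector}
    id::String
    fs::Float64
    x::T
end

# Tell Adapt how to rebuild a Trace field by field; strings and scalars pass
# through unchanged, arrays are converted to the target storage.
Adapt.@adapt_structure Trace

t_cpu = Trace("XX.STA1..BHZ", 100.0, rand(Float32, 1000))

if CUDA.functional()
    t_gpu  = adapt(CuArray, t_cpu)    # only t_gpu.x now lives on the GPU
    peak   = maximum(abs, t_gpu.x)    # reduction runs on the device
    t_back = adapt(Array, t_gpu)      # copy the samples back to the CPU
end
```

In this sketch the transfer goes to whichever device is current when `adapt` is called; nothing in it pins a structure to a particular node.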
