jpjones76 / SeisIO.jl

Julia language support for geophysical time series data
http://seisio.readthedocs.org
Other
47 stars 21 forks source link

Parallel SeisIO: the shape of things to come #27

Open jpjones76 opened 4 years ago

jpjones76 commented 4 years ago

This Issue is a dedicated thread for discussing plans to parallelize parts of SeisIO for multiple CPU workers. GPU computing plans will be discussed in another thread (maybe much) later.

Existing

Ongoing/Planned

Possibilities

Of Limited Use

Input Requested

  1. Am I missing anything that you think will benefit from parallelization? If so, what? Please tell me.
  2. Is there additional functionality that you'd like to see in SeisIO that you think would work better in parallel?
kura-okubo commented 4 years ago

Hello jpjones76,

The parallelization of processing will help a lot of my data processing. One thing I can answer now is that transfer time among CPUs is negligible, or even zero, comparing to the processing time in my workflow as many of processing is done without communication between processors.

In my case, I allocate one processor on one day for 1 year seismic noise processing and there is no communication between days. So the total amount of processes (365 days) are fairly parallelized and we could maximize the efficiency of parallelization.

Since the processes of my project is relatively simple and the size of my dataset is not so large, I don't have the trade-off of communications when parallelizing more complicated processes.

Cheers,

tclements commented 3 years ago

Opening this discussion back up in terms of SeisIO + GPUs.

The first thing we should discuss is do we want to do GPU processing within SeisIO or create a GPU-capable SeisIOGPU package?

Next we need to think about the best way to put SeisChannel/SeisData on the GPU. Some of the challenges I see:

I think the best solution is to allow GPU processing only for well-suited data, i.e no gaps and all channels have same length so that we can store the data in a matrix similar to NodalData. This is the way I went with the RawData struct in SeisNoise and it's worked very well.

An example use case where this would be for helpful is template detection.

jpjones76 commented 3 years ago

I agree that the best way to proceed is to require gapless data on GPU. That implies some design issues, though. Let me walk you through where I am with this.

1. Reading to GPU

We can't use SeisIO.read_data because the output will either be an Array{GphysData, 1} (which seems intractable, for reasons I can explain next video chat), or a preallocated GphysData object. The latter would use user-specified start and end times with fs or δt read from file.

2. Matrix (2d array) vs. vector (1d array)

I know a 2D array is needed for nodal data because it's processed with 2D Fourier transforms. How are 2D arrays an advantage for other time-series data, though?

I admit I've done no benchmarking to test speedup of a 2d array on research code (e.g., template matching); have you? If it's significantly faster, I could be talked into this.

3. Header info ni wapi?

Where's the header info for each trace: CPU or GPU? I haven't tested any custom structure that mixed CPU scalars with GPU arrays, but you run your research code on hybrid CPU/GPU platforms, right? So, is it faster to keep the basic header info. on CPU, and time-series data on GPU? If we try that approach, then how do we ensure that hybrid structures store CPU and GPU data on the same node? (Is the last precaution even necessary?)

tclements commented 3 years ago
  1. Reading on the CPU seems like the easiest option for now. This obviates the need for dealing with gaps on the GPU. One nice addition to CUDA.jl is the ability to do asynchronous GPU compute while doing I/O on the CPU.

  2. Processing data as a 2D array removes kernel launch latencies. Here's an example where I find the maximum on each column in an array using a CPU-based and GPU-based for loop:

using CUDA, BenchmarkTools

A = cu(Float32,2^14,1000)

function fastmax(A::AbstractArray)
    return maximum(A,dims=1)
end

function slowmax(A::AbstractArray)
    out = similar(A,size(A,2))
    for ii = 1:size(A,2)
        out[ii] = maximum(A[:,ii])
    end
    return out 
end

julia> @benchmark CUDA.@sync fastmax(A)
BenchmarkTools.Trial: 
  memory estimate:  1.17 KiB
  allocs estimate:  47
  --------------
  minimum time:     2.628 ms (0.00% GC)
  median time:      2.881 ms (0.00% GC)
  mean time:        2.886 ms (0.00% GC)
  maximum time:     3.944 ms (0.00% GC)
  --------------
  samples:          1727
  evals/sample:     1

julia> @benchmark CUDA.@sync slowmax(A)
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:43
BenchmarkTools.Trial: 
  memory estimate:  3.14 MiB
  allocs estimate:  118484
  --------------
  minimum time:     52.364 ms (0.00% GC)
  median time:      58.849 ms (0.00% GC)
  mean time:        65.354 ms (2.62% GC)
  maximum time:     237.738 ms (19.11% GC)
  --------------
  samples:          77
  evals/sample:     1 

fastmax only calls a GPU kernel once, whereas slowmax calls the kernel 1000 times. Similarly, anything that can be reformulated as matrix multiplication can call CuBLAS rather than using matrix-vector algorithms.

  1. I only move the time series data to the GPU. Scalar operations (like the one shown above) are really slow on the GPU, so we tend to avoid them. I use the Adapt.jl package to move arrays within structures to and from the GPU. As to keeping a structure on the same CPU/GPU node-> I think Adapt sends the data to the current device but I haven't seen this as a problem yet...