jpjones76 opened this issue 4 years ago
Hello jpjones76,
Parallelizing the processing will help my data processing a lot. One thing I can answer now is that transfer time among CPUs is negligible, or even zero, compared to the processing time in my workflow, since most of the processing is done without communication between processors.
In my case, I allocate one processor per day for one year of seismic noise processing, and there is no communication between days. The 365 daily processes are therefore embarrassingly parallel, and we can maximize parallel efficiency.
Since the processing in my project is relatively simple and my dataset is not very large, I don't face the communication trade-offs that come with parallelizing more complicated workflows.
Cheers,
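For context, the per-day workflow described above can be sketched with Julia's `Distributed` standard library. `process_day` is a hypothetical placeholder for the actual noise-processing code, not a SeisIO function:

```julia
using Distributed
addprocs(4)                      # e.g., one worker per CPU core

# Hypothetical stand-in for one day of noise processing
@everywhere function process_day(day)
    # ... read, preprocess, and correlate one day of data ...
    return day                   # placeholder result
end

# Days are independent, so pmap distributes them with no
# inter-worker communication; only load balance limits efficiency.
results = pmap(process_day, 1:365)
```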
Opening this discussion back up in terms of SeisIO + GPUs.
The first thing we should discuss is whether we want to do GPU processing within SeisIO or create a separate GPU-capable SeisIOGPU package.
Next we need to think about the best way to put SeisChannel/SeisData on the GPU. Some of the challenges I see:
- `SeisChannel` data lengths prevent using matrix methods for processing on GPU. I think the best solution is to allow GPU processing only for well-suited data, i.e. no gaps and all channels the same length, so that we can store the data in a matrix similar to `NodalData`. This is the way I went with the `RawData` struct in SeisNoise, and it's worked very well.

An example use case where this would be helpful is template detection.
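As a sketch of that design, a gapless, equal-length container in the spirit of `NodalData` / SeisNoise's `RawData` might store channels as columns of one matrix, so the whole batch can move to the GPU and be processed with matrix methods. The struct and field names here are illustrative, not SeisIO's actual layout:

```julia
# Illustrative only: gapless, equal-length channels as matrix columns
struct GaplessData{T<:AbstractMatrix}
    id::Vector{String}   # one ID per channel (host memory)
    fs::Float64          # common sampling rate (gapless ⇒ a single fs)
    t0::Float64          # common start time
    data::T              # nsamples × nchannels; Matrix on CPU, CuArray on GPU
end
```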
I agree that the best way to proceed is to require gapless data on GPU. That implies some design issues, though. Let me walk you through where I am with this.
We can't use `SeisIO.read_data` because the output will either be an `Array{GphysData, 1}` (which seems intractable, for reasons I can explain next video chat), or a preallocated `GphysData` object. The latter would use user-specified start and end times with `fs` or `δt` read from file.
I know a 2D array is needed for nodal data because it's processed with 2D Fourier transforms. How are 2D arrays an advantage for other time-series data, though?
I admit I've done no benchmarking to test the speedup of a 2D array on research code (e.g., template matching); have you? If it's significantly faster, I could be talked into this.
Where's the header info for each trace: CPU or GPU? I haven't tested any custom structure that mixed CPU scalars with GPU arrays, but you run your research code on hybrid CPU/GPU platforms, right? So, is it faster to keep the basic header info. on CPU, and time-series data on GPU? If we try that approach, then how do we ensure that hybrid structures store CPU and GPU data on the same node? (Is the last precaution even necessary?)
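One way to frame the "where does the header live?" question is a hybrid struct that keeps scalar header fields on the CPU and only the sample matrix on the GPU, with Adapt.jl converting just the array field. This is a hypothetical sketch under that assumption, not SeisIO code:

```julia
using Adapt, CUDA

# Illustrative hybrid container: headers on host, samples on device
struct TraceBlock{T<:AbstractMatrix}
    id::Vector{String}  # header info stays in host memory
    fs::Float64
    data::T             # host Matrix or device CuArray
end

# Teach Adapt to convert only the array field; headers are untouched
Adapt.adapt_structure(to, tb::TraceBlock) =
    TraceBlock(tb.id, tb.fs, adapt(to, tb.data))

# cpu_tb = TraceBlock(["XX.STA..HHZ"], 100.0, rand(Float32, 2^14, 10))
# gpu_tb = adapt(CuArray, cpu_tb)   # data on GPU; id and fs still on CPU
```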
Reading on the CPU seems like the easiest option for now. This obviates the need for dealing with gaps on the GPU. One nice addition to CUDA.jl is the ability to do asynchronous GPU compute while doing I/O on the CPU.
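The CPU-read / GPU-compute overlap mentioned above might look like the following sketch, where `read_segment` and `process_on_gpu` are hypothetical placeholders for the actual I/O and processing routines:

```julia
using CUDA

function pipeline(files)
    buf = cu(read_segment(files[1]))     # prime the pipeline
    for f in files[2:end]
        t = @async read_segment(f)       # CPU I/O runs as a task
        process_on_gpu(buf)              # GPU kernels launch asynchronously
        buf = cu(fetch(t))               # wait for the read, upload next chunk
    end
    process_on_gpu(buf)                  # drain the last buffer
end
```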
Processing data as a 2D array removes kernel launch latencies. Here's an example where I find the maximum on each column in an array using a CPU-based and GPU-based for loop:
```julia
using CUDA, BenchmarkTools

A = CUDA.rand(Float32, 2^14, 1000)  # 2^14 × 1000 array on the GPU

# One fused reduction over columns: a single kernel launch
function fastmax(A::AbstractArray)
    return maximum(A, dims=1)
end

# One kernel launch (plus a slice allocation) per column
function slowmax(A::AbstractArray)
    out = similar(A, size(A, 2))
    for ii = 1:size(A, 2)
        out[ii] = maximum(A[:, ii])
    end
    return out
end
```
```
julia> @benchmark CUDA.@sync fastmax(A)
BenchmarkTools.Trial:
  memory estimate:  1.17 KiB
  allocs estimate:  47
  --------------
  minimum time:     2.628 ms (0.00% GC)
  median time:      2.881 ms (0.00% GC)
  mean time:        2.886 ms (0.00% GC)
  maximum time:     3.944 ms (0.00% GC)
  --------------
  samples:          1727
  evals/sample:     1

julia> @benchmark CUDA.@sync slowmax(A)
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:43
BenchmarkTools.Trial:
  memory estimate:  3.14 MiB
  allocs estimate:  118484
  --------------
  minimum time:     52.364 ms (0.00% GC)
  median time:      58.849 ms (0.00% GC)
  mean time:        65.354 ms (2.62% GC)
  maximum time:     237.738 ms (19.11% GC)
  --------------
  samples:          77
  evals/sample:     1
```
`fastmax` only calls a GPU kernel once, whereas `slowmax` calls the kernel 1000 times. Similarly, anything that can be reformulated as matrix multiplication can call CuBLAS rather than using matrix-vector algorithms.
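As a small illustration of the CuBLAS point: correlating every column against a single template can be written as one matrix-vector product instead of a per-column loop, so one BLAS call replaces a thousand kernel launches. The arrays here are made up for the example:

```julia
using CUDA

A = CUDA.rand(Float32, 2^14, 1000)   # 1000 traces as columns
w = CUDA.rand(Float32, 2^14)         # template / weight vector

# One CUBLAS gemv call instead of 1000 launches of dot(w, A[:, i])
scores = A' * w                      # 1000-element CuArray
```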
`Adapt` sends the data to the current device, but I haven't seen this as a problem yet...
This Issue is a dedicated thread for discussing plans to parallelize parts of SeisIO for multiple CPU workers. GPU computing plans will be discussed in another thread (maybe much) later.
**Existing**

- `@async`. If I had any indication that anyone but me was using it, I'd modify it so that one could process channels as new data arrived without processing each channel twice; that would facilitate "streaming" workflows.

**Ongoing/Planned**

- `get_data`.

**Possibilities**

- `h5open` and fork `read` to workers; spawn a `SeisChannel` on each CPU; process data on each CPU separately.
- Splitting `SeisData` objects by channel and segment (as now), then further splitting and forking by subsegment. Is this a good idea? I honestly can't tell.

**Of Limited Use**

- `batch_read` function in my old commits. In Julia 0.5 it was a significant speedup. Once single-CPU file read became well-optimized, the speedup was <20%; but memory overhead can't be reduced below 100%.

**Input Requested**
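The fork-reads-to-workers possibility above could be sketched like this; `read_one` is a hypothetical stand-in for a per-worker SeisIO read-and-process call, and `"data/"` is an assumed directory:

```julia
using Distributed
addprocs(4)

# Hypothetical per-worker read: each worker handles one file independently
@everywhere function read_one(file)
    # C = SeisIO.read_data("mseed", file)   # illustrative read call
    # process!(C)                           # per-worker processing
    return file                             # placeholder result
end

channels = pmap(read_one, readdir("data/"; join=true))
```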