Closed jakelove closed 2 years ago
@JakeLove Impressive work!
Just a few notes:
Index2Coord
, @JeroenMulkers @JLeliaert what do you think guys?@godsic Thank you for your comments.
All the existing tests have passed and the values in the tables provided have been tested by hand, however I agree automated tests should be added.
Could you elaborate on what you mean by "I would prefer indexing scheme to be consistent with Index2Coord
". The implementation for PositionReal is essentially a mapping of Index2Coord across the whole mesh.
@JakeLove I mean that index2coord
and PositionReal
give the same values for the given index (i, j, k)
, including moving window functionality. The "same" is a bit too strong term here, since index2coord
uses double
precision, while CUDA implementation uses single
one, so "same" within rounding error then.
To be honest, I am a bit surprised that indices pre-computed on CPU and then transferred to GPU when needed turned to be slower. Could you please elaborate a bit on how it was implemented?
Ah yes, I hadn't considered the moving window functionality.
As for the precomputed method, here is my implementation (doesn't work with resizing the mesh currently but is given as a PoC):
var __poslocalcache *data.Slice
var __poslocaliscached bool = false
func PositionLocalCached(dst *data.Slice) {
if !__poslocaliscached {
fmt.Print("Allocating __poslocalcache on device\n")
__poslocalcache = cuda.NewSlice(3, Mesh().Size())
dims := Mesh().Size()
sx := 1.0 / float32(decrement_safe(dims[X]))
sy := 1.0 / float32(decrement_safe(dims[Y]))
sz := 1.0 / float32(decrement_safe(dims[Z]))
fmt.Print("Filling __poslocalcache\n")
cuda.SpatialField(__poslocalcache, sx, sy, sz)
__poslocaliscached = true
}
data.Copy(dst, __poslocalcache)
}
Bencmarking with the script here: https://gist.github.com/JakeLove/d9dcc45f1b8e2103b46f793ac554b0dd
I got times of 140s for the in-place implementation and 152s for the precomputed implementation (over 5 runs average on a GTX1070).
@JakeLove Thanks for sharing. I misunderstood the design, thinking that you calculate indices on CPU side (just with some go code) and then copy them to GPU buffer. Those need updates on mesh resizing and moving window. I think updates on resize are OK, but on moving window might quickly make CPU-side calculation a bottleneck.
Regarding PoC I am a bit surprised it gets slower as device-to-device memcpy
is typically blazing fast. The only reason I can think of is data.Copy
calling memcpy
3 times (once per data component), while in-place implementation makes a single CUDA kernel call. In other words you are observing CUDA overhead.
Finally, I would suggest to do sx := float32(1.0 / dims[X])
where applicable, as it would minimize rounding error.
This extension adds two quantities that represent the positions of each cell. This allows custom fields to depend upon the position of the cell in the Mesh, allowing for spatially dependent effective fields.
Quantity:
PositionLocal
This quantity represents position vectors of the center of each cell from the center of the mesh, normalised from -0.5 to +0.5. For example, with a 2x3x4 grid of cells we get the following mappings from mesh indices toPositionLocal
:Quantity:
PositionReal
This quantity represents the real position vector of the center of each cell from the center of the mesh, units in meters. For example, with a 2x3x4 grid of cells with cell sizes 1nmx2nmx3nm we get the following mappings from mesh indices toPositionWorld
:Implementation is using CUDA kernel that recomputes the indices when needed. I also tried an implementation where the indices are pre-computed onto a GPU allocated slice and copied when needed but this was slower than the current implementation.
An example of using this to create a spatially dependent anisotropy is given below (Image of result at the end):