Open guillaumetousignant opened 3 years ago
Maybe allocate once per element, and just offset through that memory for everything?
Each element allocates one size_t per neighbour face, plus 35 (N + 1) + 14 (N + 1)^2 deviceFloat. Assuming deviceFloat is 8 bytes, that is 525 values, so about 4 kB per element at N = 4, and 2821 values, so about 23 kB per element at N = 12.
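To make the idea concrete, here is a rough host-side sketch of the size formula above and of carving sub-arrays out of one block per element. `ElementArena`, `take`, `floats_per_element` and `bytes_per_element` are hypothetical names, not existing code in this repository, and `deviceFloat` is assumed to be `double`. On the device the single block would come from one allocation per element (or one for the whole mesh) instead of `std::malloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Assumption: deviceFloat is an 8-byte double, as in the estimate above.
using deviceFloat = double;
static_assert(sizeof(deviceFloat) == 8, "estimate assumes 8-byte deviceFloat");

// Number of deviceFloat values one element needs: 35 (N + 1) + 14 (N + 1)^2.
constexpr std::size_t floats_per_element(std::size_t N) {
    return 35 * (N + 1) + 14 * (N + 1) * (N + 1);
}

// Total bytes for one element: the deviceFloat values plus one size_t per neighbour face.
constexpr std::size_t bytes_per_element(std::size_t N, std::size_t n_faces) {
    return floats_per_element(N) * sizeof(deviceFloat) + n_faces * sizeof(std::size_t);
}

// Hypothetical bump allocator: hands out consecutive slices of one pre-allocated block.
// It never frees individual slices; the owner releases the whole block at once.
class ElementArena {
public:
    ElementArena(void* base, std::size_t size)
        : base_(static_cast<unsigned char*>(base)), size_(size), offset_(0) {}

    // Returns a pointer to `count` objects of type T inside the block.
    // The offset is rounded up to the alignment of T (assumes the block itself is suitably aligned).
    template <typename T>
    T* take(std::size_t count) {
        const std::size_t aligned = (offset_ + alignof(T) - 1) / alignof(T) * alignof(T);
        assert(aligned + count * sizeof(T) <= size_); // block must be large enough
        offset_ = aligned + count * sizeof(T);
        return reinterpret_cast<T*>(base_ + aligned);
    }

    // Bytes handed out so far, including any alignment padding.
    std::size_t used() const { return offset_; }

private:
    unsigned char* base_;
    std::size_t size_;
    std::size_t offset_;
};
```

Usage would be one `std::malloc(bytes_per_element(N, n_faces))` (or its device equivalent) per element, followed by a `take<deviceFloat>(...)` call for each array and a final `take<std::size_t>(n_faces)`. Taking the `deviceFloat` arrays first keeps everything 8-byte aligned without padding on typical 64-bit platforms.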
Maybe also use a better CUDA malloc implementation. The following repository links to both the implementation and the paper: https://github.com/ax3l/scatteralloc
Time is mostly taken up by memory allocation functions.