Closed · jpintar closed this 8 months ago

Averaging over 130 runs of a trial dataset (15,000 × 15,000) on a variety of GPUs and CPUs, I've noticed that `rsc.pp.normalize_total()` is slower than `scanpy.pp.normalize_total()` by a factor of around 4.5. In the best case, the average over ten runs on an NVIDIA A100 was 2.06 times slower than running the Scanpy version on a Xeon Platinum 8358. In the worst case, the average over ten runs on a GeForce GTX 1080 Ti was 8.85 times slower than running on a Xeon E5-2697 v4. I haven't had time to look under the hood and figure out what's going on, but I thought I should let you know.

rapids-singlecell 0.9.3
RAPIDS 23.12
CUDA 11.8
CuPy 12.3
Scanpy 1.9.6
@jpintar I don't think this is a bug. 15,000 cells is not a size where GPU acceleration shines. What format is your matrix (CSR, CSC, C- or F-contiguous)? You also have to account for compile time for the GPU kernel, but that only affects the first launch. I can look into the kernels for more optimisations. Can you also please share the real runtimes, not just ratios?
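(For anyone following along: one way to answer the format question, as a small helper that isn't part of either library:)

```python
import scipy.sparse as sp

def describe_matrix(X):
    """Report the storage format asked about above: sparse format or dense memory order, plus dtype."""
    if sp.issparse(X):
        return f"sparse {X.format}, dtype={X.dtype}"  # 'csr' or 'csc'
    order = ("C" if X.flags.c_contiguous
             else "F" if X.flags.f_contiguous
             else "non-contiguous")
    return f"dense, {order}-ordered, dtype={X.dtype}"
```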
Thanks! This was with a CSR matrix, throwing out the first run-times. The real average runtimes were 0.110337 s on CPU vs. 0.513284 s on GPU (I can also break those down by CPU/GPU model if that would be helpful). It just stood out as the only step in a toy analysis (`pp.highly_variable_genes`, `pp.normalize_total`, `pp.log1p`, `pp.pca`, `pp.neighbors`, `tl.leiden`) where the GPU version wasn't a massive improvement. Overall, the GPU workflow took 0.2 times as long as the CPU workflow with this dataset (including moving the data to the GPU and back), even with `pp.normalize_total` running slower.
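(For context, a minimal sketch of the timing methodology described here - names are illustrative, not from the original script. It assumes the AnnData's `X` is already a CuPy array for the `rsc` branch, and synchronizes the device so GPU times aren't understated:)

```python
import time

import cupy as cp
import scanpy as sc
import rapids_singlecell as rsc

def mean_runtime(adata, use_gpu, n_runs=10):
    """Average normalize_total runtime over n_runs, discarding the first run."""
    times = []
    for _ in range(n_runs + 1):
        a = adata.copy()  # fresh copy each run: normalization is in-place by default
        t0 = time.perf_counter()
        if use_gpu:
            rsc.pp.normalize_total(a)       # assumes a.X already lives on the GPU
            cp.cuda.Device().synchronize()  # wait for kernels before stopping the clock
        else:
            sc.pp.normalize_total(a)
        times.append(time.perf_counter() - t0)
    return sum(times[1:]) / n_runs          # drop run 0 (kernel compile time)
```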
Ok, that's actually slower than expected. Did you provide a `target_sum`?
No `target_sum`.
Ahh ok. I think that's the bottleneck then. Try it with 10000. I think that will be a lot faster.
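(That is, something along the lines of:)

```python
import rapids_singlecell as rsc

# adata: your AnnData, with adata.X already on the GPU
rsc.pp.normalize_total(adata, target_sum=10000)
```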
That was it! With `target_sum=10000`, over seven sets of ten runs, the average time for `pp.normalize_total` on CPU was 0.11058307 s (so basically the same as before), but only 0.00009686 s on GPU. With this, the GPU-to-CPU run-time ratio for the whole toy workflow was 0.12.
I'll check if I can speed up the median calculation.
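(For context: with `target_sum=None`, each cell is scaled so its total matches the median of the per-cell totals, so the median has to be computed on the device first. A conceptual sketch on a dense CuPy array - not the library's actual code:)

```python
import cupy as cp

def normalize_total_sketch(X, target_sum=None):
    """Conceptual normalize_total for a dense CuPy float array."""
    counts = X.sum(axis=1)              # total counts per cell
    if target_sum is None:
        target_sum = cp.median(counts)  # device-wide selection/sort: the slow part
    return X * (target_sum / counts)[:, None]
```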
I ran some tests. I can't reproduce the numbers you are getting. For my test dataset (92666 × 25660) I get 43.4 ms on GPU and 471 ms on CPU without `target_sum`. With `target_sum=10000` I get 11.5 ms on GPU vs. 467 ms on CPU. That's all with `inplace=False`.
Is `normalize_total` the first thing you do in your workflow?
Yes, `normalize_total` is the first step. Not sure what's causing the discrepancy. I tested again with a larger dataset (100,000 × 25,000) on a Xeon Platinum 8352S/NVIDIA A100 80GB node and got (averaged over ten runs):

- No `target_sum`: 1.870052 s on GPU, 0.893860 s on CPU
- `target_sum=10000`: 0.000127 s on GPU, 0.891473 s on CPU
- No `target_sum`, with `inplace=False`: 1.873045 s on GPU, 1.321668 s on CPU
- `target_sum=10000`, with `inplace=False`: 0.000284 s on GPU, 1.348362 s on CPU

On this platform, the original smaller dataset (15,000 × 15,000) gives me:

- No `target_sum`: 0.194899 s on GPU, 0.107211 s on CPU
- `target_sum=10000`: 0.000117 s on GPU, 0.107495 s on CPU
- No `target_sum`, with `inplace=False`: 0.195746 s on GPU, 0.160107 s on CPU
- `target_sum=10000`, with `inplace=False`: 0.000266 s on GPU, 0.163338 s on CPU

I just can't reproduce it. I run my tests on an A100 80GB PCIe. The dataset I use for testing is the lung dataset from the notebooks. It would be nice if you could also use it once, just to be 100% sure. My initial concern was that the data just wasn't loaded properly yet, but since you average over 10 runs, that can't be the case. I wrote a new kernel for the sparse sum. This is slightly faster. Do you use 32 or 64 bit?
Sorry about the delay - was travelling, so I couldn't get to this immediately.

The data are all float32. But things keep getting stranger. I downloaded the Qian et al. lung dataset (93,575 × 33,694) from the sample notebooks and reran my test script on both it and my 100,000 × 25,000 toy dataset, in fresh minimal conda environments that differ only in the rapids-singlecell version. I'm getting the following (all times in seconds, again averaging over ten runs after dropping the first run):
|  | Toy 0.9.2 | Toy 0.9.3 | Toy 0.9.4 | Toy 0.9.5 | Lung 0.9.2 | Lung 0.9.3 | Lung 0.9.4 | Lung 0.9.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU, no `target_sum` | 0.898468 | 0.889558 | 0.889898 | 0.889566 | 0.495055 | 0.492184 | 0.492184 | 0.493177 |
| GPU, no `target_sum` | 1.887408 | 1.888572 | 0.020584 | 0.020677 | 0.012107 | 0.012067 | 0.010960 | 0.010956 |
| CPU, `target_sum=10000` | 0.898757 | 0.887722 | 0.888544 | 0.889203 | 0.494149 | 0.490748 | 0.491338 | 0.491512 |
| GPU, `target_sum=10000` | 0.000042 | 0.000042 | 0.000039 | 0.000038 | 0.000040 | 0.000040 | 0.000039 | 0.000039 |
So no slowdown with the lung dataset, just as in your tests. But also no slowdown with my toy dataset with versions after `0.9.3`, and as far as I know nothing changed between `0.9.3` and `0.9.4` that accounts for this. And I don't see anything particularly anomalous about my toy dataset - only that it's denser than the lung dataset (8.3% non-zero vs. 3.6%). It was generated by randomly downsampling (with replacement) from real (unpublished) data.
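(The density figures above are just the stored-non-zero fraction, e.g.:)

```python
def density(X):
    """Fraction of explicitly stored non-zeros in a scipy/cupyx sparse matrix."""
    return X.nnz / (X.shape[0] * X.shape[1])
```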
I added a new special algorithm with #112 in v0.9.4 that does the summation across the major axis for CSR matrices. This worked a lot better than I expected for your data. CuPy does the summation with a matrix multiplication; I have custom logic now that works a little differently and relies less on atomics. Maybe I should open a PR with this for CuPy.
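(For illustration only - this is not the actual #112 kernel, just a minimal sketch of the idea, assuming float32 data and CuPy's default 32-bit indices: with one thread per CSR row, each output element has a single writer, so no atomics are needed:)

```python
import cupy as cp

# One thread per CSR row: each thread owns its output element, so no atomics.
_row_sum = cp.RawKernel(r"""
extern "C" __global__
void csr_row_sum(const int* indptr, const float* data,
                 float* out, const int n_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float acc = 0.0f;
    for (int i = indptr[row]; i < indptr[row + 1]; i++)
        acc += data[i];
    out[row] = acc;
}
""", "csr_row_sum")

def csr_row_sums(X):
    """Row sums of a cupyx float32 CSR matrix without a matmul."""
    out = cp.zeros(X.shape[0], dtype=cp.float32)
    threads = 128
    blocks = (X.shape[0] + threads - 1) // threads
    _row_sum((blocks,), (threads,),
             (X.indptr, X.data, out, cp.int32(X.shape[0])))
    return out
```

A warp-per-row variant usually balances better when row lengths vary; thread-per-row is just the simplest shape of the idea.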
Awesome! Thank you for looking into this!