scverse / rapids_singlecell

Rapids_singlecell: A GPU-accelerated tool for scRNA analysis. Offers seamless scverse compatibility for efficient single-cell data processing and analysis.
https://rapids-singlecell.readthedocs.io/
MIT License

normalize_total slower than CPU implementation #111

Closed jpintar closed 8 months ago

jpintar commented 8 months ago

Averaging over 130 runs on a trial dataset (15,000 × 15,000) across a variety of GPUs and CPUs, I've noticed that rsc.pp.normalize_total() is slower than scanpy.pp.normalize_total() by a factor of around 4.5. In the best case, the average over ten runs on an NVIDIA A100 was 2.06 times slower than running the Scanpy version on a Xeon Platinum 8358. In the worst case, the average over ten runs on a GeForce GTX 1080 Ti was 8.85 times slower than running on a Xeon E5-2697 v4. I haven't had time to look under the hood and figure out what's going on, but I thought I should let you know.

rapids-singlecell 0.9.3, RAPIDS 23.12, CUDA 11.8, CuPy 12.3, Scanpy 1.9.6
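
For reference, a minimal sketch of how such a comparison might be timed (the file name, loop structure, and the explicit CuPy transfer are assumptions, not the original benchmark script; the key detail is synchronizing the GPU before reading the clock so asynchronous kernel launches are not under-counted):

```python
import time

import cupy as cp
import cupyx.scipy.sparse as cusparse
import scanpy as sc
import rapids_singlecell as rsc

adata = sc.read_h5ad("toy_15000x15000.h5ad")   # hypothetical file name
X_cpu = adata.X.copy()                          # raw CSR counts on the host
X_gpu = cusparse.csr_matrix(X_cpu)              # same counts on the device

def average_runtime(normalize, X, n_runs=11):
    """Mean wall time, dropping the first (warm-up / kernel-compile) run."""
    times = []
    for _ in range(n_runs):
        adata.X = X.copy()                      # fresh, un-normalized counts for every run
        start = time.perf_counter()
        normalize(adata)
        cp.cuda.Device().synchronize()          # make sure queued GPU work has finished
        times.append(time.perf_counter() - start)
    return sum(times[1:]) / (n_runs - 1)

print("CPU:", average_runtime(sc.pp.normalize_total, X_cpu))
print("GPU:", average_runtime(rsc.pp.normalize_total, X_gpu))
```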

Intron7 commented 8 months ago

@jpintar I don't think this is a bug. 15,000 cells is not a scale where GPU acceleration shines. What format is your matrix (CSR, CSC, C- or F-contiguous)? You also have to account for compile time for the GPU kernel, but that only affects the first launch. I can look into the kernels for more optimisations. Could you also please share the real runtimes and not just the ratios?

jpintar commented 8 months ago

Thanks! This was with a CSR matrix, after throwing out the first run's times. The real average runtimes were 0.110337 s on CPU vs. 0.513284 s on GPU (I can also break those down by CPU/GPU model if that would be helpful). It just stood out as the only step in a toy analysis (pp.highly_variable_genes, pp.normalize_total, pp.log1p, pp.pca, pp.neighbors, tl.leiden) where the GPU version wasn't a massive improvement. Overall, the GPU workflow took 0.2 times as long as the CPU workflow on this dataset (including moving the data to the GPU and back), even with pp.normalize_total running slower.
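
For orientation, a rough sketch of what the GPU side of such a toy workflow can look like (the file name, step order, and the explicit CuPy transfer are assumptions; depending on the rapids-singlecell version the data transfer may instead go through the library's own helpers):

```python
import scanpy as sc
import rapids_singlecell as rsc
import cupyx.scipy.sparse as cusparse

adata = sc.read_h5ad("toy_15000x15000.h5ad")   # hypothetical file name
adata.X = cusparse.csr_matrix(adata.X)         # move the CSR counts to the GPU

rsc.pp.normalize_total(adata)                  # the step discussed in this issue
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata)
rsc.pp.pca(adata)
rsc.pp.neighbors(adata)
rsc.tl.leiden(adata)

adata.X = adata.X.get()                        # move the matrix back to the host
```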

Intron7 commented 8 months ago

Ok that's actually slower than expected. Did you provide a target_sum?

jpintar commented 8 months ago

No target_sum.

Intron7 commented 8 months ago

Ahh ok. I think that's the bottleneck then. Try it with 10000. I think that will be a lot faster.
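
In code that would be, roughly (sketch only; `adata` is assumed to already hold the counts on the GPU):

```python
import rapids_singlecell as rsc

# an explicit target skips computing the median of the per-cell totals
rsc.pp.normalize_total(adata, target_sum=1e4)
```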

jpintar commented 8 months ago

That was it! With target_sum=10000, over seven sets of ten runs, the average time for pp.normalize_total on CPU was 0.11058307 s (so basically the same as before), but only 0.00009686 s on GPU. With this, the GPU-to-CPU runtime ratio for the whole toy workflow was 0.12.

Intron7 commented 8 months ago

I'll check whether I can speed up the median calculation.
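
For context on why the default is slower: when target_sum is None, normalize_total scales every cell to the median of the per-cell count totals, so all totals have to be reduced to a single median before any scaling can happen. A rough, illustrative sketch of that logic (not the library's actual implementation):

```python
import cupy as cp
import cupyx.scipy.sparse as cusparse

def normalize_total_sketch(X: cusparse.csr_matrix, target_sum=None):
    """Illustrative only: row-normalize a CSR counts matrix (cells x genes)."""
    counts_per_cell = X.sum(axis=1).ravel()              # per-cell total counts
    if target_sum is None:
        # the extra step discussed in this issue: a median over all per-cell totals
        target_sum = cp.median(counts_per_cell[counts_per_cell > 0])
    scale = target_sum / cp.maximum(counts_per_cell, 1)  # guard against empty cells
    return cusparse.diags(scale.astype(X.dtype), format="csr").dot(X)
```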

Intron7 commented 8 months ago

I ran some tests and can't reproduce the numbers you are getting. For my test dataset (92,666 × 25,660) I get 43.4 ms on GPU vs. 471 ms on CPU without target_sum. With target_sum=10000 I get 11.5 ms on GPU vs. 467 ms on CPU. That's all with inplace=False.

Is normalize_total the first thing you do in your workflow?

jpintar commented 8 months ago

Yes, normalize_total is the first step. Not sure what's causing the discrepancy. I tested again with a larger dataset (100,000 × 25,000) on a Xeon Platinum 8352S/NVIDIA A100 80GB node and got (averaged over ten runs):

On this platform, the original smaller dataset (15,000 × 15,000) gives me:

Intron7 commented 8 months ago

I just can't reproduce it. I ran my tests on an A100 80GB PCIe. The dataset I use for testing is the lung dataset from the notebooks. It would be nice if you could also use it once, just to be 100% sure. My initial concern was that the data just wasn't loaded properly yet, but since you average over 10 runs that can't be the case. I wrote a new kernel for the sparse sum, which is slightly faster. Do you use 32- or 64-bit floats?

jpintar commented 8 months ago

Sorry about the delay - was travelling so I couldn't get to this immediately.

The data are all float32. But things keep getting stranger. I downloaded the Qian et al. lung dataset (93,575 × 33,694) from the sample notebooks, and reran my test script on both it and my 100,000 × 25,000 toy dataset, in fresh minimal conda environments that differ only in the rapids-singlecell version. I'm getting the following (all times in seconds, again averaging over ten runs after dropping the first run):

Toy dataset (100,000 × 25,000):

| | 0.9.2 | 0.9.3 | 0.9.4 | 0.9.5 |
|---|---|---|---|---|
| CPU, no target_sum | 0.898468 | 0.889558 | 0.889898 | 0.889566 |
| GPU, no target_sum | 1.887408 | 1.888572 | 0.020584 | 0.020677 |
| CPU, target_sum=10000 | 0.898757 | 0.887722 | 0.888544 | 0.889203 |
| GPU, target_sum=10000 | 0.000042 | 0.000042 | 0.000039 | 0.000038 |

Lung dataset (93,575 × 33,694):

| | 0.9.2 | 0.9.3 | 0.9.4 | 0.9.5 |
|---|---|---|---|---|
| CPU, no target_sum | 0.495055 | 0.492184 | 0.492184 | 0.493177 |
| GPU, no target_sum | 0.012107 | 0.012067 | 0.010960 | 0.010956 |
| CPU, target_sum=10000 | 0.494149 | 0.490748 | 0.491338 | 0.491512 |
| GPU, target_sum=10000 | 0.000040 | 0.000040 | 0.000039 | 0.000039 |

So no slowdown with the lung dataset, just as in your tests. But there is also no slowdown with my toy dataset on versions after 0.9.3, and as far as I know nothing changed between 0.9.3 and 0.9.4 that accounts for this. And I don't see anything particularly anomalous about my toy dataset, only that it's denser than the lung dataset (8.3% non-zero vs. 3.6%). It was generated by randomly downsampling (with replacement) from real (unpublished) data.
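
(For anyone comparing datasets on this axis, one way to compute the non-zero fraction quoted above; the file name is hypothetical:)

```python
import scanpy as sc

adata = sc.read_h5ad("toy_100000x25000.h5ad")    # hypothetical file name
n_cells, n_genes = adata.X.shape
print(f"{adata.X.nnz / (n_cells * n_genes):.1%} non-zero")
```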

Intron7 commented 8 months ago

I added a new special algorithm that does the summation across the major axis for CSR matrices with #112 in v0.9.4. This worked a lot better than I expected for your data. CuPy does the summation with a matrix multiplication; I now have custom logic that works a little differently and relies less on atomics. Maybe I should open a PR for this against CuPy.
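
For readers curious what a custom major-axis sum over a CSR matrix can look like, here is a rough sketch of the one-thread-per-row idea (illustrative only, not the kernel merged in #112): because every thread owns one row and writes only its own output slot, no atomic operations are needed.

```python
import cupy as cp
import cupyx.scipy.sparse as cusparse

# One thread per row: each thread sums data[indptr[row]:indptr[row + 1]]
# and writes its own output slot, so no atomics are required.
_row_sum_kernel = cp.RawKernel(r"""
extern "C" __global__
void csr_row_sum(const int* indptr, const float* data, float* out, int n_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) {
        return;
    }
    float acc = 0.0f;
    for (int i = indptr[row]; i < indptr[row + 1]; i++) {
        acc += data[i];
    }
    out[row] = acc;
}
""", "csr_row_sum")

def csr_row_sums(X: cusparse.csr_matrix) -> cp.ndarray:
    """Sum a float32 CSR matrix along its major (row) axis."""
    n_rows = X.shape[0]
    out = cp.zeros(n_rows, dtype=cp.float32)
    threads = 128
    blocks = (n_rows + threads - 1) // threads
    _row_sum_kernel((blocks,), (threads,), (X.indptr, X.data, out, cp.int32(n_rows)))
    return out
```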

jpintar commented 8 months ago

Awesome! Thank you for looking into this!