rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Improve perf of DBScan csr_row_op kernel #2387

Open MatthiasKohl opened 4 years ago

MatthiasKohl commented 4 years ago

Is your feature request related to a problem? Please describe.

Scaling DBScan to large sizes requires small batch sizes to avoid running out of memory. The csr_row_op kernel has quite bad performance at small batch sizes.

Details: if we look at this kernel https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/src_prims/sparse/csr.cuh#L649-L670, we can see that it issues B threads (where B is the batch size), each doing O(N) work (where N is the total number of points).

For large N, since available memory is limited and we want to keep memory usage constant, B must decrease inversely with N. For example, with ~2 GB of memory: for N ~ 1e6, B ~ 2k, and for N ~ 2e6, B ~ 1k.

For large GPUs, having B around ~1k means that we no longer fill the GPU, while the work per thread keeps increasing (it is O(N); see the lambda in the link above).

Describe the solution you'd like

A simple solution would be to check N/B and, if it is large enough, switch to a different kernel (possibly one that simply goes over the dense adjacency matrix and updates the counter k with atomic ops, which would also allow coalesced access to the dense matrix).
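For illustration, something like the following sketch (hypothetical kernel and variable names, not existing cuML code): each thread inspects one entry of the flattened B x N boolean adjacency matrix in a grid-stride loop, so reads are coalesced, and appends set columns via an atomic per-row counter.

```cuda
// Hypothetical sketch of the proposed dense path, not the cuML implementation.
// adj:        row-major B x N boolean adjacency matrix (batch rows x all points)
// row_cursor: per-row write positions, assumed initialized to a copy of indptr
// indices:    output CSR column indices
// Note: columns within a row come out unsorted, unlike the sequential per-row loop.
__global__ void dense_to_csr_cols_kernel(const bool* adj, int* row_cursor,
                                         int* indices, size_t batch_size,
                                         size_t n_points)
{
  size_t total  = batch_size * n_points;
  size_t stride = size_t(gridDim.x) * blockDim.x;
  for (size_t idx = size_t(blockIdx.x) * blockDim.x + threadIdx.x; idx < total;
       idx += stride) {
    // consecutive threads read consecutive elements -> coalesced access
    if (adj[idx]) {
      size_t row = idx / n_points;
      int k = atomicAdd(&row_cursor[row], 1);  // the counter "k" from above
      indices[k] = (int)(idx % n_points);
    }
  }
}
```

Whether this beats the current kernel would of course need benchmarking, but it exposes ~B*N threads of parallelism even when B is small.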

cjnolet commented 4 years ago

Definitely a +1 on this, @MatthiasKohl. It would be valuable to find a better general technique for this. As many of the sparse prims are ultimately going to be moved to / consolidated with cuGraph's prims in RAFT, it would be great to see this make use of coalescing and better parallelism.

cjnolet commented 4 years ago

I’ve been thinking about this a little more as I’ve been working on some similar kernels that need to process individual rows of a CSR.

I’m thinking even for cases where the data is extremely sparse, we could use the average degree as a heuristic to set the block size and then map each row of the CSR to individual blocks and have the threads each process the columns in parallel. This would enable warp-level reductions and coalescing as well and, if the average degree grows well beyond the max number of threads, would still enable multi-stage reductions.
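Roughly, that pattern could look like this (a sketch with hypothetical names, not an existing cuML kernel; the per-element op is just a placeholder for whatever the row lambda would do):

```cuda
#include <cub/block/block_reduce.cuh>

// Sketch of the block-per-row pattern: one block per CSR row, threads stride
// over that row's entries (coalesced within a warp), then a block-wide
// reduction combines the partial results.
template <int TPB>
__global__ void csr_block_per_row_kernel(const int* indptr, const int* indices,
                                         int n_rows, int* out)
{
  using BlockReduce = cub::BlockReduce<int, TPB>;
  __shared__ typename BlockReduce::TempStorage temp;

  int row = blockIdx.x;  // one block per CSR row (uniform across the block)
  if (row >= n_rows) return;
  int row_start = indptr[row];
  int row_end   = indptr[row + 1];

  int local = 0;
  for (int j = row_start + threadIdx.x; j < row_end; j += TPB) {
    local += (indices[j] % 2 == 0);  // placeholder per-element op
  }

  int row_result = BlockReduce(temp).Sum(local);  // valid in thread 0 only
  if (threadIdx.x == 0) out[row] = row_result;
}
```

The launch would then be one block per row, e.g. `csr_block_per_row_kernel<128><<<n_rows, 128>>>(indptr, indices, n_rows, out);`, with the block size picked from the average-degree heuristic.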

MatthiasKohl commented 4 years ago

+1: in general, I like the idea of assigning rows to warps, as it helps with coalescing. There would only be one extreme case left, where the number of rows is very small and the average degree very large, but there we can use multi-stage reductions as you mentioned. Also, how would we determine average degree? In DBScan, we have it available "for free", but if we need to launch a separate kernel for it, that might not always be worth it.
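For the warp-per-row variant, something along these lines (again just a sketch with hypothetical names and a placeholder per-element op):

```cuda
// Sketch of the warp-per-row variant, not an existing cuML kernel: each warp
// owns one CSR row, lanes stride over the row's entries, and a shuffle
// reduction combines the per-lane partials. The early return is warp-uniform,
// so the full mask is safe in the shuffles.
__global__ void csr_warp_per_row_kernel(const int* indptr, const int* indices,
                                        int n_rows, int* out)
{
  int tid     = blockIdx.x * blockDim.x + threadIdx.x;
  int warp_id = tid / 32;  // one warp per CSR row
  int lane    = tid % 32;
  if (warp_id >= n_rows) return;

  int local = 0;
  for (int j = indptr[warp_id] + lane; j < indptr[warp_id + 1]; j += 32) {
    local += (indices[j] % 2 == 0);  // placeholder per-element op
  }
  // warp-level tree reduction via shuffles
  for (int offset = 16; offset > 0; offset >>= 1) {
    local += __shfl_down_sync(0xffffffff, local, offset);
  }
  if (lane == 0) out[warp_id] = local;
}
```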

cjnolet commented 4 years ago

> Also, how would we determine average degree?

I'm thinking the average of the differences of the indptr array might be the fastest way to compute this.
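Since the row lengths telescope, that average doesn't even need a real reduction; a minimal sketch (hypothetical helper, not a cuML API, assuming indptr / nnz are accessible on the host):

```cuda
// The mean of (indptr[i+1] - indptr[i]) telescopes to
// (indptr[n_rows] - indptr[0]) / n_rows, i.e. nnz / n_rows when indptr[0] == 0.
// Hypothetical host-side helper.
inline double average_degree(const int* indptr_host, int n_rows)
{
  return static_cast<double>(indptr_host[n_rows] - indptr_host[0]) / n_rows;
}
```

If indptr only lives on the device, reading that last element still costs a copy/sync, which is essentially the latency concern discussed below.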

At some point, I wouldn't mind doing some experiments with different ways to parallelize over CSR arrays and putting the results into a paper. I'm sure there are better heuristics for the optimal block size than just the avg degree; I just figure it could be a reasonable starting place to add some more parallelism.

MatthiasKohl commented 4 years ago

Yes, that makes sense; it's what I thought about, too. But in that case, if we imagine that indptr has 1k-10k elements, which I think is realistic at least for DBScan with a small batch size, then computing the average degree is basically just the launch latency of the reduction kernel (every thread on the GPU only processes ~1 element).

That's OK if the array doesn't turn out to be that sparse, but if it does turn out to be very sparse (say 10k-20k elements total), then you'd pay the full launch latency twice for an overall computation that is very small.

I don't have a better solution right now, though, other than maybe letting the caller pass in the average degree if it's already available, such as in DBScan...

Nyrio commented 3 years ago

While benchmarking, I made the same observation that csr_row_op is a massive bottleneck for large datasets.

I have a few comments regarding the discussion above, though.

Now, we want to fill the GPU while using good access patterns, so I'm thinking of the following approaches:

MatthiasKohl commented 3 years ago

Thanks for looking into this @Nyrio.

Nyrio commented 3 years ago

So, if I recap your idea of combining both approaches, we would:

Notes:

MatthiasKohl commented 3 years ago

Could we discuss this live at some point? I got confused about how rows and columns are actually defined in the implementation.

In any case, I agree with what you're suggesting, and I think it is the way to go; there are just some subtleties I wanted to make sure I understood correctly.

Nyrio commented 3 years ago

Sure, it's better to discuss that live. The definition of rows and columns is indeed quite confusing in the code.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Nyrio commented 3 years ago

This is not an inactive issue. This optimization is important for DBSCAN's performance and it is in my backlog.

tfeher commented 2 years ago

@ahendriksen is working on this