rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[TASK] Replacing transpose in `input_utils` with cpp transpose #1928

Open divyegala opened 4 years ago

divyegala commented 4 years ago

Describe the bug
This line creates a transposed copy of the input for a moment, doubling our memory usage: https://github.com/rapidsai/cuml/blob/49d231f3fef435d0d58ce914e5fac4e01209efe2/python/cuml/utils/input_utils.py#L191

Instead, we could replace this with a C++ prim that modifies the array in place, removing the need for the copy. (After weeks of discussion with @dantegd during benchmarking on how to avoid this, and even considering a CPU route, we finally seem to have found a promising solution.) I will look to create a PoC in a PR.
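
For illustration (this is not the actual `input_utils` code), a minimal CuPy sketch of the pattern: converting a row-major array to column-major materializes a second full-size copy, so peak GPU memory roughly doubles while both buffers are alive.

```python
# Illustration only -- not the actual input_utils code. Converting a
# C-contiguous (row-major) array to F-contiguous (column-major) with CuPy
# allocates a brand-new buffer, so both copies are resident on the GPU
# until the original is released: ~2x peak memory for the input.
import cupy as cp

X = cp.zeros((100_000, 512), dtype=cp.float32)  # row-major input
X_col_major = cp.asfortranarray(X)              # full-size copy in the new layout
```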

cjnolet commented 4 years ago

Referencing #583 (which this is a duplicate of)

cjnolet commented 4 years ago

> Instead, we could replace this with a C++ prim that modifies the array in place

@divyegala, I missed this detail when initially reading the description in this issue. While I think it's a great idea to use cuML's C++ primitive to gain the benefit of using cuBLAS (which CuPy might not be using for this?), I'm not sure the transpose of a non-square matrix can be done in place (i.e., without some sort of workspace/copy).

Tagging @teju85 for thoughts.

cjnolet commented 4 years ago

Just verified this with UMAP: passing the same pointer for the input and output to the cuML transpose prim on a non-square matrix yields the following error:

```
cuml/test/test_umap.py Got CUBLAS error 7 at /home/cjnolet/workspace/cuml/cpp/src_prims/linalg/transpose.h:43
CUBLAS_STATUS_INVALID_VALUE
```

teju85 commented 4 years ago

You're right, @cjnolet. Doing an in-place transpose on square matrices is much easier. However, even for rectangular matrices we can do an in-place transpose, but it requires more complicated decomposition schemes like the one mentioned in this NV Research paper, whose corresponding implementation is here. It goes without saying that we would be trading off perf for memory capacity by choosing in-place operations.
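
For reference, here is a small CPU-side sketch (NumPy; not the cuML prim or the GPU scheme from the paper) of the basic cycle-following permutation that such in-place rectangular transposes build on: in a row-major m x n matrix with size = m*n, the element at flat index i ends up at (i * m) mod (size - 1), and the permutation can be applied by walking its cycles with only O(1) extra element storage plus bookkeeping for visited indices.

```python
# CPU-side sketch of cycle-following in-place transpose for a rectangular
# row-major matrix (not the cuML prim; GPU implementations use more
# elaborate decompositions for coalescing, but the permutation is the same).
import numpy as np

def transpose_inplace(flat, m, n):
    """Transpose an m x n row-major matrix stored in the 1-D array `flat`,
    leaving the n x m result in row-major order in the same buffer."""
    size = m * n
    if size <= 1:
        return
    visited = np.zeros(size, dtype=bool)
    # Indices 0 and size-1 are fixed points of the permutation.
    for start in range(1, size - 1):
        if visited[start]:
            continue
        i, saved = start, flat[start]
        while True:
            visited[i] = True
            # Position i must receive the element whose destination is i,
            # i.e. the inverse permutation (i * n) % (size - 1).
            prev = (i * n) % (size - 1)
            if prev == start:
                flat[i] = saved
                break
            flat[i] = flat[prev]
            i = prev

m, n = 3, 5
a = np.arange(m * n, dtype=np.float32)
buf = a.copy()
transpose_inplace(buf, m, n)
assert np.array_equal(buf.reshape(n, m), a.reshape(m, n).T)
```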

Is the application so memory-capacity-limited that we cannot afford temporary buffers? In other words, is there a real need to have such an implementation in our ml-prims, and probably also to expose it through our C++ interface?

cjnolet commented 4 years ago

@teju85, it’s funny you mention the inplace project and the corresponding research. We found that as well, and it further illustrates the non-trivial nature of the problem.

While many of our algorithms accept different matrix layouts (e.g., row- vs column-major), it's often necessary for us to do a conversion under the hood when the input supplied by the user is not in the correct format. As a result of this (temporary) conversion, memory needs to be copied.

While this is not necessarily prohibitive for single-GPU algorithms, especially since the user still has the ability to do the conversion themselves, it does become a problem in distributed environments. Because of the synchronous nature of many of our current training algorithms, along with Dask's partitioning guarantees, holding a complete copy of all the training partitions can be prohibitive, and failures could even appear non-deterministically as Dask spreads partitions around based on the workers available to accept them.

The downside to doing an in-place update here is that the user may want to use this data to execute multiple algorithms concurrently, so modifying their input data would not be acceptable.

@dantegd had the idea that maybe we could supply an option to the user, like `inplace=True`, that would allow them to say “any layout conversions needed under the hood can be done in place, temporarily, until the algorithm is done executing, and then the data will be converted back before control is returned to the user”.
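
Roughly, that flow might look something like this from the Python side (a hypothetical sketch; `inplace_layout` and the in-place prim binding it relies on do not exist in cuML today):

```python
# Hypothetical sketch of the proposed "convert in place, run, convert back"
# flow. `inplace_transpose` stands in for a binding to the C++ in-place
# transpose prim discussed above; it is not an existing cuML function.
from contextlib import contextmanager

@contextmanager
def inplace_layout(X, inplace_transpose):
    """Temporarily flip X's layout in the user's own buffer, restoring
    the original layout when the block exits."""
    inplace_transpose(X)          # mutate the user's buffer in place
    try:
        yield X
    finally:
        inplace_transpose(X)      # convert back before returning control

# Usage (hypothetical):
# with inplace_layout(X, inplace_transpose):
#     model.fit(X)   # the algorithm sees the layout it needs, no extra copy
```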

Further, we could control how many of the conversions are performed concurrently on each worker (ideally by limiting this to the n_worker_threads configured in Dask), as sketched below.
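
As an illustrative sketch (not existing cuML or Dask code), a per-worker semaphore could cap the number of conversions in flight at once:

```python
# Illustration only -- not existing cuML or Dask code. A per-process
# semaphore caps how many layout conversions run concurrently on a worker.
import threading

# Assumed limit; in practice this would come from the worker's configured
# thread count rather than being hard-coded.
MAX_CONCURRENT_CONVERSIONS = 4
_conversion_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CONVERSIONS)

def convert_with_limit(X, convert):
    """Run convert(X), allowing at most MAX_CONCURRENT_CONVERSIONS
    conversions to execute at the same time on this worker."""
    with _conversion_slots:
        return convert(X)
```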

This will not be needed for asynchronous training functions, but it might offer a reasonable near-term solution that makes things easier for users without expecting them to know about or perform a manual conversion themselves.

An in-place transpose would also be useful in UMAP, since we currently need to transpose the result of the spectral embedding initialization. The current code has to allocate an additional array of size n * n_components (the size of the output embeddings array).