rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[DOC] Document which algorithms expect Fortran vs. C contiguous data #5929

Open beckernick opened 3 months ago

beckernick commented 3 months ago

For many algorithms, whether the input data is C or Fortran contiguous determines whether an expensive memory copy needs to be made. While this seems innocuous, it can have significant UX implications: most users don't understand it well, and when it does cause a problem, the resulting errors don't make the cause obvious.

We should document this.
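
To illustrate the distinction, here is a minimal sketch using NumPy on host as a stand-in (an assumption made for illustration; the issue itself concerns device arrays passed to cuML). It shows how to check an array's layout and that converting between layouts produces a copy, while a conversion to the layout the array already has does not:

```python
import numpy as np

# A 2D array in NumPy's default row-major (C) layout.
a_c = np.arange(6, dtype=np.float32).reshape(2, 3)

# The same values in column-major (Fortran) layout.
# The layout differs, so NumPy has to allocate and fill a new buffer.
a_f = np.asfortranarray(a_c)

# Layout flags: a 2x3 array can only be one of the two.
c_is_c = a_c.flags["C_CONTIGUOUS"]
c_is_f = a_c.flags["F_CONTIGUOUS"]
f_is_f = a_f.flags["F_CONTIGUOUS"]

# The conversion did not reuse the original memory, i.e. it copied.
copied = not np.shares_memory(a_c, a_f)

# Asking for the layout the array already has reuses the same memory.
reused = np.shares_memory(np.asfortranarray(a_f), a_f)

print(c_is_c, c_is_f, f_is_f, copied, reused)
```

Checking the flags before calling an estimator is one way a user could tell ahead of time whether a layout conversion is likely.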

viclafargue commented 1 month ago

Opened a PR that should inform users when a possibly unnecessary copy is performed. As stated here, data on host (NumPy arrays and pandas DataFrames) will be copied over to device anyway, cuDF DataFrames are deep-copied too, and cuDF Series are 1D and thus not affected by the issue. So only CUDA array interface compliant arrays (and Numba arrays) can end up being copied solely because of a change in data order/contiguity. This change should allow the user to be informed.

If the user is informed through logging, is it necessary to also document it? If so, should we add the expected data order/contiguity to the documentation of every function parameter that takes data, across the entire library? What should we do when function parameters are left undocumented (many occurrences)?
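
For discussion, the kind of notification described above could look roughly like the sketch below. This is not the code from the PR and not cuML's API: `as_order` and the logger name `layout_demo` are hypothetical, and NumPy arrays stand in for device arrays so the example runs anywhere.

```python
import logging

import numpy as np

logger = logging.getLogger("layout_demo")  # hypothetical logger name


def as_order(arr, order):
    """Return (array, copied) with the array in the requested order.

    Hypothetical helper: `order` is "C" or "F". A warning is logged only
    when the layout conversion forces a copy.
    """
    flag = "C_CONTIGUOUS" if order == "C" else "F_CONTIGUOUS"
    if arr.flags[flag]:
        # Already in the expected layout: no copy, no message.
        return arr, False
    logger.warning(
        "Input is not %s-contiguous; copying the data to convert its layout.",
        order,
    )
    return np.array(arr, order=order), True


x = np.ones((3, 2))                       # C-contiguous by default
x_f, x_copied = as_order(x, "F")          # layout change, so a copy
same, same_copied = as_order(x_f, "F")    # already Fortran order
vec, vec_copied = as_order(np.arange(3), "F")  # 1D is both C and F
```

The 1D case returns without copying because a contiguous 1D array satisfies both layouts, which matches the remark that 1D inputs are not affected.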