nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

Add support for `return_inverse=True` to `cunumeric.unique()` #829

Open betatim opened 1 year ago

betatim commented 1 year ago

I wanted to try out cunumeric (as installed from conda) with some scikit-learn code that relies only on NumPy, in particular the neural-network classifier (MLPClassifier). It should benefit from GPU acceleration and is written in pure Python on top of NumPy, not in custom Cython code like many other estimators in scikit-learn.

I hit a snag: cunumeric.unique() doesn't yet support all the keyword arguments that numpy.unique() supports, in this case return_inverse=True. The call in scikit-learn is at https://github.com/scikit-learn/scikit-learn/blob/fabe1606daa8cc275222935769d855050cb8061d/sklearn/model_selection/_split.py#L2074. Zooming out, it sits somewhere down the call chain from train_test_split, which is used to split up the data set.
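For readers unfamiliar with the keyword, here is a minimal sketch of what return_inverse=True returns in plain NumPy, together with one possible stop-gap for 1-D input built from unique() plus searchsorted(). The variable names are illustrative, and whether the same two calls behave identically in cunumeric is an assumption that has not been checked here.

```python
import numpy as np

# Illustrative labels, similar in spirit to the `y` passed to train_test_split.
y = np.array([1, 0, 1, 2, 0])

# Reference behaviour in NumPy: `inverse` holds, for every element of `y`,
# its index into the sorted array of unique values.
uniques_ref, inverse_ref = np.unique(y, return_inverse=True)

# Possible stop-gap for 1-D input without return_inverse:
# `uniques` is sorted and contains every value of `y`, so a left-sided
# binary search returns the exact position of each element.
uniques = np.unique(y)
inverse = np.searchsorted(uniques, y)

# The defining property: indexing the uniques with the inverse rebuilds `y`.
reconstructed = uniques[inverse]
```

This only covers flat arrays with no axis argument, which is narrower than the full numpy.unique() signature.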

To run scikit-learn with cunumeric without having to rewrite every import numpy statement across the scikit-learn code base, I used the following snippet, which hacks Python's import mechanism. I think it works reasonably well for quickly exploring what works and what doesn't, but it is an "afternoon hack", so it comes with the usual health warnings (please let me know if you find weird things).

import builtins
import inspect

orig_import = builtins.__import__

def use_numpy(name, fromlist):
    """Decide if sub-module import should use Numpy or cunumeric.

    Used to redirect sub-modules that are not yet supported in cunumeric
    back to numpy.
    """
    if fromlist is None:
        fromlist = tuple()

    for module in ("core", "ma", "lib"):
        if name.startswith(f"numpy.{module}"):
            return True
        if name == "numpy" and module in fromlist:
            return True

    return False

def my_import(name, globals=None, locals=None, fromlist=(), level=0):
    stack = inspect.stack()
    importing_file = stack[1].filename
    # We want to "rewrite" numpy imports when they originate from
    # inside the scikit-learn module, but not otherwise.
    # For example cunumeric should get the real numpy when it imports
    # it.
    if (
        name.startswith("numpy")
        # XXX Path where the scikit-learn code lives
        # XXX Make sure to adjust this for your install
        and "git/legate-exploring/scikit-learn/sklearn" in importing_file
    ):
        if not use_numpy(name, fromlist):
            print(
                f"Rewriting import for '{name}' from {importing_file}"
            )
            name = "cunumeric" + name.removeprefix("numpy")
    return orig_import(
        name, globals=globals, locals=locals, fromlist=fromlist, level=level
    )

builtins.__import__ = my_import

# The actual interesting code/use-case. Everything above is just infrastructure.

import time

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

clf = MLPClassifier(random_state=1, max_iter=300)

tic = time.perf_counter_ns()
clf.fit(X_train, y_train)
toc = time.perf_counter_ns()

print(f"Fitting took {(toc-tic)/1_000_000:.0f}ms")

clf.predict_proba(X_test[:1])

clf.predict(X_test[:5, :])

clf.score(X_test, y_test)

manopapad commented 1 year ago

Not the primary issue here, but can you check whether the lgpatch utility works for you, in place of your custom NumPy-import-overriding code? See https://nv-legate.github.io/cunumeric/23.03/user/usage.html#zero-code-change-patching

betatim commented 1 year ago

I tried lgpatch but it didn't start. I only have the following invocation in my shell history:

$ lgpatch -patch numpy scikit-learn/mlp.py
usage: lgpatch [-patch PATCH [PATCH ...]] PROG
lgpatch: error: the following arguments are required: PROG

I don't remember the exact error message for lgpatch scikit-learn/mlp.py -patch numpy, but that also didn't work. The mlp.py script is the code snippet from my first comment, with the custom import handler not activated.

As far as I can see, lgpatch replaces numpy for everyone, which I thought was odd because I thought that cunumeric needs access to "vanilla" numpy. At least that is why I went down the route of changing the import hook and looking at where the import statement is located.

manopapad commented 1 year ago

I can reproduce your issues with lgpatch, and have reported them as a separate issue.

> As far as I can see, lgpatch replaces numpy for everyone, which I thought was odd because I thought that cunumeric needs access to "vanilla" numpy. At least that is why I went down the route of changing the import hook and looking at where the import statement is located.

The lgpatch script is supposed to replace only application-level imports, and not cunumeric-internal imports. I haven't looked at it recently to be able to tell you exactly how it does that :-P