kynan / nbstripout

strip output from Jupyter and IPython notebooks
Other
1.19k stars 94 forks source link

Possible nbstripout-fast integration #179

Open mlucool opened 1 year ago

mlucool commented 1 year ago

Coming from https://github.com/kynan/nbstripout/issues/33#issuecomment-1345175852.

Hi,

This is a really excellent project, but like above I found it had too much overhead on larger repos. I wrote a pure rust version (with python bindings so it can be pip installed) located at https://github.com/deshaw/nbstripout-fast (happed to chose the same name as @stas00). My testing shows this is ~200x faster. Really intersted to hear your thoughts @kynan. This is not a true a 1:1 match, but all key features should be included and a few more added.

Is there a way in which this project could let users choose to use the rust version? Your install/setups scripts are great and clearly this project is very very popular. I think some sort of linkage there would be a net positive for the community.

@kynan

Is there a way in which this project could let users choose to use the rust version?

In theory yes, however doesn't it feel a little odd to have one tool install another?

Can you give more detail on what exactly you're thinking of? How do you see nbstripouts installation work when choosing nbstripout-fast ? I also think this discussion should be moved to a separate issue.

Rust is not ideal for everyone because 1) we can't have wheels for every OS and 2) it's harder to tinker with. I was thinking something along the lines of if this project added nbstripout-fast as an optional dependency that can gracefully fail then IFF nbstripout-fast installed and the options you want to use overlap (most do), it calls nbstripout-fast, if not it continues to call the python code (of course we can add a flag to force one or the other).

Mostly I just wanted to put the thought out there and see if there was any traction. I totally understand if this does not seem like something you want to do for this project as it's not a super clean design. Open to other ideas as well.

arobrien commented 1 year ago

A few projects have optional dependencies on accelerator libraries. For example GeoPandas will use pygeos if available, otherwise fallback to alternative implementations (with configuration overrides). I think this serves as a good model for how this speedup can be made available to users who need it, without fracturing the ecosystem.

# _compat.py

INSTALL_PYGEOS_ERROR = "To use PyGEOS within GeoPandas, you need to install PyGEOS: \
'conda install pygeos' or 'pip install pygeos'"

try:
    import pygeos  # noqa

    # only automatically use pygeos if version is high enough
    if Version(pygeos.__version__) >= Version("0.8"):
        HAS_PYGEOS = True
        PYGEOS_GE_09 = Version(pygeos.__version__) >= Version("0.9")
        PYGEOS_GE_010 = Version(pygeos.__version__) >= Version("0.10")
    else:
        warnings.warn(
            "The installed version of PyGEOS is too old ({0} installed, 0.8 required),"
            " and thus GeoPandas will not use PyGEOS.".format(pygeos.__version__),
            UserWarning,
        )
        HAS_PYGEOS = False
except ImportError:
    HAS_PYGEOS = False

def set_use_pygeos(val=None):
    """
    Set the global configuration on whether to use PyGEOS or not.
    The default is use PyGEOS if it is installed. This can be overridden
    with an environment variable USE_PYGEOS (this is only checked at
    first import, cannot be changed during interactive session).
    Alternatively, pass a value here to force a True/False value.
    """
    global USE_PYGEOS
    global USE_SHAPELY_20
    global PYGEOS_SHAPELY_COMPAT

    env_use_pygeos = os.getenv("USE_PYGEOS", None)

    if val is not None:
        USE_PYGEOS = bool(val)
    else:
        if USE_PYGEOS is None:

            USE_PYGEOS = HAS_PYGEOS

            if env_use_pygeos is not None:
                USE_PYGEOS = bool(int(env_use_pygeos))

    # validate the pygeos version
    if USE_PYGEOS:
        try:
            import pygeos  # noqa

            # validate the pygeos version
            if not Version(pygeos.__version__) >= Version("0.8"):
                if SHAPELY_GE_20:
                    USE_PYGEOS = False
                    warnings.warn(
                        "The PyGEOS version is too old, and Shapely >= 2 is installed, "
                        "thus using Shapely by default and not PyGEOS."
                    )
                else:
                    raise ImportError(
                        "PyGEOS >= 0.8 is required, version {0} is installed".format(
                            pygeos.__version__
                        )
                    )

            # Check whether Shapely and PyGEOS use the same GEOS version.
            # Based on PyGEOS from_shapely implementation.

            from shapely.geos import geos_version_string as shapely_geos_version
            from pygeos import geos_capi_version_string

            # shapely has something like: "3.6.2-CAPI-1.10.2 4d2925d6"
            # pygeos has something like: "3.6.2-CAPI-1.10.2"
            if not shapely_geos_version.startswith(geos_capi_version_string):
                warnings.warn(
                    "The Shapely GEOS version ({}) is incompatible with the GEOS "
                    "version PyGEOS was compiled with ({}). Conversions between both "
                    "will be slow.".format(
                        shapely_geos_version, geos_capi_version_string
                    )
                )
                PYGEOS_SHAPELY_COMPAT = False
            else:
                PYGEOS_SHAPELY_COMPAT = True

        except ImportError:
            raise ImportError(INSTALL_PYGEOS_ERROR)

    if USE_PYGEOS and env_use_pygeos is None and SHAPELY_GE_20:
        warnings.warn(
            "Shapely 2.0 is installed, but because PyGEOS is also installed, "
            "GeoPandas will still use PyGEOS by default for now. To force to use and "
            "test Shapely 2.0, you have to set the environment variable USE_PYGEOS=0. "
            "You can do this before starting the Python process, or in your code "
            "before importing geopandas:"
            "\n\nimport os\nos.environ['USE_PYGEOS'] = '0'\nimport geopandas\n\n"
            "In a future release, GeoPandas will switch to using Shapely by default. "
            "If you are using PyGEOS directly (calling PyGEOS functions on geometries "
            "from GeoPandas), this will then stop working and you are encouraged to "
            "migrate from PyGEOS to Shapely 2.0 "
            "(https://shapely.readthedocs.io/en/latest/migration_pygeos.html).",
            stacklevel=6,
        )

    USE_SHAPELY_20 = (not USE_PYGEOS) and SHAPELY_GE_20

set_use_pygeos()

I think that the (having not done any profiling of my own....) core slowdowns are likely happening in the file read/write, and in the strip_output() method (along with the equivalent zeppelin method. strip_output() isn't really too complicated, so maintaining two parallel implementations wouldn't be too hard, given a robust enough test suite. I want to restructure the code to be a little bit more modular in the area, including creating a StripOptions class or similar, anyway, which would facilitate a clearer and more robust interface between these two projects.

mlucool commented 1 year ago

I think that the (having not done any profiling of my own....) core slowdowns are likely happening in the file read/write, and in the strip_output() method

My testing showed even starting python was too slow if it needed to be done 100s of times (once per file). That being said, there is also a git protocol to start the binary once and reuse that proces; I could see this python overhead time being negligible then*. Doing this easily depends on maturin support for shipping both python and bin (https://github.com/PyO3/maturin/issues/368), calling nbstripout-fast from python is possible. Today it's setup to use python bindings during unit testing and creating a bin with wheels for release.

*This may speedup nbstripout without nbstripout-fast

kynan commented 1 year ago

I think to automatically select nbstripout-fast when installed we'd need to ensure that both really are compatible and nbstripout's tests aren't nearly robust enough to make this possible.

I'd prefer to go about this the other way around: how about we refactor the installation commands to be modular enough so nbstripout-fast can reuse them without reinventing the wheel? Would that work for you?

mlucool commented 1 year ago

how about we refactor the installation commands to be modular enough so nbstripout-fast can reuse them without reinventing the wheel? Would that work for you?

That seems like a good starting point given where the projects are at now

kynan commented 1 year ago

If you want to contribute to this refactoring let me know. Might be able to find some time over the holiday break otherwise.

mlucool commented 1 year ago

Happy to review if you do find some time to work on this!