rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

Checking for NA without importing pandas #370

Closed char101 closed 7 months ago

char101 commented 7 months ago

Hi,

Importing pandas takes about 500ms on my machine. 500ms might not be long, but for a GUI application that needs to be restarted over and over it might increase the startup time noticeably. Since pandas is only used to check for NA, do you think it would be possible to check for NA another way, without importing pandas?

For example

type(s) == 'pandas._libs.missing.NAType'

str(s) == '<NA>'

NA_hash = (2 ** 61 - 1) if sys.maxsize > 2 ** 32 else (2 ** 31 - 1)
hash(s) == NA_hash

Alternatively, pandas could be imported only when it is already in sys.modules, which makes the pandas import essentially zero cost. So something like

if 'pandas' in sys.modules:
    import pandas
    pandas_NA = pandas.NA
else:
    pandas_NA = None

def is_none(s):
    return s is None or s is pandas_NA or type(s) == 'pandas._libs.missing.NAType'

maxbachmann commented 7 months ago

From your suggestions I think comparing the type would be the only option that is at all feasible. It would need to be done as:

type_ = type(s)
if type_.__module__ == 'pandas._libs.missing' and type_.__name__ == 'NAType':
    return True

though. This costs quite a bit of performance: process.extract("a", ["a"]*10000) takes 0.6ms instead of 0.46ms, which is around 30% slower. So I would need to move e.g. the None check after the string check to reduce this impact (which could make sense anyway, since strings are much more likely than None).
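As a rough illustration of the type-name check described above, here is a minimal sketch. It never imports pandas; the `FakeNAType` class is a hypothetical stand-in built only so the example runs without pandas installed, and `is_pandas_na_type` is an illustrative name, not a RapidFuzz function. Note that comparing `type(s)` directly to a string would always be False, which is why the module and name are compared separately.

```python
def is_pandas_na_type(obj):
    """Return True if obj's type is named like pandas' NAType.

    Only the type's module and name are inspected, so pandas is never imported.
    """
    type_ = type(obj)
    return type_.__module__ == "pandas._libs.missing" and type_.__name__ == "NAType"


# Hypothetical stand-in that mimics the module/name of pandas' NAType.
FakeNAType = type("NAType", (), {"__module__": "pandas._libs.missing"})

print(is_pandas_na_type(FakeNAType()))  # stand-in matches by name
print(is_pandas_na_type("a"))           # ordinary strings do not match
print(is_pandas_na_type(None))          # None has to be handled separately
```

Because two attribute lookups and two string comparisons run for every non-string input, the order of the checks matters for throughput, as the timing above suggests.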

One idea I had is something like the following:

if 'pandas' in sys.modules:
    import pandas
    pandas_NA = pandas.NA
else:
    pandas_NA = None

def extract(...):
    global pandas_NA  # must be declared before pandas_NA is first used
    if pandas_NA is None and 'pandas' in sys.modules:
        import pandas
        pandas_NA = pandas.NA

I will experiment around with this this evening.
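A runnable sketch of this lazy lookup might look like the following. It is an illustration under assumptions, not the code that ended up in RapidFuzz: the helper names `_resolve_na` and `is_none` and the `module_name` parameter are hypothetical, and the parameter exists only so the demo can use a fake module instead of the real pandas.

```python
import sys
import types

_pandas_NA = None  # stays None until the module shows up in sys.modules


def _resolve_na(module_name="pandas"):
    """Pick up <module>.NA once the module has been imported by someone else."""
    global _pandas_NA
    if _pandas_NA is None:
        module = sys.modules.get(module_name)
        if module is not None:
            # NA may still be missing while the module is only partially imported.
            _pandas_NA = getattr(module, "NA", None)
    return _pandas_NA


def is_none(s, module_name="pandas"):
    if s is None:
        return True
    na = _resolve_na(module_name)
    return na is not None and s is na


# Demo with a fake module (hypothetical name) so pandas itself is not needed.
fake = types.ModuleType("fake_pandas_for_demo")
fake.NA = object()

before = is_none(fake.NA, "fake_pandas_for_demo")  # module not registered yet
sys.modules["fake_pandas_for_demo"] = fake         # simulate the user's import
after = is_none(fake.NA, "fake_pandas_for_demo")
print(before, after)
```

The cost when pandas is absent is one dictionary lookup per call, and nothing in the import system is modified.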

char101 commented 7 months ago

Another idea is to install a hook in sys.meta_path (inspired by importhook) that simply sets pandas_NA when pandas is imported.

Test code

import sys
from importlib.abc import MetaPathFinder

pandas_NA = None

class MyMetaFinder(MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        global pandas_NA
        if pandas_NA is None and fullname.startswith('pandas.') and 'pandas' in sys.modules and hasattr(sys.modules['pandas'], 'NA'):
            pandas_NA = getattr(sys.modules['pandas'], 'NA')
            # list.index raises ValueError rather than returning -1,
            # so test for membership before removing the hook
            if self in sys.meta_path:
                sys.meta_path.remove(self)

sys.meta_path.insert(0, MyMetaFinder())

import pandas

print(pandas_NA) # <NA>

maxbachmann commented 7 months ago

Wow didn't know about this. Pretty sure that's what I will go with.

maxbachmann commented 7 months ago

What's the reason behind fullname.startswith('pandas.')? Does this allow any faster exit than 'pandas' in sys.modules in the case that pandas is not imported?

char101 commented 7 months ago

If fullname == 'pandas', it's the start of the partial import of the pandas module. Afterwards it imports a lot of pandas submodules. Actually, NA is only available after pandas has finished importing all of its submodules.
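The call order described here can be observed with a small throwaway package. The sketch below is an assumption-laden illustration: `demo_pkg`, its submodules `first` and `second`, and `RecordingFinder` are hypothetical names standing in for pandas and the hook. The finder only records what it is asked and always returns None, so the regular finders still perform the actual import.

```python
import importlib
import os
import sys
import tempfile
from importlib.abc import MetaPathFinder


class RecordingFinder(MetaPathFinder):
    """Records each find_spec call and whether demo_pkg.NA exists at that point."""

    def __init__(self):
        self.calls = []

    def find_spec(self, fullname, path, target=None):
        if fullname.startswith("demo_pkg"):
            pkg = sys.modules.get("demo_pkg")
            self.calls.append((fullname, hasattr(pkg, "NA")))
        return None  # never resolve anything; defer to the other finders


with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = os.path.join(tmp, "demo_pkg")
    os.mkdir(pkg_dir)
    # __init__ imports one submodule, defines NA, then imports another.
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("from demo_pkg import first\nNA = object()\nfrom demo_pkg import second\n")
    for name in ("first", "second"):
        with open(os.path.join(pkg_dir, name + ".py"), "w") as f:
            f.write("")

    finder = RecordingFinder()
    sys.path.insert(0, tmp)
    sys.meta_path.insert(0, finder)
    importlib.invalidate_caches()
    try:
        importlib.import_module("demo_pkg")
    finally:
        sys.meta_path.remove(finder)
        sys.path.remove(tmp)

for fullname, has_na in finder.calls:
    print(fullname, has_na)
```

The hook sees the bare package name first, while the package is not yet in sys.modules, and can only observe NA on a later submodule import.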

Technically, I think the fullname.startswith('pandas.') comparison can be eliminated. I'm not sure whether startswith is faster than searching sys.modules.

maxbachmann commented 7 months ago

Yes but that's already checked by 'pandas' in sys.modules and hasattr(sys.modules['pandas'], 'NA') as well. The addition of fullname.startswith('pandas.') only avoids running 'pandas' in sys.modules when not importing pandas as far as I understand.

However from what I can tell the hashmap lookup should actually be faster, so I was a bit confused by the string comparison.

maxbachmann commented 7 months ago

In a quick test "pandas" in sys.modules took around 14ns, while fullname.startswith('pandas.') took around 40ns. So I think I will just leave it out.
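A comparison like this can be reproduced with timeit. The sketch below is only one way to set it up: the value of `fullname` is a hypothetical example, and the absolute numbers depend on the machine and Python version, so no expected output is given.

```python
import sys
import timeit

fullname = "pandas.core.api"  # hypothetical module name passed to find_spec
n = 200_000

# Time the dictionary membership test against the string prefix test.
t_lookup = timeit.timeit("'pandas' in sys.modules", globals={"sys": sys}, number=n)
t_prefix = timeit.timeit(
    "fullname.startswith('pandas.')", globals={"fullname": fullname}, number=n
)

print(f"'pandas' in sys.modules:        {t_lookup / n * 1e9:.1f} ns per call")
print(f"fullname.startswith('pandas.'): {t_prefix / n * 1e9:.1f} ns per call")
```

At this scale the measurement overhead is comparable to the operations themselves, so differences of a few nanoseconds should be read with caution.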

char101 commented 7 months ago

This might be faster. In my test, string comparison is faster than hash lookup.

class MyMetaFinder(MetaPathFinder):
    def __init__(self):
        self.pandas_imported = False

    def find_spec(self, fullname, path, target=None):
        global pandas_NA
        if fullname == 'pandas':
            self.pandas_imported = True
        elif self.pandas_imported:
            mod = sys.modules['pandas']
            if hasattr(mod, 'NA'):
                pandas_NA = mod.NA

maxbachmann commented 7 months ago

Yes that's likely faster :+1:

char101 commented 7 months ago

And just because we can, adding slots will make the attribute access faster by skipping dict lookup.

class MyMetaFinder(MetaPathFinder):
    __slots__ = ('pandas_imported', )

maxbachmann commented 7 months ago

After thinking about this a bit more, I decided that the hook into sys.meta_path is a bit too risky and could break things for users. So I decided to go with the initial approach of checking sys.modules.

I will probably make a release with the change later this week.

imaurer commented 6 months ago

Nice improvement.