Closed char101 closed 7 months ago
From your suggestions I think comparing the type would be the only option that is at all feasible. It would need to be done as:
type_ = type(s)
if type_.__module__ == 'pandas._libs.missing' and type_.__name__ == 'NAType':
    return True
though. This costs quite a bit of performance: process.extract("a", ["a"]*10000) takes 0.6ms instead of 0.46ms, which is around 30% slower. So I would need to move e.g. the None check after the string check to reduce this impact (which could make sense anyway, since strings are much more likely than None).
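A minimal sketch of that reordering, assuming a hypothetical helper named is_missing (not the library's actual code). Strings exit first, None second, and only the remaining objects pay for the type-name comparison:

```python
def is_missing(s):
    """Hypothetical helper: True if s should be treated as a missing value."""
    # Fast path: strings are by far the most common input.
    if isinstance(s, str):
        return False
    if s is None:
        return True
    # Slow path: recognise pandas.NA by its type name, without importing pandas.
    type_ = type(s)
    return (type_.__module__ == 'pandas._libs.missing'
            and type_.__name__ == 'NAType')
```

With this ordering, a list of plain strings never reaches the two attribute lookups on the type.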
One idea I had is something like the following:
import sys

if 'pandas' in sys.modules:
    import pandas
    pandas_NA = pandas.NA
else:
    pandas_NA = None

def extract(...):
    # global has to be declared before pandas_NA is first used in the function
    global pandas_NA
    if pandas_NA is None and 'pandas' in sys.modules:
        import pandas
        pandas_NA = pandas.NA
I will experiment around with this this evening.
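For reference, the same idea written as a self-contained sketch. The names lazy_attr and is_pandas_na are made up for illustration, and the cache is an assumption about how one might avoid repeating the lookup; the point is only that nothing here ever triggers an import:

```python
import sys

_cache = {}  # (module_name, attr) -> resolved object


def lazy_attr(module_name, attr):
    """Return module.attr if the module is already imported, else None.

    Never imports anything itself; it only looks at sys.modules.
    """
    key = (module_name, attr)
    value = _cache.get(key)
    if value is None:
        module = sys.modules.get(module_name)
        if module is None:
            return None  # module was never imported by the application
        value = getattr(module, attr, None)
        if value is None:
            return None  # module is only partially imported so far
        _cache[key] = value
    return value


def is_pandas_na(s):
    """True only if pandas is already loaded and s is the pandas.NA singleton."""
    na = lazy_attr('pandas', 'NA')
    return na is not None and s is na
```

If the application never imports pandas, is_pandas_na costs one dict lookup in the cache and one in sys.modules, and always returns False.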
Another idea is to install a hook in sys.meta_path (inspired by importhook) that simply sets pandas_NA when pandas is imported.
Test code
import sys
from importlib.abc import MetaPathFinder

pandas_NA = None

class MyMetaFinder(MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        global pandas_NA
        if pandas_NA is None and fullname.startswith('pandas.') and 'pandas' in sys.modules and hasattr(sys.modules['pandas'], 'NA'):
            pandas_NA = getattr(sys.modules['pandas'], 'NA')
            # list.index() raises ValueError instead of returning -1,
            # so test membership before removing the hook
            if self in sys.meta_path:
                sys.meta_path.remove(self)
        return None  # never handle the import, leave it to the other finders

sys.meta_path.insert(0, MyMetaFinder())

import pandas
print(pandas_NA) # <NA>
Wow didn't know about this. Pretty sure that's what I will go with.
What's the reason behind fullname.startswith('pandas.')? Does this allow any faster exit than 'pandas' in sys.modules in the case that pandas is not imported?
If fullname == 'pandas', it's the start of the partial import of the pandas module. Afterwards it will import a lot of pandas submodules. Actually, NA is only available after pandas has finished importing all of its submodules.
Technically I think the fullname.startswith('pandas.') comparison can be eliminated. I'm not sure if startswith is faster than searching sys.modules.
Yes, but that's already checked by 'pandas' in sys.modules and hasattr(sys.modules['pandas'], 'NA') as well. The addition of fullname.startswith('pandas.') only avoids running 'pandas' in sys.modules when not importing pandas, as far as I understand.
However from what I can tell the hashmap lookup should actually be faster, so I was a bit confused by the string comparison.
In a quick test "pandas" in sys.modules took around 14ns, while fullname.startswith('pandas.') took around 40ns. So I think I will just leave it out.
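A rough way to reproduce this kind of measurement with timeit, as a sketch. The value of fullname is just an example of what find_spec might receive, the plain equality test is added only for comparison, and absolute numbers will differ between machines and Python versions:

```python
import sys
import timeit

fullname = 'os.path'  # example value; any non-pandas module name works
n = 200_000           # iterations per measurement

candidates = {
    "'pandas' in sys.modules": {'sys': sys},
    "fullname.startswith('pandas.')": {'fullname': fullname},
    "fullname == 'pandas'": {'fullname': fullname},
}

results = {}
for stmt, namespace in candidates.items():
    total = timeit.timeit(stmt, globals=namespace, number=n)
    results[stmt] = total / n * 1e9  # average nanoseconds per evaluation
    print(f"{stmt:35s} {results[stmt]:6.1f} ns")
```

Timings this small are dominated by interpreter overhead, so it is worth running the script a few times before drawing conclusions.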
This might be faster. In my test, string comparison is faster than hash lookup.
class MyMetaFinder(MetaPathFinder):
    def __init__(self):
        self.pandas_imported = False

    def find_spec(self, fullname, path, target=None):
        global pandas_NA
        if fullname == 'pandas':
            self.pandas_imported = True
        elif self.pandas_imported:
            mod = sys.modules['pandas']
            if hasattr(mod, 'NA'):
                pandas_NA = mod.NA
Yes that's likely faster :+1:
And just because we can, adding slots will make the attribute access faster by skipping dict lookup.
class MyMetaFinder(MetaPathFinder):
    __slots__ = ('pandas_imported', )
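A small sketch of what __slots__ changes, using two plain classes with made-up names rather than the finder itself. One caveat: the per-instance __dict__ only disappears when every base class defines __slots__ too, and whether that holds for MetaPathFinder is not checked here.

```python
class WithDict:
    def __init__(self):
        self.pandas_imported = False


class WithSlots:
    __slots__ = ('pandas_imported',)

    def __init__(self):
        self.pandas_imported = False


a = WithDict()
b = WithSlots()

print(hasattr(a, '__dict__'))  # True: attributes live in a per-instance dict
print(hasattr(b, '__dict__'))  # False: attributes live in fixed slots

try:
    b.other = 1  # not listed in __slots__
except AttributeError:
    print("slotted instances reject unknown attributes")
```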
After thinking about this a bit more, I decided that the hook into sys.meta_path is a bit too risky and could break things for users. So I decided to go with the initial approach of checking sys.modules.
I will probably make a release with the change later this week.
Nice improvement.
Hi,
Importing pandas requires about 500ms on my machine. 500ms might not be long, but for a GUI application that needs to be restarted over and over it might increase the startup time. Since pandas is only used to check for NA, do you think it is possible to use another way to check for NA without importing pandas?
For example, pandas could be imported only when it is already in sys.modules, thus making the pandas import zero cost. So something like