pandas-dev / pandas


Cache repeated mapping and apply #4612

Closed NigelCleland closed 10 years ago

NigelCleland commented 11 years ago

I'm not sure if this functionality already exists, but it is boilerplate that I need quite often.

Typically I work with very large datasets, e.g. 10 million rows or more. On these datasets, a calculation such as parsing a date from a specific string format can be sped up enormously by first assessing what actually needs to be computed.

As such, I find myself writing the following boilerplate code for a lot of functions.

from datetime import datetime

def map_strptime(series, format=None):
    # Parse each unique value once, then map the results back onto the series
    mapping = {x: datetime.strptime(x, format) for x in set(series)}
    return series.map(mapping)

Now compare this implementation against a more naive implementation that just uses the standard apply function, e.g.

def nonmap_strptime(x, format=None):
    # Called once per row via Series.apply
    return datetime.strptime(x, format)

Usage:

df["Date"] = map_strptime(df["Trading Date"], format="%Y-%m-%d")

# vs.

df["Date"] = df["Trading Date"].apply(nonmap_strptime, format="%Y-%m-%d")

Assessing these two implementations for performance, I get the following. Note that my dataset has 10 million rows spread across 365 days, i.e. a lot of repeated calculations.

# Mapped version
>>> %timeit df["Date"] = map_strptime(df["Trading Date"], format="%Y-%m-%d")
900 ms

# Non-mapped version
>>> %timeit df["Date"] = df["Trading Date"].apply(nonmap_strptime, format="%Y-%m-%d")
(Still hasn't finished in the time it has taken me to write this issue, 10 min plus)

As you can see, for some use cases, e.g. when the same calculation is repeated many times, a simple set-then-map wrapper around the operation can improve performance substantially.

However, if most of the values are non-duplicates, this would obviously slow the operation down quite substantially, since the calculation and the mapping then happen in two steps rather than one.

To get around this, the number of unique items could be compared against the total number of items up front. If the unique values are far fewer than the total rows, switching to the mapping implementation could yield large performance gains.

I'm not sure as to the best implementation, but something along the lines of:

def memoapply(series, func, *args, **kwargs):
    # Only memoise when there are far fewer unique values than rows
    # (the factor-of-10 threshold is illustrative)
    if series.nunique() < len(series) // 10:
        # Memoised version: compute once per unique value, then map back
        mapping = {x: func(x, *args, **kwargs) for x in series.unique()}
        return series.map(mapping)
    else:
        # Standard version: compute once per row
        return series.apply(lambda x: func(x, *args, **kwargs))

This works best for one-to-one operations; I'm unsure how well it would work for one-to-many or many-to-one applications.

Could it possibly be extended by using the built-in merge operations of pandas and a multi-indexed series for the result?
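For illustration, here is a rough sketch of that merge-based idea (only a sketch, reusing df and the "Trading Date" column from the example above; the intermediate names are made up):

from datetime import datetime

# Compute the conversion once per unique value, then merge the result
# back onto the full frame.
unique_dates = df[["Trading Date"]].drop_duplicates()
unique_dates["Date"] = unique_dates["Trading Date"].map(
    lambda x: datetime.strptime(x, "%Y-%m-%d"))
df = df.merge(unique_dates, on="Trading Date", how="left")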

jreback commented 11 years ago

is there a reason you are not using pd.to_datetime() (which takes a format as well)?
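For example (a minimal sketch using the frame and column from the example above):

import pandas as pd

# Vectorised parsing of the whole column in one call
df["Date"] = pd.to_datetime(df["Trading Date"], format="%Y-%m-%d")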

NigelCleland commented 11 years ago

Mostly being unaware of it (mea culpa).

Using the pd.to_datetime method speeds things up considerably.

I still get a slight performance improvement from using the set-based version above, but that is most likely due to its lack of exception handling etc.

800 ms vs 1.8 s using the pd.to_datetime method.

Interestingly, supplying a format to the to_datetime method slowed it down considerably; it appears to parse each value one by one again.

So, in terms of speed:

  1. The set-based version above: 800 ms
  2. pd.to_datetime, no format specified: 1.8 s

  .... (huge performance loss here)

  3. pd.to_datetime, format specified: 55 s
  4. Naive datetime implementation: minutes

Is this helpful? I have a horrible dataset of 10 million rows I can run any other tests on if you'd like.

jreback commented 11 years ago

so in a single series you have a lot of repeated dates? caching would help considerably with that (which is what your method does)

but if they are different for each row then not so much
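A quick way to gauge how much repetition a column has (just a sketch, reusing the frame and column from the example above):

# Share of distinct values; a small ratio means lots of repeats, so caching pays off
repetition_ratio = df["Trading Date"].nunique() / len(df)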

NigelCleland commented 11 years ago

Yes. Hence the title.

The benefits would obviously be far more substantial the more rows are repeated, which is why a bit of boilerplate to switch between caching and not caching could be useful.

The to_datetime method solves this particular use case, but caching as a general method could still be useful in other, non-date-based use cases.

I thought I'd suggest it because I read on one of the mailing lists that many features are trivial to implement once the developers know about them, so I took that to heart and submitted this, as it's an issue I run into quite frequently in my work.

Thank you for putting me on to the .to_datetime method though, it's appreciated.

jreback commented 10 years ago

think about adding a use_cache= keyword to to_datetime to internally cache these types of conversions
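For reference, later pandas versions did grow a caching keyword on to_datetime (the name ended up as cache rather than use_cache); a minimal sketch, assuming a reasonably recent pandas:

import pandas as pd

# cache=True computes the conversion once per unique value and maps it back,
# which is essentially the memoised approach discussed above
df["Date"] = pd.to_datetime(df["Trading Date"], format="%Y-%m-%d", cache=True)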

jreback commented 10 years ago

this is pretty much solved by #5490