Closed NigelCleland closed 10 years ago
Is there a reason you are not using pd.to_datetime() (which takes a format as well)?
Mostly being unaware of it (mea culpa).
Using the pd.to_datetime method speeds things up considerably.
I still get a slight performance improvement from the set-based version above, but that is most likely due to the lack of exception handling, etc.
800ms vs 1.8s using the pd.to_datetime method.
Interestingly, supplying a format to the to_datetime method slowed it down considerably; it appears to go back to parsing the values one by one.
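A minimal sketch of the three approaches being compared. The sample data and column contents are assumptions for illustration; actual timings will vary by machine and pandas version.

```python
import pandas as pd

# Many repeated date strings, as in the dataset described here.
dates = pd.Series(["2013-01-%02d" % (i % 28 + 1) for i in range(10000)])

# 1. Plain to_datetime: lets pandas infer the format.
parsed = pd.to_datetime(dates)

# 2. to_datetime with an explicit format: in the pandas version discussed
#    here this fell back to element-by-element parsing, hence the slowdown.
parsed_fmt = pd.to_datetime(dates, format="%Y-%m-%d")

# 3. Manual caching: parse each unique string once, then map the results back.
unique_dates = dates.unique()
lookup = dict(zip(unique_dates, pd.to_datetime(unique_dates)))
parsed_cached = dates.map(lookup)
```

All three produce the same result; only the amount of repeated parsing work differs.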
So, in terms of speed: … (huge performance loss here when a format is supplied).
Is this helpful? I have a horrible dataset of 10 million rows I can run any other tests on if you'd like.
So in a single series you have a lot of repeated dates? Caching would help considerably with that (which is what your method does),
but if they are different for each row then not so much.
Yes. Hence the title.
The benefits would obviously be far more substantial the more rows are repeated, which is why a bit of boilerplate to switch between caching and not caching could be useful.
The to_datetime method solves this particular use case, but caching as a method could still be useful in other non-date-based use cases.
I thought I'd suggest it, as I read on one of the mailing lists that many features are trivial to implement once the developers know about them. I took that to heart and submitted this, as it's an issue I'd run into quite frequently in my work.
Thank you for putting me on to the .to_datetime method though, it's appreciated.
Think about adding a use_cache= keyword to to_datetime to internally cache these types of conversions.
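A sketch of what such an internal cache might do, written as a standalone wrapper. `to_datetime_cached` is a hypothetical helper for illustration, not a real pandas function (later pandas versions did grow a `cache` keyword on `pd.to_datetime` along these lines).

```python
import pandas as pd

def to_datetime_cached(values, **kwargs):
    """Parse each unique value once, then map the results back onto the input."""
    values = pd.Series(values)
    unique_vals = values.unique()
    # Build a lookup table from raw value to parsed Timestamp.
    mapping = pd.Series(pd.to_datetime(unique_vals, **kwargs), index=unique_vals)
    return values.map(mapping)
```

With many repeated values this does far fewer parses than calling the parser per row; with all-unique values it just adds overhead.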
this is pretty much solved by #5490
I'm not sure if this functionality exists, but it is boilerplate which I need to use quite often.
Typically I work with very large datasets, e.g. 10 million rows or more. On these datasets, calculations such as parsing a date in a specific string format can be sped up considerably by assessing what needs to be done first.
As such I find myself writing the following boilerplate code for a lot of functions.
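One version of the boilerplate described above: compute the function once per unique value, then map the results back. The names `cached_apply` and `parse_date` are illustrative stand-ins for the actual code, which was not included here.

```python
from datetime import datetime

import pandas as pd

def cached_apply(series, func):
    """Apply func once per unique value, then broadcast the results back."""
    mapping = {value: func(value) for value in series.unique()}
    return series.map(mapping)

def parse_date(s):
    # The per-row calculation being repeated: parsing a fixed-format date.
    return datetime.strptime(s, "%Y-%m-%d")
```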
Now, comparing this implementation vs. a more naive implementation just using the standard apply function, e.g.
Usage:
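The original usage snippet was not included; below is a hedged reconstruction showing the naive per-row apply next to the set/map caching version, with illustrative names (`parse_date`, `cached_apply`).

```python
from datetime import datetime

import pandas as pd

def parse_date(s):
    return datetime.strptime(s, "%Y-%m-%d")

def cached_apply(series, func):
    mapping = {value: func(value) for value in series.unique()}
    return series.map(mapping)

# Two distinct dates repeated 5,000 times each.
dates = pd.Series(["2013-06-01", "2013-06-02"] * 5000)

naive = dates.apply(parse_date)           # parse_date runs 10,000 times
cached = cached_apply(dates, parse_date)  # parse_date runs twice
```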
Assessing these two implementations for performance, I get the following. Note: my dataset has 10 million rows across 365 days, i.e. a lot of repeated calculations.
As you can see, for some use cases, e.g. when a calculation is repeated often, a simple set/map-style wrapper around the operation can improve performance substantially.
However, if a large number of non-duplicates exists, this would obviously slow the operation down quite substantially, as the calculation and the mapping happen in two steps rather than one.
To get around this, the number of unique and non-unique items could be computed first. If a large difference between them exists, a shift to the mapping implementation could yield large performance gains.
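A sketch of that heuristic: cache only when the series has substantially fewer unique values than rows. The 0.5 threshold and the name `adaptive_apply` are arbitrary illustrations, not tuned or proposed API.

```python
import pandas as pd

def adaptive_apply(series, func, max_unique_fraction=0.5):
    """Use the cached map when unique values are rare; fall back to apply."""
    if series.nunique() < max_unique_fraction * len(series):
        mapping = {v: func(v) for v in series.unique()}
        return series.map(mapping)
    return series.apply(func)
```

`nunique()` is itself a full pass over the data, so the check only pays off when `func` is expensive relative to a hash-based count.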
I'm not sure as to the best implementation, but something along the lines of:
This works best for one-to-one operations; I'm unsure how well it would work for one-to-many or many-to-one applications.
It could possibly be extended by using the built-in merge operations of pandas and a multi-indexed series for the result?