Open has2k1 opened 7 years ago
Thank you for building this package, it's great for me so far.
There's noticeable sluggishness for me when aggregating. I want to make the operations faster but keep the original dataframe unmodified (since I don't want to have to reload it if I mess up). After looking at the docs here's what I came up with:
truncate_to_hour = lambda ts: ts.replace(minute=0, second=0, microsecond=0)
options.set_option('modify_input_data', True)
df = (stocks.copy() >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
open = 'first(open)', close = 'last(close)',
volume = 'sum(volume)', trades = 'sum(trades)')
)
options.set_option('modify_input_data', False)
Maybe this feature could be something like the following?
with plydata.inplace():
stocks.copy() >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
open = 'first(open)', close = 'last(close)',
volume = 'sum(volume)', trades = 'sum(trades)')
Or maybe just something like this?
inplace(stocks.copy()) >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
open = 'first(open)', close = 'last(close)',
volume = 'sum(volume)', trades = 'sum(trades)')
I was thinking mutate
instead of inplace
but mutate is already a verb of course.
There is already a way to use a context manager with the options
from plydata.options import options
with options(modify_input_data=True):
stocks.copy() >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
open = 'first(open)', close = 'last(close)',
volume = 'sum(volume)', trades = 'sum(trades)')
I think that is also not convenient enough. Wrapping the dataframe i.e inplace(stocks.copy())
and doing "magic" base on that is not an option. So far the best option is still as noted in point 2 above, it does away with the >>
operator which may be a bonus.
Thanks for the link, using a context manager is much better than what I came up with.
Personally, I like that you borrowed the >>
syntax from magrittr/dplyr. It denotes piping the data explicitly, and it's easy to translate "thinking in dplyr" into Python. However, I don't find it particularly elegant in Python due to its interaction with newlines. For example, I'm unable to run your example without either wrapping the plydata statements in parens (as in my first example above) or continuing the statement with \ at the end of each line. Otherwise I get this error:
stocks.copy() >>
^
SyntaxError: invalid syntax
Is that why you're saying that it's a bonus to do away with >>
by creating the ply()
function? Or some other reason?
By the way, I notice you're the only contributor. What kind of contributions (if any) would be helpful to you/this project right now?
The ply()
method will not replace the >>
operator, it will be an alternative albeit with better performance.
You have to wrap multi-line statements in parens, and when I do this I always put >>
at the beginning of the line (except when I slip up). Examples here and here.
Overloading the >>
operator is "abusing python" so if you care for such things you have a way out.
The API is rather complete, except if dplyr comes up with anything worth stealing or there is a solution to a common Pandas annoyance that is appropriate for this library. The way I have done it so far is to implement the features when I recognise/need the convenience they bring.
The pressing issues are those flagged as enhancement, at the moment all have to do with performance. I think the 2 points raised by this issue would bring the most improvement. However, I want to have proper bench-marking in place first. Wherever there is an option, I do not want to sacrifice performance for convenience.
ply
method (part 2 of the issue) has been added. Need to think about part 1.
Limit the creation of group_by dataframes. There are cases where the group indices are sufficient, e.g. for the summarise verbs. This may be possible for all verbs that cannot nest other verb operations, i.e not possible with
do
.Provide a way for a whole pipeline of manipulations to limit the copying of dataframes. Only the input dataframe needs to be preserved. The current way (
modify_input_data
option) that approaches this effect is cumbersome and the user has to provide a copy of the input data. Maybe with a function.