Memory efficiency - Githubissues

has2k1 commented 7 years ago

Limit the creation of group_by dataframes. There are cases where the group indices are sufficient, e.g. for the summarise verbs. This may be possible for all verbs that cannot nest other verb operations, i.e not possible with do.
Provide a way for a whole pipeline of manipulations to limit the copying of dataframes. Only the input dataframe needs to be preserved. The current way (modify_input_data option) that approaches this effect is cumbersome and the user has to provide a copy of the input data. Maybe with a function.

ply(
    data,
    define(y='x'),
    define(z='2*y'),
    ...
)

keeler commented 6 years ago

Thank you for building this package, it's great for me so far.

There's noticeable sluggishness for me when aggregating. I want to make the operations faster but keep the original dataframe unmodified (since I don't want to have to reload it if I mess up). After looking at the docs here's what I came up with:

truncate_to_hour = lambda ts: ts.replace(minute=0, second=0, microsecond=0)

options.set_option('modify_input_data', True)
df = (stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')
)
options.set_option('modify_input_data', False)

Maybe this feature could be something like the following?

with plydata.inplace():
    stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')

Or maybe just something like this?

inplace(stocks.copy()) >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
          open = 'first(open)', close = 'last(close)',
          volume = 'sum(volume)', trades = 'sum(trades)')

I was thinking mutate instead of inplace but mutate is already a verb of course.

has2k1 commented 6 years ago

There is already a way to use a context manager with the options

from plydata.options import options

with options(modify_input_data=True):
    stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')

I think that is also not convenient enough. Wrapping the dataframe i.e inplace(stocks.copy()) and doing "magic" base on that is not an option. So far the best option is still as noted in point 2 above, it does away with the >> operator which may be a bonus.

keeler commented 6 years ago

Thanks for the link, using a context manager is much better than what I came up with.

Personally, I like that you borrowed the >> syntax from magrittr/dplyr. It denotes piping the data explicitly, and it's easy to translate "thinking in dplyr" into Python. However, I don't find it particularly elegant in Python due to its interaction with newlines. For example, I'm unable to run your example without either wrapping the plydata statements in parens (as in my first example above) or continuing the statement with \ at the end of each line. Otherwise I get this error:

stocks.copy() >>
                    ^
SyntaxError: invalid syntax

Is that why you're saying that it's a bonus to do away with >> by creating the ply() function? Or some other reason?

By the way, I notice you're the only contributor. What kind of contributions (if any) would be helpful to you/this project right now?

has2k1 commented 6 years ago

The ply() method will not replace the >> operator, it will be an alternative albeit with better performance.

You have to wrap multi-line statements in parens, and when I do this I always put >> at the beginning of the line (except when I slip up). Examples here and here.

Overloading the >> operator is "abusing python" so if you care for such things you have a way out.

The API is rather complete, except if dplyr comes up with anything worth stealing or there is a solution to a common Pandas annoyance that is appropriate for this library. The way I have done it so far is to implement the features when I recognise/need the convenience they bring.

The pressing issues are those flagged as enhancement, at the moment all have to do with performance. I think the 2 points raised by this issue would bring the most improvement. However, I want to have proper bench-marking in place first. Wherever there is an option, I do not want to sacrifice performance for convenience.

has2k1 commented 6 years ago

Benchmarks

has2k1 commented 4 years ago

ply method (part 2 of the issue) has been added. Need to think about part 1.

has2k1 / plydata

Memory efficiency #4