pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org

workspace snapshot like R's save.image() #12381

Closed · sandys closed this issue 8 years ago

sandys commented 8 years ago

There are a couple of questions about this on Stack Overflow: http://stackoverflow.com/questions/35465534/how-do-i-save-the-entire-workspace-in-pandas-like-rdata http://stackoverflow.com/questions/12504951/save-session-in-ipython-like-in-matlab

Even saving an individual dataframe is a bit confusing - http://stackoverflow.com/questions/17098654/how-to-store-data-frame-using-pandas-python
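
For reference, a minimal sketch of persisting a single DataFrame (the file names are made up for illustration; `to_hdf` additionally requires the PyTables package):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Exact round-trip of one object via pickle.
df.to_pickle("df.pkl")
same = pd.read_pickle("df.pkl")

# Or via HDF5 (needs the `tables` package installed).
df.to_hdf("df.h5", key="df")
back = pd.read_hdf("df.h5", key="df")
```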

From IRC, the following comments were made:

The recent news that Pandas is collaborating with Apache Arrow is encouraging - but what is going to be the long-term official way to persist the workspace?

A lot of workflows come from R, where the entire workspace is snapshotted (and backed up to S3, etc.). This is perfectly production-ready in R - but in Pandas a lot of creative invention is needed to make this work (for example - http://cyrille.rossant.net/moving-away-hdf5/).

jreback commented 8 years ago

you seem to be a bit misinformed.

Of course this issue is not about that at all; rather it is about persisting a workspace, which is in Jupyter's domain - or, if you want a desktop app, use Spyder.

So I'm not really sure what you are asking from pandas. We have quite a suite of IO compatibility (see the docs here); you as the user need to choose how to use it.

sandys commented 8 years ago

oh wow - thanks for the info. There have been a lot of retweets of https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces87 with the following quote:

"Arrow's cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack," said Wes McKinney, creator of Pandas.

BTW - none of these are my claims; this is info that I have gotten from people on pydata IRC channels while attempting to move from R to Pandas. So your comment about my being misinformed is perhaps a consequence of widespread misinformation.

Just to be clear - I'm not talking about persisting a workspace configuration. I am talking about persisting the data frames inside a pandas session (which could even be 8-10 GB). For example, here's a snippet of how you can actually load a persisted RData dump into Pandas dataframes.
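
(The snippet itself isn't preserved in this thread. As one hedged illustration, the third-party pyreadr package can read an .RData file into pandas DataFrames; the package choice and the file/object names below are assumptions, not from the original post.)

```python
import pyreadr

# Read every object saved in the .RData file; returns an
# OrderedDict mapping R object names to pandas DataFrames.
result = pyreadr.read_r("workspace.RData")   # hypothetical file name
df = result["my_dataframe"]                  # hypothetical object name
```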

I'm trying to figure out a production flow for Pandas where I need to have 8-10 GB of data in memory and then incrementally update it every hour. We do this in R by snapshotting all the data frames in one shot, backing them up to S3, and then continuing. If our machine fails (on AWS, for example), we restore the last snapshot from S3 and just resume. So in that respect, I think I'm close to Cyrille's blog.
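
(A rough sketch of that snapshot/restore loop in pandas, assuming an HDF5 file plus boto3 for the S3 round trip; the function, bucket, and key names are illustrative, not an established pandas API.)

```python
import boto3
import pandas as pd

def snapshot(frames: dict, path: str, bucket: str, key: str) -> None:
    """Write every DataFrame in `frames` to one HDF5 file, then upload it to S3."""
    with pd.HDFStore(path, mode="w") as store:
        for name, df in frames.items():
            store.put(name, df)
    boto3.client("s3").upload_file(path, bucket, key)

def restore(path: str, bucket: str, key: str) -> dict:
    """Download the latest snapshot and load every stored DataFrame back."""
    boto3.client("s3").download_file(bucket, key, path)
    with pd.HDFStore(path, mode="r") as store:
        return {k.lstrip("/"): store[k] for k in store.keys()}
```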

Is that a feature request that could be accepted?

jreback commented 8 years ago

@sandys what exactly are you requesting?

HDF5 is IMHO THE best solution for something like this.

Cyrille's solutions IMHO just create another set of issues. Multi-user, multi-source systems are not trivial to deal with.

sandys commented 8 years ago

Jeff, I'm requesting something like R's save.image(), which works something like dill (a Python library for saving ALL the data structures in memory).

Saving an individual data frame works perfectly, but saving everything is a little tricky (for a data scientist... I'm sure Pandas developers find this pretty easy).
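
(For what it's worth, dill does expose session-level saving close to what's described here - a minimal sketch, with an arbitrary file name:)

```python
import dill

# Persist the whole interpreter session (the globals of __main__),
# roughly analogous to R's save.image().
dill.dump_session("session.pkl")

# Later, in a fresh interpreter, restore everything:
dill.load_session("session.pkl")
```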

It is one of the most productivity-enhancing commands in R.


jreback commented 8 years ago

@sandys then use dill, or please put in a request to Spyder and/or Jupyter. This is out of scope for pandas.

Python is about being explicit, while R has somewhat of an opposite philosophy.

sandys commented 8 years ago

Thanks for the reply. Will do - just wanted to highlight one last time that the reason I believe this should be part of pandas is that this is a fairly common production flow, especially in the context of the "cloud" (where anything may fail at any moment).

I hesitate to file it on Jupyter because I'm not really looking at it as a workspace configuration but rather as a data-snapshot strategy.

Thanks!

TomAugspurger commented 8 years ago

Agreed with @jreback that this doesn't feel very pythonic. Even some R people seem down on workspace images.

sandys commented 8 years ago

hi Tom, just trying to reply to you - https://stat.ethz.ch/pipermail/r-sig-db/2013q1/001272.html

There are tons of use cases where models are run in production using snapshots. In a lot of ways, it can be argued that Docker snapshots are un-Linux-like (exactly what a lot of people from the Puppet or Chef world argue about system composability).

But it is a godsend for production. If I had to guess, I would say the vast majority of R production flows work like this.

jreback commented 8 years ago

@sandys I appreciate that you want to make it 'easy', but in reality you should have a cleaner / reproducible workflow. Simply saving a workspace IMHO is not a good answer, except in one-off, small-scale situations.

> But it is a godsend for production. If I had to guess, I would say the vast majority of R production flows work like this.

Not to inflame language wars, but this is just a recipe for disaster. Sure, it may work right now, but it is not secure, resilient, or reproducible.

sandys commented 8 years ago

hi Jeff, I grant you that I might be totally wrong - in which case, I am trying to find out how people build a 10 GB model that gets updated incrementally (every 6 hours, based on new data). What happens if the machine breaks down and I have to spin up another server? I don't want to recompute the 10 GB.

Do we use a database with Pandas to persist the model? I'm OK with that, but I can't find any real documentation on what the best practices around that are. If I may say so, the only forms of serialization are the disk-based ones (that you linked).

So I'm really confused, and wondering if the canonical recommended way is to recompute every time. Is that something that can stand up to the conditions of cloud servers?
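
(One database-backed sketch of the flow being asked about, using pandas' SQL round trip through SQLAlchemy; the SQLite URL and table name are placeholders.)

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///state.db")  # any SQLAlchemy URL works

# Persist the current state to the database...
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_sql("model_state", engine, if_exists="replace", index=False)

# ...and on a fresh machine, read the persisted state back.
restored = pd.read_sql("model_state", engine)
```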


jreback commented 8 years ago

@sandys this is a vast and complicated topic. You should start simple and see if that serves your needs. 10 GB, to be honest, is not that big nowadays; you could easily do this with HDF5 - e.g., saving intermediate state and then appending - or use a db, or even a flat file and recompute. As the data owner you will have to weigh the costs/benefits, and especially the complexity, here. Using an opaque tool to do this (like save.image()) does not really help you at all.
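
(A small sketch of the "save intermediate state, then append" pattern mentioned above, using HDF5's table format; the file, key, and data are invented for illustration, and PyTables is required.)

```python
import pandas as pd

# Initial snapshot: write the base state once in the appendable "table" format.
base = pd.DataFrame({"ts": pd.date_range("2016-02-18", periods=3, freq="h"),
                     "value": [0.0, 1.0, 2.0]})
with pd.HDFStore("state.h5", mode="w") as store:
    store.put("main", base, format="table")

# Each incremental update: append only the new rows, never recompute the rest.
update = pd.DataFrame({"ts": [pd.Timestamp("2016-02-18 03:00")],
                       "value": [3.0]})
with pd.HDFStore("state.h5") as store:
    store.append("main", update)
    restored = store["main"]  # the full state survives a process restart
```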