vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

Jupyter Notebook file size #249

Closed dyuval closed 6 years ago

dyuval commented 7 years ago

Hi, first of all I really enjoy using altair. I find it really helpful for creating charts of aggregated statistics over different periods of a time-series. However, using it in Jupyter notebooks results in very large files (about 47MB in a notebook rendering only a single chart).

I wonder if the cause of the issue is the size of the data frame I'm using as input - around 67,000 rows. (Note that the aggregation results in a simple bar chart with about 10 bars)

Is there a way to limit the file size of a chart?

Thanks!

jakevdp commented 7 years ago

Hi! There are two ways to specify data in a chart: using a Pandas dataframe (in which case the data itself is converted to JSON and embedded in the notebook), or using a local or web URL reference (in which case the data is loaded into javascript at runtime; e.g. this). The advantage of the first is portability: it will work anywhere, but at the expense of embedding a potentially large amount of data in JSON form. The advantage of the second is that no data is embedded, but you lose portability: if the URL you use is not visible to your notebook or other plot viewer, you won't be able to display the plot.
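
For concreteness, here is a minimal sketch of the two modes (written against the Altair 1.x API discussed in this thread; the file name data.json is just an example):

import pandas as pd
from altair import Chart

df = pd.DataFrame({'x': range(10), 'y': range(10)})

# Option 1: embed -- the rows are serialized to JSON inside the chart
# spec, so the notebook is portable but grows with the data.
chart = Chart(df).mark_point().encode(x='x', y='y')

# Option 2: reference by URL -- only the path is stored in the spec and
# the browser loads the data at render time. Encoding types (':Q') must
# be given explicitly, since Altair cannot inspect dtypes it never sees.
df.to_json('data.json', orient='records', date_format='iso')
chart = Chart('data.json').mark_point().encode(x='x:Q', y='y:Q')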

Of course, the other option is to export to PNG or SVG, in which case you no longer need access to the data at all. Details on that are here.

jakevdp commented 7 years ago

[Note to self: add this to a FAQ in the documentation for 1.2 release]

jakevdp commented 7 years ago

One thing comes to mind: for this situation, we could have an optional flag in ipyvega that would remove the JS output and replace it with the rendered PNG automatically. @ellisonbg, how hard would that be to add? Maybe something like vega.set_png_output()?

dyuval commented 7 years ago

Thanks for the quick reply Jake.

Using the second input option you mentioned solved my problem. I saved the dataframe to a local JSON file using pandas' to_json function (specifying orient='records' and date_format='iso'), and called Chart() with the path to the file. Thanks!

jakevdp commented 7 years ago

Glad that worked for you! I think it would be a good idea to streamline this somehow, because it's come up a couple times.

pteehan commented 7 years ago

I just hit this problem. I was trying to write to files under /tmp/, but curiously Altair seemed unable to read from them. In the end I just wrote to a temporary JSON file in the current path, as @dyuval suggested. Here's my workaround. It's a bit hacky, but it minimizes the extra code when generating a plot.

# this goes at the top of the notebook
import pandas as pd

def to_altair(x):
    # Dump the DataFrame to a local JSON file and return its path,
    # so that Chart() receives a URL instead of embedded data.
    x.to_json('chart.json', orient='records', date_format='iso')
    return 'chart.json'

pd.DataFrame.to_altair = to_altair  # attach to DataFrame objects

Now you can do this:

Chart(df.to_altair()).mark_point()  # ...etc.

jakevdp commented 7 years ago

Thanks @pteehan. If you have thoughts on how to make this sort of thing more convenient within Altair itself, please let us know.

pteehan commented 7 years ago

Maybe Chart(data, embed=False)?

If embed is False and data is a data frame, you could internally call to_json and pass the result along as a StringIO object. Or better yet, just drop the data altogether, since the conversion might be expensive.
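
Roughly the behavior I'm imagining (make_chart here is just an illustrative wrapper, not a proposed Altair API):

import pandas as pd
from altair import Chart

def make_chart(data, embed=True):
    # embed=True: current behavior -- rows are serialized into the spec.
    if embed or not isinstance(data, pd.DataFrame):
        return Chart(data)
    # embed=False: write a sidecar file and pass only its path along.
    path = 'chart_data.json'
    data.to_json(path, orient='records', date_format='iso')
    return Chart(path)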

I would even advocate for embedding being turned off by default. As a notebook user I almost never want the data embedded. If you have a data frame of any appreciable size and you make a plot, the file size gets so large that it can crash the browser; this is very easy to do, and a big problem for usability. Portability of the chart object is not a concern for me, because the chart can be re-generated by executing the notebook.

jakevdp commented 7 years ago

Thanks.

Where do you envision the data being stored in the embed=False case? A temporary CSV file in the working directory, perhaps?

For users, I anticipate one area of confusion: if you use embed=False, the resulting figure/notebook is no longer portable unless you know which local CSV file you need to include with it.

jakevdp commented 7 years ago

A StringIO object won't work, because the browser's Javascript needs to access the data in order to show the plot, either via embedded JSON, a local CSV URL, or a CSV file at an http URL.

jakevdp commented 7 years ago

One thing that may work would be to modify the notebook server somehow so that it serves the CSV file at a particular URL via a StringIO-type construct, without actually creating that file on disk. I'm not sure whether Jupyter is set up to allow that kind of thing.
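
Something like the following is what I have in mind (a rough sketch using Tornado, which the notebook server is built on; whether Jupyter exposes a clean way to register such a handler is exactly the open question):

from tornado import web

_DATASETS = {}  # name -> CSV text held in memory, never written to disk

class InMemoryDataHandler(web.RequestHandler):
    def get(self, name):
        # Serve the in-memory CSV at a stable URL for the JS renderer.
        if name not in _DATASETS:
            raise web.HTTPError(404)
        self.set_header('Content-Type', 'text/csv')
        self.write(_DATASETS[name])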

jakevdp commented 7 years ago

I added a FAQ discussing this in #255. Until we have a good technical solution, that should help things. Let me know if anything should be added.

pteehan commented 7 years ago

Glad to see you giving this some attention. PNG-only mode, as you've suggested in #258, makes a lot of sense to me. I would also advocate for that being the default behavior in Jupyter, if that's at all possible.

There are two other disadvantages to the 'write to disk' workaround that you didn't mention in the FAQ: first, it's extra code you have to write every time; second, you have to pay the read/write penalty, which could be significant (30 seconds, maybe?) for large datasets.

The question of portability is interesting, because I think we have different perspectives. If you had a notebook that only read from a file and produced a plot, then it would be a problem: you could not re-generate the plot unless you had the data file. But if you have a notebook that loads a dataset into memory, then writes to a temp file, then generates a plot from that file, there's no need to bring that temp file along; you can just re-run the upstream code if you need to re-generate the plot. And if you are distributing or converting the notebook, there is no need for the source data, or to re-generate, so long as the plot output is already present. So I don't see a portability problem in this case.

Thanks. I am a big fan of Altair, by the way. I think it has huge potential.

jakevdp commented 7 years ago

Great thoughts. I'll try to push on the png-only thing, but I suspect @ellisonbg is the one who really knows how to get that done, as it will require some hacking on the Jupyter side.

rgbkrk commented 7 years ago

One thing that I'd like to think about is payloads for display data in the notebook format being remote resources. Normally I think of this for image data (gigantic base64 blobs), which has been especially painful for SageMathCloud and real-time collaboration (as stated by @williamstein). It makes me wonder about these large JSON payloads as well. We talked about supporting URLs or URIs for image content in Jupyter not too long ago; I want to surface that again now that I'm indexing notebooks and digging deeper into notebook collaboration.

As an aside, it's too bad that there's not some reduction step that could be performed for these larger datasets. A lot of the information is lost once it gets turned into raster by vega embed.

/cc @minrk

pteehan commented 7 years ago

From my point of view the plot is the reduction step. If I knew how to represent the data in condensed form I wouldn't need a plotting library. The difference in perspectives is really interesting here. I hope you don't mind if I expand on this a bit.

I come from an R / ggplot background, and I am used to its powerful, expressive API, which lets me generate complex plots at close to the speed of thought. 'Histogram', 'boxplot', 'trellised scatterplot with points varying by color, log scale on the y axis, and a smoothed curve' are all very easy once you learn the API, and at that point you gain the ability to get into 'flow', where you think about the data and its underlying meaning rather than about your plotting library. It is hard to overstate how valuable this is, especially when doing exploratory analysis.

Python has lacked a similarly 'speed of thought' plotting library. Matplotlib and Seaborn are too imperative. Pandas has nice plotting functionality, but it is limited. yhat's ggplot port is close, but some features are missing and I find it hard to depend on. The great promise of Altair is that its API is sufficiently flexible and expressive to support flow, and potentially even more powerful than ggplot, especially with the new releases you're planning.

The problem, though, is that the second I need to think about whether my data is too big to plot, or whether it will cause my notebook size to blow up, or where/how I'm going to write the data locally so I can plot it, I've lost my flow, and you've lost me as a user; I'm reaching for ggplot instead, and if that doesn't work, I'm reaching for R. I'm confident you folks will come up with an appropriate solution to this particular problem. But in a general sense I would like to advocate for 'flow' as a design criterion to guide this type of technical discussion. Another word for it is 'reduced cognitive load'. To me it is Altair's main selling point and the main gap in the current Python ecosystem that needs to be filled.

For this reason I see this issue as an absolute showstopper, and I consider a plotting library that embeds a copy of its data in the Jupyter notebook unsuitable for analysis workflows. Even with tiny datasets (1 MB), a notebook with ten plots is now 10 MB, and a repo with 50 such notebooks is now 0.5 GB, and I have to start thinking about this and managing it every time I want to make a plot; again, you've lost me. With a modest dataset of 50 MB it is literally unusable unless I spend extra effort managing local copies of files. For me the cognitive effort of working around this cancels out the benefits of the simple API.

Thanks, I hope this is helpful.

jakevdp commented 7 years ago

Hi @pteehan – I really appreciate you taking the time to detail these thoughts. I think it's fair to say this is one of the highest-priority issues at this point :smile:

rgbkrk commented 7 years ago

@pteehan - yes, yes, yes! Thank you for writing up the user story and mission. It summarizes why I'm using Altair now: declarative, and more natural for me to think in.

The goal for what I wrote up would be that it's automatic and unseen by the users. We have some architectural pieces to address (in Jupyter) to make it easier on the libraries and in turn for the users.

jakevdp commented 7 years ago

I just chatted with @ellisonbg and he also mentioned some Jupyter infrastructural pieces that would have to happen in order for us to do something about this. He's going to bring up the required API changes at the next Jupyter developer meeting.

rgbkrk commented 7 years ago

Great, looking forward to it. Excellent topic for next week.

jakevdp commented 7 years ago

Just some technical detail: we don't have any way to render PNGs from Python code, so they have to be rendered with JS. That JS call can happen via nodejs, or by rendering the plot in the browser.

The first option is not great because it introduces some relatively painful dependencies (though we make use of it, optionally, with the chart.savechart() methods).
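
For reference, that path looks roughly like this (assuming nodejs and the vega command-line tools are installed):

# Render to PNG without any browser involvement; this shells out to
# the nodejs-based vega tooling under the hood.
chart = Chart(df).mark_bar().encode(x='x', y='y')
chart.savechart('chart.png')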

The second option is tough because it requires some gymnastics in the browser: render the plot in JS, save it as a PNG, then replace the JS output with the PNG output. Jupyter's API is not well set up to do that kind of thing now, but that type of API hook could probably be added if done carefully.

minrk commented 7 years ago

@rgbkrk it sounds like you are referring more to the delivery than the storage.

From my point of view on the Jupyter roadmap, I see this as part of the document state moving to the server / real-time stuff. Once document state is on the server, it should be relatively simple to serve some outputs with URLs and others passing through.

Of course, we can do this now if patience is running out, but it would be more complex because the server doesn't know about references to outputs, and would have a harder time garbage collecting the outputs that are no longer in use. If I recall, SMC does this with a global (service-wide?) content-addressable data store that purges outputs that haven't been accessed after a period of time. We could do something similar, but I think it's easier to do at a larger service-level than a local application.

But that's a bit beside the point of notebook file size, because this wouldn't change the content of the notebook once it's ultimately at rest, just the performance of loading/saving/working with big ones. If we want to explode notebooks, that's a bigger deal, I think, and one with tradeoffs that I have a hard time weighing.

williamstein commented 7 years ago

If I recall, SMC does this with a global (service-wide?) content-addressable data store that purges outputs that haven't been accessed after a period of time.

If you save the notebook to disk, the content becomes permanent; otherwise, it is deleted after a day. That's it. The actual data is stored long-term in a globally accessible Google Cloud Storage bucket, which costs between $0.01 and $0.027/GB/month, depending on access parameters.

For Jupyter you could serve images from the jupyter notebook server, maybe stored in a local sqlite database, and just delete that local database when the server is stopped. You might still include the images -- exactly as you do now -- in the .ipynb file whenever it is saved to disk. The key point is to never send a big base64-encoded image to the browser as part of an output message; instead, send some sort of http reference to the browser client, and let the client load from that source directly. When you actually save the jupyter notebook to disk -- which is usually done by the server on the backend -- just put the images back in if you want. That would keep things 100% compatible with past versions of Jupyter for now.

Anyway, your point about clearly separating "exploding notebook sizes" from "unusable notebooks in the browser" is a good one!

usmcamp0811 commented 7 years ago

The linking-to-a-file method seems to have broken for me. The day before yesterday I made some plots that worked, but today the link doesn't seem to work, and the file hasn't moved... [screenshot]

saulshanabrook commented 6 years ago

I am curious whether the possible approaches to solving this have changed at all. @jakevdp's earlier comment suggested the possibility of rendering a PNG on the server and then sending it to the client to embed in the notebook. So could this be implemented by:

This would have the advantage of not having to transmit the full dataset to the client. I am currently doing most of my preprocessing with pandas before sending it to altair, but I would much prefer to do all the filtering/aggregating/computing in altair, so that it is more declarative.

jakevdp commented 6 years ago

We've been discussing this within the context of the new display architecture that's now on master.

One possibility, which I think we will make available soon, would be to save the data to a local file, and have Jupyter load that local file when rendering the visualization. Of course, if the notebook is sent to another location, that data won't be available, but presumably the notebook itself will contain the recipe to re-create that data if it is re-run. And the current display code already does embed a png of the visualization in the notebook, so that could be used as a stand-in for the full rendering if the data file is not available.
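
In data-transformer terms, the idea is something like this (a sketch against the v2 transformer API, assuming the top-level alt import; the names here are illustrative):

import altair as alt

def json_file(data, filename='altair-data.json'):
    # Write the DataFrame to disk and hand Vega-Lite a URL
    # instead of embedding the rows in the spec.
    data.to_json(filename, orient='records', date_format='iso')
    return {'url': filename, 'format': {'type': 'json'}}

alt.data_transformers.register('json_file', json_file)
alt.data_transformers.enable('json_file')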

ellisonbg commented 6 years ago

We now have this approach working for vegalite 2 on altair master (enable it with vl.data_transformers.enable('json')). From playing around with it, it speeds things up quite a bit and is successful in reducing notebook size. I will likely use this most of the time in my own teaching and usage.

saulshanabrook commented 6 years ago

@ellisonbg Very cool! I am only getting it showing up as <VegaLite 2 object> in jupyterlab. What am I missing?

[screenshot]

EDIT: If I put the vl.data_transformers.enable('json') after vg.renderers.enable('default'), then it renders as JSON:

[screenshot]

EDIT 2: I upgraded my JupyterLab to the latest (jupyterlab: 0.31.2-py36_0 conda-forge --> 0.31.8-py36_1 conda-forge) and now it won't render the JSON, it will only show <VegaLite 2 object> no matter which order I put them in.

ellisonbg commented 6 years ago

With JupyterLab 0.31, you have to install a separate extension:

jupyter labextension install @jupyterlab/vega3-extension

The next release will have this built in.

saulshanabrook commented 6 years ago

@ellisonbg Wow this works! I can now easily display datasets with around a million rows!

The only thing I had to do was make sure I was starting the jupyter server from my home directory; otherwise it was calculating the path wrong.

For the record, I was able to get it working with a Dockerfile like this:

FROM jupyter/scipy-notebook
RUN conda install -y -c conda-forge jupyterlab=0.31.8
RUN jupyter labextension install @jupyterlab/vega3-extension
RUN pip install git+https://github.com/altair-viz/altair

CMD start.sh jupyter lab

and a script like this:

import altair.vegalite.v2 as vl
from altair.vegalite.v2 import api as alt

vl.data_transformers.enable('json')

# repairs_with_crimes is a pandas DataFrame defined earlier in the notebook
alt.Chart(
    repairs_with_crimes[['rel_days', 'is_day']]
).mark_line().encode(
    x=alt.X('rel_days:Q', bin=alt.BinParams(step=1)),
    y='count(*):Q',
    color='is_day'
)

ellisonbg commented 6 years ago

The json data transformer can be configured to work with different notebook starting paths... I will post later with an example...

jakevdp commented 6 years ago

Sounds awesome @saulshanabrook!

Thanks for making this work so seamlessly, @ellisonbg!

jakevdp commented 6 years ago

This is fixed as of JupyterLab 0.32.

jakevdp commented 6 years ago

This can now be addressed by running

alt.data_transformers.enable('json')

which will save datasets to file and reference them in the notebook by URL.
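
For reference, a complete minimal session under that setting (assuming the modern top-level import; the DataFrame here is illustrative):

import altair as alt
import pandas as pd

# Save each chart's data to a sidecar JSON file and reference it
# by URL, rather than embedding the rows in the notebook itself.
alt.data_transformers.enable('json')

df = pd.DataFrame({'step': range(100000), 'value': range(100000)})
alt.Chart(df).mark_line().encode(x='step:Q', y='value:Q')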