vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

xarray support #891

Open bocklund opened 6 years ago

bocklund commented 6 years ago

Is it possible or planned that xarray be supported natively?

You can't pass Datasets or DataArrays directly to Altair. Using .to_dataframe() doesn't quite work either, because xarray creates a hierarchical index (a pandas MultiIndex), which is not supported.

However, you can reset the index to flatten out the DataFrame.

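import altair as alt

# calc_res is an existing xarray Dataset; select a single component
my_xr_dataset = calc_res.isel(component=1)

# Convert to a DataFrame, then flatten the hierarchical index into columns
df = my_xr_dataset.to_dataframe()
df.reset_index(inplace=True)

alt.Chart(df).mark_circle().encode(
    x='X',
    y='GM',
    color='Phase',
).interactive()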

You still have to slice up the Dataset yourself; otherwise the point limit quickly becomes restrictive.

Thoughts?

jakevdp commented 6 years ago

Yes, I think we should support this. It will involve some reworking of the data_transformer architecture that's currently being done in #887, and then adding an xarray transformer to the pipeline.

ellisonbg commented 6 years ago

I think that would be great, but it isn't clear how automatic it could be in general (for different types of xarrays).


shoyer commented 6 years ago

Calling .to_dataframe().reset_index() makes sense for most xarray.Dataset objects to transform them into tidy data. This is what we recommend with Seaborn, for example.
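As a rough illustration of the .to_dataframe().reset_index() route, here is a minimal sketch on a synthetic Dataset. The variable and coordinate names (GM, X, Phase) are borrowed from the snippet earlier in the thread; the values and phase labels are made up for the example.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the Dataset discussed in the thread.
# Names mirror the earlier snippet; the numbers are illustrative only.
ds = xr.Dataset(
    {"GM": (("X", "Phase"), np.arange(6, dtype=float).reshape(3, 2))},
    coords={"X": [0.0, 0.5, 1.0], "Phase": ["LIQUID", "FCC_A1"]},
)

# to_dataframe() returns a DataFrame indexed by a two-level MultiIndex
# built from the dimensions X and Phase.
df_indexed = ds.to_dataframe()

# reset_index() moves the index levels into ordinary columns, leaving
# one row per (X, Phase) combination, i.e. tidy (long-form) data.
df_tidy = df_indexed.reset_index()

print(df_indexed.index.nlevels)
print(sorted(df_tidy.columns))
print(len(df_tidy))
```

The resulting df_tidy has no hierarchical index, so it can be handed to alt.Chart like any other DataFrame.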

I don't think there is an unambiguous way to use an xarray.DataArray as input. This object is more similar to a pandas.Series in some ways (it usually represents a single variable) and a pandas.DataFrame in others. I would be inclined to raise here instead of guessing.

Visualizing xarray objects in altair could be awesome, but there are a couple of other challenges here, too:

benbovy commented 1 year ago

(Re-activating this discussion after seeing @mattijn's nice geopython 2023 talk)

File formats for representing the data. CSV isn't terribly efficient for ND-array data, and it's quite easy to run up against Altair's default 5000 line limit, which is only a 50x100 matrix. Saving netCDF in the browser could be interesting -- maybe with netcdf-js?

Maybe also worth looking at Zarr Javascript implementations (https://github.com/freeman-lab/zarr-js or https://github.com/gzuidhof/zarr.js/)?
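To make the row-limit arithmetic concrete, here is a small sketch. The helper names are hypothetical, and it assumes the limit is exceeded only when the row count is strictly greater than 5000.

```python
from math import prod

# Default row limit cited in the thread (assumed to be a strict "greater than" check).
ALTAIR_DEFAULT_MAX_ROWS = 5000


def tidy_row_count(shape):
    """Rows produced when a gridded array is flattened to one row per cell."""
    return prod(shape)


def exceeds_row_limit(shape, limit=ALTAIR_DEFAULT_MAX_ROWS):
    """True if the flattened array would have more rows than the limit."""
    return tidy_row_count(shape) > limit


for shape in [(50, 100), (51, 100), (10, 50, 100)]:
    print(shape, tidy_row_count(shape), exceeds_row_limit(shape))
```

A 50x100 grid sits exactly at the limit, and adding a single extra row or a third dimension pushes it over.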

Xarray is usually used for gridded datasets, but Altair only has limited support for these (only heatmaps). Contour and labeled image plots would be nice to have, too.

Contour and image plots would be nice indeed. Altair seems to now have many other features that would potentially be interesting to use with Xarray datasets (gridded data or vector data cubes), e.g., facet plots, maps, parallel coordinates or parameters in the forthcoming release.

mattijn commented 1 year ago

Thanks @benbovy! It would be great if we can push this forward; there are a few things at play here. Let me try to mention them briefly.

By design, Altair currently works with tabular data, so the only route that is possible right now is to translate your gridded data into a dataframe so you can use the x, y, and color encoding channels to create a heatmap. See e.g. https://altair-viz.github.io/gallery/simple_heatmap.html. This route is possible for small raster tiles only (but you get tooltips and can connect it to another chart that displays e.g. a timeseries for each x/y pixel, similar to this: https://altair-viz.github.io/gallery/select_detail.html). The number of unique data points will eventually bottleneck performance.
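A minimal sketch of that tabular route, written as a plain Vega-Lite specification built from a nested list so that it has no dependencies. The field names x, y and value and both helper names are assumptions chosen for the example.

```python
import json


def grid_to_records(grid):
    """Flatten a 2D nested list into one record per cell (long form)."""
    return [
        {"x": col, "y": row, "value": value}
        for row, line in enumerate(grid)
        for col, value in enumerate(line)
    ]


def heatmap_spec(grid):
    """Build a Vega-Lite heatmap: rect marks with x, y and color encodings."""
    return {
        "data": {"values": grid_to_records(grid)},
        "mark": "rect",
        "encoding": {
            "x": {"field": "x", "type": "ordinal"},
            "y": {"field": "y", "type": "ordinal"},
            "color": {"field": "value", "type": "quantitative"},
        },
    }


spec = heatmap_spec([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(json.dumps(spec, indent=2))
```

The Altair equivalent would be along the lines of alt.Chart(df).mark_rect().encode(x='x:O', y='y:O', color='value:Q'). Either way the spec carries one record per cell, which is why this only scales to small rasters.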

Having said that, what would be great is that we can push the isocontours transform forward. This transform works on native raster-alike data. The good thing is, this is already supported in Vega. For example see this very nice example: https://vega.github.io/vega/examples/annual-precipitation/.

If you look to the source input of the data: https://github.com/vega/vega/blob/main/docs/data/annual-precip.json you can see it is actually a flattened list including info on the shape. Perfect for not super large rasters.
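For illustration, a flattened list with shape info can be produced from a 2D nested list roughly as follows. The key names width, height and values reflect my reading of Vega's grid convention and should be checked against the linked file; the function names are hypothetical.

```python
def to_vega_grid(rows):
    """Pack a 2D nested list into a flat, row-major grid object.

    The keys "width", "height" and "values" are an assumption about the
    grid convention; verify against the linked annual-precip.json.
    """
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    return {
        "width": width,
        "height": len(rows),
        "values": [value for row in rows for value in row],
    }


def from_vega_grid(grid):
    """Inverse of to_vega_grid: rebuild the nested rows from the flat list."""
    width = grid["width"]
    values = grid["values"]
    return [values[i:i + width] for i in range(0, len(values), width)]


rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
grid = to_vega_grid(rows)
print(grid)
```

Compared with one record per cell, this form stores each value once without repeating field names, which keeps the JSON much smaller.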

So to make this isocontours transform available in Altair, it first needs to be integrated within Vega-Lite. Luckily, it has already been raised in https://github.com/vega/vega-lite/issues/6043 and, based on the number of emojis, this is considered a much-requested feature; help/PRs would surely be appreciated there.

But once there is support for isocontour transforms in Altair, the (flattened) raster data will still be in the JSON specification, since the isocontours are computed within JavaScript. For many occasions this will be fine, but for very large arrays this is no longer a useful approach either (very large JSON files), and at that point we have to precompute the isocontours to make sure the raw raster data is not within the JSON specification.

At this point https://github.com/hex-inc/vegafusion can come into play. VegaFusion is meant to work for very large datasets where there is an aggregation defined within the Altair chart specification. The core of VegaFusion is in Rust using Arrow and Arrow DataFusion.

If we can introduce support for the array interface protocol in Altair (I assume zarr supports this protocol?) using pyarrow (I think it supports arrays?) we could offer support for:

I noticed there is also a zarr protocol. Is this very different than the array protocol? When would you use it over the array protocol? Does it integrate with arrow?

I might miss other potential routes, so also open for these.

Again, thanks for bringing this back on the agenda!

jonmmease commented 1 year ago

Thanks for the ping @mattijn,

I'd love to see 2D density in Vega-Lite/Altair. It would take some thought, but I'm pretty confident we could support this in VegaFusion as well.

benbovy commented 1 year ago

Thanks @mattijn for the detailed and helpful explanations (I have to admit that I'm not familiar with Vega, Vega-lite nor Altair internals).

I noticed there is also a zarr protocol. Is this very different than the array protocol? When would you use it over the array protocol? Does it integrate with arrow?

I guess my suggestion of using zarr.js (or @shoyer's suggestion of using netcdf.js) was more if there is any need to efficiently transfer (chunked) n-d array data to the browser, possibly via writing the xarray dataset to a temporary zarr/netcdf dataset (similarly to vegafusion widget renderer's feather data transformer for dataframes), and then run some custom data loader or transformer within the browser to convert it into one or more Vega-lite compatible (tabular) datasets. However, I don't know if this makes sense at all. Perhaps easier is to simply define and run custom transformers on the server side?

jonmmease commented 4 months ago

Performance aside, here's an example of displaying regular rasters in Vega-Lite:

[Image: visualization. Link: Open the Chart in the Vega Editor]

The idea is that the raster element values would be flattened into row-major ordering and inserted into the spec as "data". Then params are used to define the width and height of the raster. A window function is used to add a column with the row number, and the row number and width/height are used to compute the position of each rect (the x, x2, y, y2 values).
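The position arithmetic described above can be sketched in plain Python. This mirrors the logic only; the actual expressions used in the linked chart are not shown in the thread, and this sketch uses a 0-based element index where the chart uses a window-computed row number.

```python
def raster_rects(values, width, height):
    """Compute rect corners for a row-major flattened raster.

    `values` holds width * height elements; element i sits in
    row i // width and column i % width.
    """
    if len(values) != width * height:
        raise ValueError("values must contain width * height elements")
    rects = []
    for index, value in enumerate(values):
        row, col = divmod(index, width)
        rects.append(
            {"value": value, "x": col, "x2": col + 1, "y": row, "y2": row + 1}
        )
    return rects


rects = raster_rects([10, 20, 30, 40, 50, 60], width=3, height=2)
for rect in rects:
    print(rect)
```

Each rect covers a unit cell, so the full set tiles a width-by-height area without gaps.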

Performance of this for large rasters won't be great, even combined with VegaFusion, since the dataset with 1 row per raster element will be sent to the client, and the client has to render each raster element individually. But I've been wondering if it would make sense for VegaFusion to support rendering rect marks like this to images on the server, so that the base64-encoded PNG would be sent to the browser instead of the underlying data. This would be much faster to render in the browser. It would remove any click/hover/tooltip interactivity, but this might be OK, since I'm not used to seeing tooltips for large rasters. Let me know if anyone has thoughts on this idea!
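To give a feel for "send a base64-encoded PNG instead of the data", here is a standard-library-only sketch that encodes a small raster as an 8-bit grayscale PNG data URL. It is not how VegaFusion works or would work; the function names and the min/max scaling are assumptions made for the example.

```python
import base64
import struct
import zlib


def _png_chunk(tag, data):
    """Assemble one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def raster_to_png_data_url(rows):
    """Encode a 2D nested list as an 8-bit grayscale PNG data URL."""
    height, width = len(rows), len(rows[0])
    lo = min(min(row) for row in rows)
    hi = max(max(row) for row in rows)
    span = (hi - lo) or 1  # avoid division by zero for constant rasters
    raw = bytearray()
    for row in rows:
        raw.append(0)  # PNG filter type 0 (no filtering) for this scanline
        raw.extend(round(255 * (v - lo) / span) for v in row)
    # IHDR: width, height, bit depth 8, color type 0 (grayscale), defaults
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    png = (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(bytes(raw)))
        + _png_chunk(b"IEND", b"")
    )
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")


url = raster_to_png_data_url([[0, 1, 2], [3, 4, 5]])
print(url[:40], "...")
```

The browser then receives a single image whose size depends on the pixel count and compression rather than on one JSON record per element, at the cost of per-element interactivity.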

joelostblom commented 4 months ago

That's a neat approach to supporting images! If a raster mark is eventually added in Vega-Lite, do you think that the VegaFusion solution would still be the higher performance option for images? Then it seems like it would be valuable to implement both to bring the functionality to altair sooner and to provide a high performance option long term.

jonmmease commented 4 months ago

do you think that the VegaFusion solution would still be the higher performance option for images?

I think it would be comparable. I'm not certain yet how the implementation of raster marks in Vega-Lite would work, but I expect the end result would be a Vega image mark that gets displayed. This is how the Vega heatmap transform works (see https://vega.github.io/vega/examples/density-heatmaps/). So what I was thinking about is whether we could go directly from the rect representation to the image mark using VegaFusion. I'm not totally convinced it's a good idea, but something I'm thinking about.

Another angle, that makes this somewhat independent of the raster mark discussion, is that VegaFusion could integrate with Avenger to make it possible to replace any mark with an image rendered on the server. So you could do something like alt.Chart().mark_line(image=True).encode(...) and the line mark would be rendered to an image in Python and only the image would be sent to the browser.

mattijn commented 4 months ago

Within Python-land we could use https://github.com/cogeotiff/rio-tiler to read arrays or images as tiles, and in combination with the positioning logic of https://github.com/vega/altair_tiles these tiles can be rendered using mark_image instead of rects, i.e. aligning ourselves with the TileMatrixSet standard.

While these references originate from the geo-world, I think they also cover Cartesian unprojected array data.
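For unprojected array data, the tiling idea reduces to cutting the array into a regular grid of pixel windows. The sketch below shows only that arithmetic; it does not use the rio-tiler or altair_tiles APIs, it is not an implementation of the TileMatrixSet standard, and the function name and the 256-pixel default are assumptions.

```python
def tile_bounds(width, height, tile_size=256):
    """Split a width x height array into regular tiles.

    Returns one dict per tile with its tile column/row and the pixel
    window [x0, x1) x [y0, y1); edge tiles are clipped to the array.
    """
    tiles = []
    for tile_row, y0 in enumerate(range(0, height, tile_size)):
        for tile_col, x0 in enumerate(range(0, width, tile_size)):
            tiles.append(
                {
                    "col": tile_col,
                    "row": tile_row,
                    "x0": x0,
                    "y0": y0,
                    "x1": min(x0 + tile_size, width),
                    "y1": min(y0 + tile_size, height),
                }
            )
    return tiles


tiles = tile_bounds(600, 300)
for tile in tiles:
    print(tile)
```

Each window could then be rendered to an image and positioned by its pixel bounds, so only the tiles in view need to be loaded.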

joelostblom commented 4 months ago

VegaFusion could integrate with Avenger to make it possible to replace any mark with an image rendered on the server. So you could do something like alt.Chart().mark_line(image=True).encode(...) and the line mark would be rendered to an image in Python and only the image would be sent to the browser.

This sounds like a really useful step to integrate with Avenger as you said and be able to provide Datashader-like functionality in Altair, which definitely is a direction that's exciting for me personally! That would also provide a unique value-add of this approach even if an image mark is added eventually in VL.

melonora commented 3 months ago

Hello there,

I am one of the developers of the SpatialData framework https://spatialdata.scverse.org/en/latest/. We are investigating the use of vega (or at least vega like) to store view configurations in the spatialdata zarr store that would allow as much as possible reproducing views across our visualization ecosystem (matplotlib, napari and soon vitessce). Is there any working group currently on xarray support that I could get involved in?

mattijn commented 3 months ago

Hi @melonora! Thank you for chiming in, there is currently not a working group on this topic.

For now, if you have any ideas or feel uncertain about some of these topics, please ask or share!

I recently added another comment on this related issue, which might be of interest to you as well: https://github.com/vega/altair/issues/3077#issuecomment-2102221395.

Again, thanks for joining this discussion! If there is anything we can do to assist in pushing this forward, please let me know!

joelostblom commented 3 months ago

@jonmmease and I were part of a brief discussion in a hackathon a couple of months ago with some other people from the scverse regarding using Vega-Lite/Altair in some of their subprojects. I'm guessing you are already aware of this @melonora (and maybe you were even there at the hackathon), but if not I can send a ping to the people we were in contact with to chime in here and see if any progress or plans were made.

melonora commented 3 months ago

Hi @mattijn (nice to e-meet you) and @joelostblom , I am indeed one of the people from scverse and was also in the initial calls. One thing that was noticed for the implementation is that we first required a refactor in the spatialdata-plot library.

For the short term, it seemed more approachable to subset the vega grammar and extend it with what we would need for our image plotting / visualization using matplotlib / napari / vitessce, and then see if / how we could feed that back into vega. This was more a decision about what we can do in the short term:) However, long term it would be nice to see whether we can have SpatialData visualization / plotting fully supported across our visualization ecosystem using vega / vega-lite / altair.

Do you have developer meetings in which we could come to a plan on how to approach this?

mattijn commented 3 months ago

Let's plan one! Can you reach out to me at mattijn[at]gmail.com with your email?

melonora commented 3 months ago

just sent you an email:)

mattijn commented 1 month ago

Cross-referencing raised issues as outcome of the next steps from the discussion below:


LLM summary of feature request:

Based on the Slack discussion, the idea of introducing a new mark type called mark_array in Altair (and consequently in Vega-Lite) to support labeled array data like xarray has a solid foundation. Here’s a proposal to develop this feature, leveraging insights and concerns from the thread:

Proposal for mark_array in Altair/Vega-Lite

Motivation

The current visualization options in Altair/Vega-Lite lack direct support for multidimensional array data such as those provided by xarray. This limitation necessitates cumbersome data transformations that can obscure the structure and meaning of the data. The mark_array aims to streamline the visualization process for labeled array data, providing a more intuitive and efficient approach.

Features and Capabilities

  1. Direct Input of Labeled Arrays:

    • Allow xarray datasets or data arrays to be directly passed to the mark_array function.
    • Ensure compatibility with array structures without needing conversion to DataFrame.
  2. Encoding Dimensions:

    • Utilize array dimensions directly in encoding, e.g., alt.Chart(data).mark_array().encode(x='longitude', y='latitude', color='temperature').
    • Support for multidimensional axes, facilitating complex data representations.
  3. Handling of Unlabeled Arrays:

    • For unlabeled arrays (like numpy arrays), provide a mechanism to name dimensions during chart creation, ensuring seamless integration.
  4. Efficient Rendering:

    • Internally optimize the rendering of large datasets, possibly by leveraging a raster argument or implementing efficient grid rendering techniques.
  5. Versatile Visualization Options:

    • Support various visual representations such as heatmaps, contour plots, and other raster-based visualizations.
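As a purely hypothetical illustration of items 2 and 3, the sketch below shows one way dimension names could be attached to an unlabeled N-dimensional array (here a nested list) to obtain records that existing encodings understand. mark_array does not exist in Altair or Vega-Lite; label_array, shape_of and the field names are invented for this example.

```python
from itertools import product


def shape_of(array):
    """Shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(array, list):
        shape.append(len(array))
        array = array[0]
    return tuple(shape)


def label_array(array, dims, value_name="value"):
    """Turn an unlabeled nested list into records keyed by dimension name."""
    shape = shape_of(array)
    if len(dims) != len(shape):
        raise ValueError("need exactly one name per array dimension")
    records = []
    for index in product(*(range(n) for n in shape)):
        cell = array
        for i in index:  # walk down one level per dimension
            cell = cell[i]
        records.append({**dict(zip(dims, index)), value_name: cell})
    return records


records = label_array(
    [[1, 2], [3, 4]], dims=("latitude", "longitude"), value_name="temperature"
)
for record in records:
    print(record)
```

With records like these, an encoding such as x='longitude', y='latitude', color='temperature' maps directly onto fields, which is essentially what the proposal hopes to do without the intermediate flattening step.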

Technical Considerations

Related Issues and Discussions

Summary of Slack Discussion

The Slack conversation highlighted various challenges and potential solutions for integrating multidimensional array data visualization in Altair. Key points include:

Next Steps

  1. Draft a detailed proposal and share it with the Altair and Vega-Lite communities for feedback.
  2. Collaborate with developers to prototype the mark_array feature.
  3. Conduct user testing to refine the implementation.
  4. Document the feature comprehensively, including usage examples and performance considerations.

By introducing mark_array, we can significantly enhance the capability of Altair/Vega-Lite to handle complex, multidimensional data natively, thereby broadening the scope of visualizations possible with these powerful libraries.

List of references of the URLs mentioned in the discussion:

  1. Blur-based heatmaps issue in Vega - GitHub
  2. xarray support issue in Altair - GitHub
  3. Support for array interchange protocols in Altair - GitHub
  4. Support for higher dimensional data in vl-contour - GitHub
  5. MNIST image example in Altair - GitHub
  6. Observable Plot: Raster mark - Observable
  7. Heat map - Wikipedia
  8. Heatmaps in Plotly - Plotly