holoviz / hvplot

A high-level plotting API for pandas, dask, xarray, and networkx built on HoloViews
https://hvplot.holoviz.org
BSD 3-Clause "New" or "Revised" License

CLI wrapped around hvplot.explorer #1150

Open ahuang11 opened 10 months ago

ahuang11 commented 10 months ago

With hvplot explorer soon supporting xarray/gridded datasets, I think the next logical step for hvplot explorer is a CLI (in addition to ideas from https://github.com/holoviz/hvplot/issues/1149).

From my experience, scientists call ncview or panoply in the terminal to do a quick validation on their datasets. This is useful and convenient because they don't have to:

  1. Create a new .py / .ipynb (or juggle with an old one, updating file paths), which can be tedious and repetitive
  2. Type all the imports, loading, and visualization code

Plus, these tools often support most legacy file formats.

The edge that hvplot has over these tools is probably:

  1. a newer looking interface
  2. interactivity on the map itself
  3. datashader support
  4. potentially easier setup (pip install hvplot geoviews)

I think it's valuable to wrap a CLI around hvplot explorer: not just a simple argparse one, but one that's super user-friendly, e.g. with auto-complete, so that new users are able to immediately jump in and start using it. Imagine if the auto-complete could complete the desired -x and -y from the file.

Additional discussion here: https://discourse.pangeo.io/t/do-you-use-panoply-ncview-other-command-line-viz-tool/3693/2

philippjfr commented 10 months ago

@Hoxbro and I were just discussing this and we both felt that there is a huge risk of scope creep here. The explorer is great functionality and fits in well in hvPlot itself because it is simply offering a UI around functionality we already provide. However as we get into the application and the CLI around it we start having to build out a lot of other functionality, including the CLI itself, the data loaders, a nice Panel template and application, and lots more. My feeling is that this should be shipped as a separate package entirely.

maximlt commented 7 months ago

I also believe this should be a separate project, and to gain adoption I'd say it'd need to be distributed as a standalone application (e.g. .exe on Windows). For sure that'd be an interesting project, i.e. picking one scientific domain and building a tool that solves practitioners' needs in that field. This is very likely out of scope of hvPlot though.

ahuang11 commented 6 months ago

I still think this is important. Personally, when I used to post-process model output, it was nice to do a quick check to ensure the data looked right. This meant I had to navigate to the data dir, e.g. cd models/output/run/1, pwd, copy the directory path and file name, and then paste it into a notebook. This was quite tedious, with lots of boilerplate code (import, read, plot) I had to type every time I wanted to verify output.

So, I imagine a very thin CLI wrapper around hvplot, with the kind defaulting to explorer if not specified:

hvplot test.nc -x lon -y lat -c air --groupby time

This internally invokes

import xarray as xr
import hvplot.xarray

ds = xr.open_dataset("test.nc")
ds.hvplot.explorer("lon", "lat", c="air", groupby="time").show()

Or: hvplot test.csv -x time -y temp --kind line

import pandas as pd
import hvplot.pandas

df = pd.read_csv("test.csv")
df.hvplot.line("time", "temp")
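A minimal sketch of how the flags in the two example commands could be parsed with the standard library's argparse. The helper names `build_parser` and `plot_kwargs` are hypothetical, and the flag set simply mirrors the examples above; it is not the interface of the actual PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the example commands."""
    parser = argparse.ArgumentParser(prog="hvplot")
    parser.add_argument("path", help="data file to open")
    parser.add_argument("-x", help="name of the x dimension or column")
    parser.add_argument("-y", help="name of the y dimension or column")
    parser.add_argument("-c", help="variable used for color")
    parser.add_argument("--groupby", help="dimension to group by")
    # The kind falls back to the explorer when the user does not pass one.
    parser.add_argument("--kind", default="explorer", help="plot kind")
    return parser


def plot_kwargs(args: argparse.Namespace) -> dict:
    """Collect only the plotting options the user actually passed."""
    names = ("x", "y", "c", "groupby")
    return {n: getattr(args, n) for n in names if getattr(args, n) is not None}


args = build_parser().parse_args(
    ["test.nc", "-x", "lon", "-y", "lat", "-c", "air", "--groupby", "time"]
)
print(args.kind, plot_kwargs(args))
```

The resulting dictionary could then be forwarded to whichever plotting method `--kind` names, which would keep the wrapper thin.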

There'd be a mapping of extensions to file readers, e.g.

# file extension -> (reader function, extra keyword arguments)
EXTENSIONS_TO_FILE_READER = {
    ".nc": (xr.open_dataset, {}),
    ".csv": (pd.read_csv, {}),
    ".parquet": (pd.read_parquet, {}),
    ".grib": (xr.open_dataset, {"engine": "cfgrib"}),
    # ...
}

If there's an unrecognized file extension, like .grb2, users can do the following to use the .grib reader:

hvplot test.grb2 -x lon -y lat -c air --groupby time --reader grib
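The extension lookup with a `--reader` override might be sketched as follows. This is an illustration under assumptions: `READERS`, `resolve_reader`, and `load` are hypothetical names, and readers are stored by module and function name (rather than as function objects, as in the mapping above) so that pandas and xarray are only imported when a file is actually loaded.

```python
import importlib
from pathlib import Path

# Hypothetical registry: extension -> (module, function name, extra kwargs).
READERS = {
    ".nc": ("xarray", "open_dataset", {}),
    ".csv": ("pandas", "read_csv", {}),
    ".parquet": ("pandas", "read_parquet", {}),
    ".grib": ("xarray", "open_dataset", {"engine": "cfgrib"}),
}


def resolve_reader(path, reader=None):
    """Pick a registry entry from --reader if given, else from the file suffix."""
    ext = ("." + reader.lstrip(".")) if reader else Path(path).suffix.lower()
    try:
        return READERS[ext]
    except KeyError:
        known = ", ".join(sorted(READERS))
        raise ValueError(
            f"No reader for {ext!r}; pass --reader with one of: {known}"
        ) from None


def load(path, reader=None):
    """Import the reader lazily and call it with its default keyword arguments."""
    module_name, func_name, kwargs = resolve_reader(path, reader)
    func = getattr(importlib.import_module(module_name), func_name)
    return func(path, **kwargs)


print(resolve_reader("test.grb2", reader="grib"))
```

Raising an error that lists the known extensions gives users of an unrecognized format a direct pointer to the override.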

Although this isn't 100% comprehensive, I think it could cover at least 65% of scientists' needs, which is enough to gain traction. I don't think it needs to be a standalone app, but that'd be a nice thing to have.

jbednar commented 6 months ago

@ahuang11's description sounds very reasonable to me.

I'd argue that the file-reader type guessing isn't specific to the CLI reader; it's a valuable function that could be provided to open hvplottable data in general, letting people focus on loading and plotting some data file without necessarily having to understand that Xarray is what you should read NetCDF into and Pandas is what you read Parquet into. Seems useful for people getting started who may know about one data API (typically Pandas) but not others, and doesn't need to be tied to the CLI. Also seems useful for people writing code for working with data, so that they only have to write a switch statement to deal with the various distinct data objects they get back, which is vastly smaller than the number of file formats involved.
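The "switch statement" over returned data objects could look roughly like the sketch below, written here with `functools.singledispatch`. The function name `candidate_dims` is hypothetical, and registering only `pandas.DataFrame` and `xarray.Dataset` is an assumption made for brevity; each registration is skipped when the library is not installed.

```python
from functools import singledispatch


@singledispatch
def candidate_dims(data):
    """Return names that could be offered for x and y; reject unknown containers."""
    raise TypeError(f"Unsupported data object: {type(data).__name__}")


try:
    import pandas as pd

    @candidate_dims.register
    def _(data: pd.DataFrame):
        # Tabular data: any column can serve as an axis.
        return [str(col) for col in data.columns]
except ImportError:
    pass

try:
    import xarray as xr

    @candidate_dims.register
    def _(data: xr.Dataset):
        # Gridded data: the dataset's dimensions are the natural axes.
        return [str(dim) for dim in data.dims]
except ImportError:
    pass
```

Code written against a generic loader then branches on a handful of container types rather than on every file format.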

So ignoring the file reading, I think I'm agreeing with @ahuang11 that the rest can be a very thin wrapper around hvPlot's Python API. In fact, like @philippjfr says of the Explorer:

> The explorer is great functionality and fits in well in hvPlot itself because it is simply offering a UI around functionality we already provide.

I'd think we could say the same of the proposed CLI:

> The command-line tool is great functionality and fits in well in hvPlot itself because it is simply offering a CLI for functionality we already provide.

I.e., isn't the proposed CLI just another interface, same as the Explorer? Seems to me like it's much less heavyweight than the Explorer.

> we start having to build out a lot of other functionality, including the CLI itself, the data loaders, a nice Panel template and application, and lots more.

I'm not sure what Panel template and application would be needed here. Isn't the Explorer already servable as it is?

I think this is coming down to @Hoxbro, @maximlt, and @philippjfr imagining this to become a full-fledged standalone image-plotting application with widgets and functionality of its own, which I agree would be a separate project in its own repository and potentially expansive in scope. That's a great project for someone else to do, based on hvPlot! But it's not what I think @ahuang11 is proposing and what I'm imagining, which is a very lightweight CLI-based way of invoking hvPlot plotting and the hvPlot Explorer to do whatever they already do.

Maybe the best approach here is to make a PR with an MVP of the proposed CLI along with a list of desired but unimplemented features and a list of non-features (things explicitly not considered in scope). My guess is that such a PR won't be big and the list of unimplemented but desired features won't be long, and that it should be clear whether this indeed can be a simple CLI for the hvPlot Python functionality or if it's in danger of becoming some standalone GUI application that belongs elsewhere.

maximlt commented 6 months ago

> The command-line tool is great functionality and fits in well in hvPlot itself because it is simply offering a CLI for functionality we already provide.

Yes, but I seriously doubt that this is going to replace, in scientists' workflows, the more specialized and easier-to-install tools Andrew mentioned in his first post (there wasn't much reaction on the Discourse post).

> the list of unimplemented but desired features won't be long

I'd guess the opposite as this interface is going to be very generic and be limiting for really exploring data (let me filter / transform the data I have in this CSV file before plotting it, let me see the original data in a table, etc.).

I'd be more convinced this feature was needed in hvPlot if users were showing more interest (other people commenting, likes on the issue, etc.), if we'd find other places (Bokeh / Plotly / Pandas / Xarray / ggplot2 / etc.) where users asked for this feature (IIRC the interactive interface came from a discussion in Xarray, showing clear interest in it), if we'd find a similar tool that has some good adoption, etc.

> My guess is that such a PR won't be big

That's maybe true. If someone embarks on working on this PR I'd just like to say that it'd have to come with documentation (reference + mentions where needed) and tests. I've also recently read on some forums that users find hvPlot not lightweight at all, and they're right, having a new dependency won't improve that so it'd be best if it could be avoided.

> I think this is coming down to @Hoxbro, @maximlt, and @philippjfr imagining this to become a full-fledged standalone image-plotting application with widgets and functionality of its own, which I agree would be a separate project in its own repository and potentially expansive in scope

I think that would be a much more useful application with more chance to gain adoption. Maybe it could be based on Lumen, as Lumen does I/O and not hvPlot, and it makes it easier to add filters/transformations and custom views.

ahuang11 commented 6 months ago

> it makes it easier to add filters/transformations and custom views.

I think what you're imagining vs what I'm imagining is way different. From personal experience, most scientists are satisfied with a glance of their data to ensure sensible model output, which the PR does; it's quite lightweight IMO.

jbednar commented 6 months ago

> Maybe it could be based on Lumen, as Lumen does I/O and not hvPlot, and it makes it easier to add filters/transformations and custom views.

I think you're right that Lumen would be a good base for really addressing the needs of a scientist who is comfortable using the command line but not comfortable with Python and not wanting to work in Jupyter. I was such a person before coming to Python, in fact! My Masters and PhD were largely written in that way -- elaborate shell scripts that invoked commands, each written in various languages, all patched together with shell scripting rather than Python. It wouldn't be crazy to exploit Lumen's declarative interface to build a full-featured data-exploration and handling tool that would fit well into a scientist's or engineer's workflow like that.

But because I've been fully bought into "just use Python for everything" for the 20 years since my PhD, and the world finally seems leaning into that too, I wouldn't be the one to push for such a project. If someone external to these projects sees that potential and wants to go for it, I'd be happy to encourage and advise them; there's tons of cool functionality they could get that way. Meanwhile, I'm very happy to keep this CLI focused squarely on exposing "whatever hvPlot Explorer already does" rather than trying to make the CLI be a complete alternative to writing Python.

maximlt commented 6 months ago

> I think you're right that Lumen would be a good base for really addressing the needs of a scientist who is comfortable using the command line but not comfortable with Python and not wanting to work in Jupyter. I was such a person before coming to Python, in fact! My Masters and PhD were largely written in that way -- elaborate shell scripts that invoked commands, each written in various languages, all patched together with shell scripting rather than Python. It wouldn't be crazy to exploit Lumen's declarative interface to build a full-featured data-exploration and handling tool that would fit well into a scientist's or engineer's workflow like that.
>
> But because I've been fully bought into "just use Python for everything" for the 20 years since my PhD, and the world finally seems leaning into that too, I wouldn't be the one to push for such a project. If someone external to these projects sees that potential and wants to go for it, I'd be happy to encourage and advise them; there's tons of cool functionality they could get that way. Meanwhile, I'm very happy to keep this CLI focused squarely on exposing "whatever hvPlot Explorer already does" rather than trying to make the CLI be a complete alternative to writing Python.

I honestly have no idea what you are talking about :) My point was to highlight that a Lumen app would be a much better approach to build a good data explorer. The hvPlot explorer doesn't allow filtering out data, which I think would be the first thing I'd implement if I had to build a data explorer.

jbednar commented 6 months ago

I'm agreeing with you. :-) Yes, Lumen's functionality is needed to build a full-featured data explorer, as hvPlot's functionality is too limited to cover what a scientist or engineer needs.

But I'm also trying to make a distinction between a CLI-focused explorer and Lumen itself, because Lumen itself (at least when using the Lumen Builder) already is such an explorer, or at least I think that's a fully valid use of Lumen Builder. But making Lumen Builder be a great data explorer isn't the same as making a full-featured CLI-based interface to Lumen (not GUI, not YAML, and not Python) that exposes its power in a shell-scriptable way. And I'm saying that doing so, putting all the power of Lumen into a CLI and not just the power of hvPlot, is a viable project but not one that I personally would want to undertake. CLI for quick plots only; Python for the rest!

philippjfr commented 6 months ago

I want to make two points here:

  1. Three maintainers on this project clearly expressed their concerns with shipping this feature. I should have chimed in earlier to reassert my concerns and those of the team but I do not think the process here was handled well. Even an MVP PR puts pressure on us to not waste the effort that was put into it and there should be agreement from the core maintainers before such work kicks off.

  2. I might agree that a "lightweight data viewer" is a useful thing, but I am also certain that if we ship this it will have immediate scope creep. As @maximlt points out in the PR, the first NetCDF file he tried didn't work, so immediately we have to add various forms of error handling to validate that the user input and NetCDF file are structured correctly. Then someone comes along and says, "well, hvPlot handles xarray, tabular data, and geopandas, so why wouldn't the CLI support loading and rendering shapefiles?" Then someone has to add some custom kwargs to the data loader and suddenly the "lightweight" CLI isn't all that lightweight anymore.
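To make the error-handling concern concrete, the smallest form of input validation might look like the sketch below: checking that the names passed for options such as x and y exist in the opened file. The function `validate_names` and its message format are hypothetical and not taken from the PR.

```python
import difflib


def validate_names(requested, available):
    """Return one message per option whose value is not a known name.

    requested: mapping of option name to the value the user passed
    available: names found in the opened file (dimensions, variables, columns)
    """
    problems = []
    for option, name in requested.items():
        if name in available:
            continue
        # Suggest the closest known name, if any is similar enough.
        close = difflib.get_close_matches(name, available, n=1)
        hint = f" Did you mean {close[0]!r}?" if close else ""
        problems.append(
            f"-{option} {name!r} not found; available: {', '.join(available)}.{hint}"
        )
    return problems


for message in validate_names({"x": "lonn", "y": "lat"}, ["lon", "lat", "time"]):
    print(message)
```

Even this small check needs to know how each container exposes its names, which is the kind of per-format logic the scope-creep argument is about.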

> And I'm saying that doing so, putting all the power of Lumen into a CLI and not just the power of hvPlot, is a viable project but not one that I personally would want to undertake. CLI for quick plots only; Python for the rest!

Lumen would have to be generalized to support xarray data and some other data formats first anyway but I agree with the point that Lumen + hvPlot is a much more sensible starting point because it encapsulates the basic building blocks that you need which is "data loaders" + "views" + "UI", instead we are now putting data loading code into hvPlot.

ahuang11 commented 6 months ago

I can appreciate the concern about immediate scope creep, and that data loading shouldn't be in hvplot.

On the other hand, since it's already in a mostly working state (even though it doesn't work for all scenarios), I was thinking maybe it can live in holoviz-dev (or even my personal account) to prevent it from becoming completely wasted / thrown away.

> maximlt points out in the PR the first NetCDF file he tried didn't work

Minor change made it work: hvplot sresa1b_ncar_ccsm3-example.nc x=lon y=lat -var=ua

[image]

jbednar commented 6 months ago

> Even an MVP PR puts pressure on us to not waste the effort that was put into it and there should be agreement from the core maintainers before such work kicks off.

I don't think that's consistent with a multi-stakeholder OSS project; there can't be a requirement for prior review. Making a small PR like this makes it clear what it is and what it does, and it can then be accepted or not accepted.

philippjfr commented 6 months ago

In terms of code it's a small PR, but it certainly was multiple hours of effort on Andrew's part. Anyone can of course propose any change they want, and the risk you take as an external contributor is that a PR isn't merged if it doesn't align with the maintainers' goals; however, usually you would seek agreement from the maintainers before you make such an effort. To ignore that is of course everyone's prerogative, but here we had a situation where all three maintainers were at minimum skeptical if not opposed, and the contribution isn't external, so it's not an issue of requesting prior review but one of team allocation and not wasting effort. Whatever happens, now that it exists we will find a home for it.

jbednar commented 6 months ago

> data loading shouldn't be in hvplot.

I agree; for Pandata projects the logical place for data loading code is intake and/or fsspec. Not being able to divide up responsibilities cleanly in that way is an ongoing issue beyond this PR.

maximlt commented 5 months ago

Running:

hvplot sresa1b_ncar_ccsm3-example.nc x=lon y=lat -var=ua

I get this error:

cli.py: error: unrecognized arguments: -var=ua

Happy to try the right version.

ahuang11 commented 5 months ago

Oh I haven't pushed the updates yet.

ahuang11 commented 5 months ago

Okay I just updated it; missed it earlier because it was not a comment on the PR.