Plotting operational ICON data and benchmarking

guidocioni commented 3 years ago

This is not a bug but instead a request of clarifications and additional infos.

I have a suite of scripts which download, process and plot data from the operational ICON model run at DWD and available on opendata.dwd.de. This data contains a LOT of points so optimizing the processing and plotting of data is paramount in order to maintain the script operational every day.

I'm trying to understand whether a transition from my combination of basemap and matplotlib to psyplot would be possible. In order to do that I'm trying to benchmark execution time of psyplot when doing some of my standard plots. In theory, as far as I understand, for filled contour plots psyplot is calling pyplot.tricontourf under the hood, so the difference should not be that big. The only difference, I guess, is an additional layer of preprocessing on the data, which I'm doing anyway in my script.

However I'm having some problems with psy-maps.

First of all, as the files downloaded from DWD opendata server do not have grid information inside (presumably to save space) I need to do a cdo setgrid before passing the file to psyplot.project.plot.mapplot function otherwise the plotting does not work. As far as I understand psyplot does not only need clat and clon from the grid but also the edges. Is it possible to use one of the more primitive methods before mapplot to pass only three 1-dimensional arrays (e.g. clat, clon and data) as done in pyplot.tricontourf? This way I can use an external file to store only the grid information as I currently do.

Second, I tried to reproduce my basic plot that you see here download (5) which I simply do like this:

dset= read_dataset(variables=['T'])
lat, lon = get_coordinates()
temp_850 = dset['t'].metpy.sel(vertical=850 * units.hPa).load()
temp_850.metpy.convert_units('degC')
fig = plt.figure(figsize=(15, 10))
ax  = plt.gca()
_, x, y = get_projection(lon, lat, "world")
cs = ax.tricontourf(x, y, temp_850[0], extend='both', cmap=get_colormap("temp"),
                                    levels=np.arange(-35., 30., 1.))

read_dataset, get_projection and get_coordinates are functions that I wrote to simplify the code. They only wrap some xarray and basemap calls. This takes about 34 seconds of wall time; the input array has dimensions time: 93, ncells: 2949120. The coordinates lat and lon are saved in a different file.

Trying to do the same with psy-maps I first created a file where I also store the entire grid information inside (not only clat and clon otherwise it doesn't work as said before)

psy.plot.mapplot('~/Downloads/temp/test_psy/T_2020093006_global_grid.nc',
                 name='t', projection='robin',
                 levels=np.arange(238, 303, 2.), cmap=get_colormap("temp"),
                 )

I get this result download (6) which takes about 50 seconds. It seems that the levels parameter is completely ignored and somehow I cannot pass the pyplot axis to mapplot to control the figure size.

Do you have any idea on how to create a plotting script that I could use to reproduce my plotting setup so that I could benchmark it and see whether it is really faster?

Right now in the documentation of psy-maps there are only examples with really small grids so I'm not sure how the packages would behave for larger grids.

Chilipp commented 3 years ago

hey @guidocioni! thanks for your input here! I am currently releasing the new version 1.3.0 for each package (psyplot, psy-simple, psy-maps and psyplot-gui) and has just been finished. I'd be happy to do some benchmarking here.

A few comments:

the default for mapplot is to draw each and every grid cell individually. So it is not making a contour plot (and is therefore ignoring the levels formatoption (see also the docs). This is why it is taking a bit longer.
You can make contour plots (for triangular or rectilinear data) via psy.plot.mapplot(plot='contourf', ...) (see the docs). I did not yet do a lot of experiments with contourf because I prefer plots on the grid cell level, and I remember that there were some issues with warping the data at the map boundaries. Nevertheless, I'd be happy to improve this further.
when you draw each and every grid cell, you need to format the color levels via the bounds formatoption. In general, bounds controls the levels of the colorbar (i.e. the norm in the plt.pcolormesh() function, for instance), and levels controls the levels of the contour plot (as you use it in plt.contourf, for instance). By default, levels is the same as bounds, but you can also tell it to be something else.
concerning your grid info: You can specify a gridfile when you use the psyplot.data.open_dataset function (which is also available via psyplot.project.open_dataset). This would merge the grid information into the xarray dataset, so you should not have to do this by yourself. Your call would then be something like
```
import psyplot.project as psy
ds = psy.open_dataset(
  '~/Downloads/temp/test_psy/T_2020093006_global_grid.nc',
  gridfile="path-to-file-with-gridcell-bounds.nc")
ds.psy.plot.mapplot(...)
```
The reason, why psyplot needs this information is to determine, whether you are plotting unstructured data, or not.
Concerning the data size:

Right now in the documentation of psy-maps there are only examples with really small grids so I'm not sure how the packages would behave for larger grids.

Yes, as the documentation is automatically built on every commit I do not want to have to large datafiles in the examples. And plotting the individual grid cells is quite time-consuming with matplotlib, as it runs on the CPU (although I did already speed it up quite a lot, as I am transforming the coordinates independently of the standard matplotlib approach, see https://github.com/psyplot/psyplot/issues/6 for instance). There is a potential for gaining a lot of speed if I'd add functionality based on VTK (see the psy-vtk project), but so far there has not been time for me to work on this further (and for some reason, the examples on mybinder don't work anymore).

Chilipp commented 3 years ago

somehow I cannot pass the pyplot axis to mapplot to control the figure size.

you can pass an ax keyword to the plotmethod, but this then needs to be initialized with a projection, e.g.

import cartopy.crs as ccrs
ax = ccrs.Robinson()
psy.plot.mapplot(ax=ax, ...)

Alternatively you can also do this afterwards, then you need to do a

sp = psy.plot.mapplot(...)
fig = sp.plotters[0].ax.figure
fig.set_figwidth(...)
fig.set_figheight(...)

guidocioni commented 3 years ago

Hey Philipp, thanks for the feedback. Let me come back to you after testing with your options. It would be really nice to benchmark psy-maps with this "large" data. But you're right, I want to use the pcolormesh kind of method to plot the data as I'm pretty sure this would be quite faster than tricontourf.

Chilipp commented 3 years ago

Hey @guidocioni, sorry, there seems to be a misunderstanding. tricontourf is the faster method as the pcolormesh kind of method (which, for unstructured data is rather a pcolor kind of method) needs to draw every individual grid cell. the pcolormesh kind of method is the default for psyplot, which is why it is slower than your small script.

guidocioni commented 3 years ago

Uhm actually from my previous tests with regular lat-lon grids I found out that pcolormesh is actually faster than contourf in the same way imshow is, as to make the contour you need to interpolate somehow the original data, while with "raster" fill methods you just have to fill the polygons with colors based on the levels. But as a matter of fact I have never tried to compare pcolormesh and tricontourf on unstructured grids as in the standard version of matplotlib if I remember well you cannot use unstructured data in pcolormesh. I'm pretty sure for large grids the overhead of having to interpolate the data to make the contour lines would overcome the "loop" that actually takes care of filling the polygons, but maybe I'm wrong :) Anyway that's exactly why I wanted to perform some benchmarks. I will try to set up a script where we can actually compare the two methods.

Chilipp commented 3 years ago

hey @guidocioni, yes, this is indeed true for regular lat-lon grids. But you cannot use pcolormesh for triangular grids (or unstructured grids in general). Here you would use tripcolor which under the hood uses pcolor. For regular grids, pcolor and pcolormesh produce the same result, but pcolormesh is much faster. However, pcolormesh only works for regular lat-lon grids and does not support the visualization of unstructured data. That is why you fall back to tripcolor (i.e. pcolor) and this is much slower than tricontourf.

PS: As mentioned before, this is how you would do it manually. In psy-maps I am avoiding the use tripcolor because the standard matplotlib way of transforming the coordinates is way to slow, and because it would only work for triangular data. However, it is still slower than the contour plot because, as for pcolor, matplotlib still needs to plot each and every gridcell (patch) individually.

psyplot / psy-maps

Plotting operational ICON data and benchmarking #24