Typehints for injected functionality

lewisjared commented 10 months ago

Describe the bug

Type hints for the injected functions on a ScmRun are not properly handled. Pycharm has no idea that lineplot is an available function which is a pretty poor developer experience.

With large container classes we want to provide a range of behaviour without having to lump it all in the same module which is difficult to understand

Currently we inject functionality for:

netcdf io
xarray io
arithemetic operations
plotting

We also inherit an "OpsMixin" baseclass which correctly sets some magic methods

There are a few ways of achieving this:

Move to a model similar to xarray which uses inheritance to bring together functionality https://github.com/pydata/xarray/blob/e2b6f3468ef829b8a83637965d34a164bf3bca78/xarray/core/arithmetic.py#L122

Pandas doesn't use inheritance in the same way, but simply has a huge frame.py file

Use composition to perform the same action. That might work well for plotting as it seems to be the common pattern used across xarray and pandas. e.g. df.plot.line or ds.plot.scatter. This might cause a large change to the API without some relatively opaque redirection. See https://github.com/pandas-dev/pandas/blob/d04747c367d00ee03c5d008ce5670892d450e801/pandas/core/series.py#L5782 for pandas accessors
Add the IO functions to BaseScmRun, removing the need for the injection. The lower level functionality can happen elsewhere in scmdata.netcdf, but the method knows what to call.

In reality, the soln could be a mixture of all three. plotting = composition arithmetic = inheritance xarray and NC = methods

@mikapfl Any other suggestions or thoughts?

Failing Test

Pycharm autocompletes lineplot with the correct type hint information

znicholls commented 10 months ago

In reality, the soln could be a mixture of all three

This seems pretty sensible. I assume that there's no good way to avoid inheritance for the arithmetic (I have always found mix-ins very opaque but having a huge definition file also seems painful so maybe it is the best tradeoff here)?

mikapfl commented 10 months ago

For nice developer experience with primap2, I did some horrible things with pyi stubs: https://github.com/pik-primap/primap2/blob/main/primap-stubs.patch

However, since we are controlling the classes here, we probably don't have to do anything like that. The problem in primap2 is that we are providing new methods on xarray classes directly, using the xarray mechanism to register that. Honestly, the amount of weird stuff you have to do to support the pythonic API way of "method on class" makes me question if it is really worth it. Is

a = run.filter(category=x)

really so much better than

import scmdata as sd

a = sd.filter(run, category=x)

to warrant the added complexity?

I know everyone likes the pipe-style of chaining multiple things, so run.a().b().c() instead of c(b(a(run))), but still, in the primap2 developer docs there is a section on how to add a function. You'd hope that you don't have to write 2000 characters on how to add a function.

But enough philosophical musings, back to the topic at hand: your proposal looks sensible. There is a nice hierarchy of "core" (inheritance) to less "core" (composition, methods) functionality. I'd maybe opt for a huge class with all the arithmetic over inheritance because MixIns are really super opaque, but either way is a solid compromise.

znicholls commented 10 months ago

I know everyone likes the pipe-style of chaining multiple things, so run.a().b().c() instead of c(b(a(run)))

I would pay a stupidly high price to get chaining. I don't know why, it is just so much easier to reason for me.

mikapfl commented 10 months ago

Fair enough. Maybe the reasonable compromise is: pay for chaining for functionality which chains well (e.g. filter), but maybe don't pay for chaining for functionality which is a dead end (e.g. plotting). So the norm for plotting would be:

data = run.a().b().c()
sd.lineplot(data, plotting_args)

I honestly think that is even better than

run.a().b().c().lineplot(plotting_args)

simply because you can set breakpoints on the data = line and inspect the data to plot, or make another plot with the same data etc.

znicholls commented 10 months ago

Yep smart

On Fri, Sep 8, 2023 at 10:18 AM Mika Pflüger @.***> wrote:

Fair enough. Maybe the reasonable compromise is: pay for chaining for functionality which chains well (e.g. filter), but maybe don't pay for chaining for functionality which is a dead end (e.g. plotting). So the norm for plotting would be:

data = run.a().b().c()sd.lineplot(data, plotting_args)

I honestly think that is even better than

run.a().b().c().lineplot(plotting_args)

simply because you can set breakpoints on the data = line and inspect the data to plot, or make another plot with the same data etc.

— Reply to this email directly, view it on GitHub https://github.com/openscm/scmdata/issues/253#issuecomment-1711269511, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFUH5G2K6BEWPNWCFEG4ZYTXZLIG7ANCNFSM6AAAAAA4PXVKHY . You are receiving this because you commented.Message ID: @.***>

openscm / scmdata

Typehints for injected functionality #253