scipp / sciline

Build scientific pipelines for your data
https://scipp.github.io/sciline/
BSD 3-Clause "New" or "Revised" License
10 stars 2 forks source link

Add `compute_mapped` #170

Closed SimonHeybrock closed 4 months ago

SimonHeybrock commented 5 months ago

This fixes a big UX hurdle when computing results that use mapped nodes.

We will later on consider if this should be integrated more tightly with Pipeline.compute, for now this should serve as a solution that is 80% there an can be used to gather more insights/experience.

SimonHeybrock commented 5 months ago

Ok for a prototype.

This is not meant to be a prototype.

And I would also prefer not using pandas if possible because users might not use pandas either.

We could use a plain dict, but then we do not have index names (aka dims), so one would need to implement a custom data structure for this, including ultimately something like a multi-index (or, simpler something N-D like Scipp, but that won't work any more as soon as Cyclebane supports groupby). I don't think users would prefer using a custom class from an unknown library over a very well known and package in widespread use. Do you have something else in mind?

But in the long run, I would prefer merging this with compute.

It is not yet clear to me that this is a good idea, so I refrained from expressing a preference at this point. Pipeline.compute can compute multiple keys (returning a dict, instead of a single value). So we would need to return a dict containing one or more pandas.Series (instead of a single Series). Would it be too confusing to get a dict, a Series, or a dict containing Series, depending on the args? It is starting too look like a bad interface. Maybe the methods for one target (or a single mapped target) should be separate from the one computing multiple targets? Edit: Furthermore, I had to add an optional index_names argument, in case there are multiple mapped nodes with the same name, which would be cumbersome to integrate into compute with multiple targets.