[JOSS] Minimal pandas support

calebweinreb commented 5 months ago

Re JOSS review openjournals/joss-reviews#6889

There currently appears to be a mismatch between the level of pandas support implied by the JOSS manuscript and that which is actually available in the scores package. (Apologies in advance if I have overlooked any pandas options already in the package).

Currently, support for pandas is advertised at several points in the JOSS manuscript, including the title, the compatibility section, and as an advantage over existing packages (e.g., xskillscore).
However as far as I can tell, the pandas support remains truly minimal at this point. Of the 50 metrics supplied by scores, only 3 are supported in pandas (MAE, MSE, RMSE), and all of them are routine calculations that could be obtained through many other packages. Furthermore, with the exception of "is_angular", many of the useful option in scores (weighting, preserve/reduce) are not afforded through the pandas API.
Though the JOSS manuscript does describe the pandas support as "expanding", this feels somewhat misleading given how little is currently implemented.

I think this could be addressed in two ways.

One option would be to deemphasize pandas support in the JOSS manuscript, e.g., by removing it from the title and in the comparison to alternative packages. Of course, it's awesome that pandas support is on the way -- and this should definitely mentioned -- but in the mean time, one would want to avoid pandas users reading the JOSS paper and falsely concluding that they can use the more non-trivial metrics offered by scores.
Another option would be to temporarily shoehorn pandas support into the rest of the API by wrapping the existing metrics with functions that convert (pandas -> xarray) on the input side and (xarray -> pandas) on the output side (or something like that).

tennlee commented 5 months ago

@calebweinreb Thanks, this is fair. The path of least resistance for now would be to de-emphasise what is in the JOSS paper. However, I would like feedback on one other intermediate option which I’ll outline here.

I have previously looked into the merits of a basic conversion approach versus a more integrated approach to the xarray/pandas APIs, but I don’t think I documented my findings. I have therefore included some brief notes further below. The very short summary is that using a basic conversion approach mostly works but leaves too many rough edges and results in a poor user experience.

In addition to the options for JOSS publication you identified, something else that’s doable would be:

Create a tutorial notebook which sets out how to convert to/from xarray most effectively and avoid issues
Provider a helper conversion method which handles the most common issues, so that a great many of the use cases could be handled in a simple fashion while more fulsome support for pandas is being developed, without requiring the user to do their own conversions entirely on their own.

I would be interested in your thoughts on whether the following changes would be sufficient for the paper: (a) Create a tutorial notebook and helper conversion function as described above (b) Change “expanding support” to “support for pandas is being introduced” (c) Leave the title as is

It is totally reasonable if that’s not seen as sufficient - it’s important pandas users don’t have a negative experience when looking at the package, and the paper shouldn’t overstate things. However I’d appreciate your feedback on that option. If that’s not sufficient, I would be inclined to then remove pandas from the title.

The following are some notes about the merits of a basic conversion approach versus a more integrated approach to the xarray/pandas APIs:

xarray provides some functionality for transforming both to and from pandas, which would form the basis of the pandas support. However, a few things prevented this from meeting all the requirements. None of these are ‘showstoppers’, but would all need to be carefully tested to ensure a quality result.

Examples include: (1) The Pandas series objects which make up the columns in a data frame didn’t always have the “name” set to the header of the column. As a result, sometimes parts of the data structure end up with an unexpected name. This has flow-on effects to accessing, labelling and plotting the data. (2) The default dimension names for new data structures differ between xarray and pandas. This can similarly result in unexpected names for things.

I also have some uncertainty about handling pandas data structures which might have duplicate or semi-duplicate entries when converted to an xarray type. This is unlikely to be a showstopper, but will require use case testing, creating useful error messages, and likely some documentation.

Originally, the hope was to have polymorphic methods in the main API (i.e. you can pass either a data frame or an xarray object into any method) however this became a challenge for type hinting and inheritance. Xarray recommend creating “accessor methods” (https://tutorial.xarray.dev/advanced/accessors/01_accessor_examples.html) rather than creating new classes which inherit from xarray types. Some experimentation revealed that inheritance from xarray types is not straightforward.

As a result, it was a cleaner implementation and also clearer for users to provide a separate API for pandas. This would also allow any different treatments that might be wanted for data frames to be handled there.

tennlee commented 5 months ago

I spent some time yesterday looking into the intermediate option that I suggested. I think it would still end up being a bit rushed, so I would prefer to de-emphasise the role of pandas in the paper until we have a more fulsome implementation. I'll update the paper accordingly and let you know when the update is ready.

tennlee commented 5 months ago

@calebweinreb we have merged a change to the paper into 528-joss-paper-submission which we believe addresses your feedback. Please let us know whether this seems suitable.

calebweinreb commented 5 months ago

The new draft looks great and totally addresses my concern.

nci / scores

[JOSS] Minimal pandas support #577