pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

pm.Model(coords=coords) should preserve coordinate type if it is a pandas Index #5994

Closed aseyboldt closed 2 years ago

aseyboldt commented 2 years ago

We currently convert all coordinates to tuples, but this removes valuable information if the coordinates already are a pandas Index.

I think we should only convert the input values if they are not already a pandas Index, and when conversion is needed we should wrap them in a pd.Index instead of a tuple. A pd.Index is also immutable, but it preserves the meaning of a coordinate.

import pymc as pm
import pandas as pd

coords = {
    "dim": pd.date_range(start="2022-01-01", end="2022-02-01")
}

with pm.Model(coords=coords) as model:
    pm.Normal("x")

type(model.coords["dim"])  # -> tuple
# We lost the valuable information from the original pandas `DatetimeIndex`.
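A minimal sketch of the proposed behavior (`coerce_coord` is a hypothetical helper, not PyMC API): pass an existing pandas Index through unchanged, and wrap anything else in a `pd.Index` rather than a tuple:

```python
import pandas as pd

def coerce_coord(values):
    # Hypothetical helper illustrating the proposal: keep an existing
    # pandas Index as-is, otherwise wrap the values in an immutable
    # pd.Index instead of a tuple.
    if isinstance(values, pd.Index):
        return values
    return pd.Index(values)

dates = pd.date_range(start="2022-01-01", end="2022-02-01")
coerce_coord(dates)       # the DatetimeIndex passes through unchanged
coerce_coord(["a", "b"])  # plain sequences become a pd.Index
```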

Relevant PR that changed this: #5061

michaelosthege commented 2 years ago

I'm not convinced. What valuable information exactly is lost? Can you give an example?

Also, with the current API (tuples) one can do pmodel.coords["city"].index("London") which is incredibly useful when building long-form models. A pd.Index doesn't have .index(), so using pandas indices would be a major breaking change.

In addition, remember that the coords become coords in the InferenceData/xarray, so any "useful information" that a pd.Index could hold would get lost at that later point.

If you can give examples of useful information that should be captured for a coord, we should consider adding it here:

https://github.com/michaelosthege/mcbackend/v0.1.1/2706f8570e395bdcc006111911f6716d73ecc5ee/protobufs/meta.proto#L40-L46

aseyboldt commented 2 years ago

Fair warning up front: I feel strongly about this, I hope this doesn't get too preachy, and if it does I'm sorry ;-)

Just to get the model.coords["something"].index("London") point out of the way: you can do that with an index just as well, but more safely and with more functionality:

[screenshot: the equivalent lookup on a pd.Index]
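In short (a sketch with hypothetical city names): `pd.Index.get_loc` is the Index counterpart of `tuple.index`, and it adds vectorized lookups on top:

```python
import pandas as pd

cities = pd.Index(["Berlin", "London", "Paris"])
cities.get_loc("London")                 # -> 1, same positional lookup as tuple.index
# Unlike tuple.index, the lookup is also available in vectorized form:
cities.get_indexer(["Paris", "Berlin"])  # -> array([2, 0])
```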

About the lost info: A pandas (or xarray, for the most part it just reuses pandas) index is more than the set of values. A few examples:

I guess the reason why I feel so strongly is that I usually put quite a bit of thought into what kind of Index I use for each dimension. If all of that is then just turned into a tuple that doesn't know anything about what it represents, that effort is wasted.

michaelosthege commented 2 years ago

Sorry for the delayed response, I was pretty busy this week...

No doubt that pd.Index is really useful for people who know the tricks, but I'm worried that it could create some really nasty headaches in other places.

First of all, I'm investing a lot of thought into standardizing the metainformation around models/MCMCs in a way that is PPL- and language-agnostic. This has many benefits, including things like live-streaming traces, RVs with varying shapes and much more. I'd be happy to elaborate in a call or something. But this also means that we should think about data structures independently of pandas or xarray. I'm sure that's possible for pd.Index too, and IMO we should start there. After all, if a pd.Index really holds so much more information than a tuple of its values, we'd have to store a bunch of additional fields.

Is the xarray.Index a feature subset/superset of the pd.Index?

The current implementation just does values = tuple(values), which should preserve Datetime/Period/Timedelta values and so on. Memory efficiency of coords at runtime is super irrelevant compared to all the rest.
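To make the trade-off concrete (a sketch, not PyMC internals): tuple(values) does keep the element types, but index-level metadata such as the frequency of a DatetimeIndex has no tuple equivalent:

```python
import pandas as pd

dates = pd.date_range("2022-01-01", periods=3, freq="D")
as_tuple = tuple(dates)

type(as_tuple[0])  # pandas Timestamp: element types survive the conversion
dates.freq         # daily frequency: index-level metadata a tuple cannot carry
```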

So it keeps a list of all strings that could appear around (in the categorical dtype) and stores just integers for the index itself. [...] It also means that it can handle the case where a category contains values that do not appear in the index at all. Say we have four cities in our complete dataset, but the subset we are using in a model happens to only contain three of those.

What does this mean for the dims in a PyMC model? What if it's a MultiIndex? Are xarray and ArviZ compatible with masked-out indexes?

And most people don't know about this feature, which makes it super confusing that the stored category list (or pd.MultiIndex.levels) can sometimes be a superset of set(index.values).
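The categorical case described above can be sketched like this (hypothetical city data):

```python
import pandas as pd

# Four cities exist in the complete dataset ...
all_cities = ["Berlin", "London", "Paris", "Rome"]
# ... but the subset used in the model only contains three of them.
idx = pd.CategoricalIndex(["Berlin", "Paris", "London"], categories=all_cities)

"Rome" in set(idx)        # False: not among the index values
"Rome" in idx.categories  # True: the dtype still knows the full category list
```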

aseyboldt commented 2 years ago

Ok, I guess I'm now starting to understand why you prefer to just have tuples, and I can see that in the context of mcbackend that might make a few things easier. I don't think it is a good idea, though, to limit functionality (or remove it; this broke a bunch of my old models I was porting to PyMC 4) because that is more convenient for an external project.

It seems to me we already have something like a standard for traces, namely ArviZ. It sure isn't perfect (e.g. I'd love to have hierarchical variables, for which I guess we'd first need something like https://github.com/pydata/xarray/issues/4118), but overall I'm pretty happy with it. If you think you can improve on it and write a different one, that's great, but it'll take a lot of great features to make me use something based on protocol buffers when I can have xarray...

But this also means that we should think about data structures independently of pandas or xarray.

Again, I don't think I want to go from pandas and xarray to protocol buffers. Pandas and xarray sure have their flaws, but for pretty much everything I might want to do, except maybe streaming traces, they seem like a way better fit. And to be honest, I don't care a lot about streaming traces.

aseyboldt commented 2 years ago

Hm, maybe that was a bit harsh...

I actually think mcbackend looks pretty nice, it's just that I'm not so sure it should be the default, and I don't like to get worse interoperability with pandas and xarray because of it.

If you think live preview of the trace while it is sampling is important, I think we can find a solution for that which is much less involved. I'll play around with nutpie in that regard a little; I think the infrastructure there should make it pretty easy to experiment.

If we want to send the trace to a remote machine, then using protobuf to send the updates might be a good solution, but I don't think this needs to be the default. For that use case we could also always just pickle the index object if that makes it easier.
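Pickling indeed round-trips the full Index, metadata included, e.g.:

```python
import pickle
import pandas as pd

idx = pd.date_range("2022-01-01", periods=3, freq="D")
restored = pickle.loads(pickle.dumps(idx))

type(restored)        # DatetimeIndex, not a plain tuple
restored.equals(idx)  # True: values and index metadata survive the round trip
```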

michaelosthege commented 2 years ago

If you think live preview of the trace while it is sampling is important, I think we can find a solution for that, that is much less involved.

This is already possible even without changes to PyMC, but we can reduce complexity in pm.sampling a lot by replacing BaseTrace with Backend/Run/Chain.

McBackend is essentially a cleaned-up refactoring of pm.backends.base.BaseTrace:

mcbackend.Backend is the equivalent of the weakly typed thing that you can pass to pm.sample(trace=...)

mcbackend.Backend.init_run() is the equivalent of BaseTrace.setup(), and mcbackend.Run.create_chain() is the equivalent of calling _choose_backend(), but with a cleaned-up execution order: in PyMC we call _init_trace() once per chain, which in the default case instantiates an NDArray, thereby compiling initial point functions in BaseTrace.__init__ once per chain 😬

In McBackend the order is:

  1. Create a Run with RunMeta information about the variables, shapes, dtypes, coords in the model (the RunMeta class is autogenerated from the protobuf) and thereafter
  2. Run.create_chain() implementations can simply hand over the RunMeta information to the Chain object.

So for the initialization of the storage backend, McBackend takes the good things from pm.BaseTrace and cleans up the mess that we have in pm.sampling.*.
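The described order can be sketched with stub classes (illustrative stand-ins, not the actual mcbackend API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunMeta:
    # Stand-in for the protobuf-generated RunMeta: variables, shapes,
    # dtypes and coords, assembled once before sampling starts.
    variables: tuple

class Chain:
    def __init__(self, meta: RunMeta):
        # Chains just receive the already-built metadata.
        self.meta = meta

class Run:
    def __init__(self, meta: RunMeta):
        self.meta = meta

    def create_chain(self) -> Chain:
        return Chain(self.meta)

# 1. Build the run-level metadata once ...
meta = RunMeta(variables=("x",))
run = Run(meta)
# 2. ... then each chain is handed the same RunMeta; no per-chain setup work.
chains = [run.create_chain() for _ in range(4)]
all(chain.meta is meta for chain in chains)  # metadata is shared, not rebuilt
```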

And then there's conversion to InferenceData: starting from the standardized Backend/Run/Chain interface, and most importantly because metadata about model variables is passed to RunMeta before starting the MCMC, this is pretty straightforward. Meanwhile, PyMC has >1000 lines of code just for InferenceData conversion and corresponding tests. And I'm not even counting the PPL-specific converters in the ArviZ codebase.

michaelosthege commented 2 years ago

Oh and just to be clear, I think our intentions are well aligned: