pymc-devs / pymc-experimental

https://pymc-experimental.readthedocs.io

Fix modelbuilder load #229

Open AuguB opened 1 year ago

AuguB commented 1 year ago

A suggestion for changing the way modelbuilder saves and loads the model.

The two issues this addresses are:

  1. In hierarchical models, the predict data may contain only a subset of the groups present in the fit data, so the coords of the fit data need to be stored so that the group indices in the predict data can be matched against them.
  2. The fit data cannot be saved with the model, as this could lead to privacy issues when models are shared.
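A minimal sketch of the first point (names and data are hypothetical): with the fit-time coords stored, group labels in the predict data can be mapped back to the integer indices they had at fit time, even when only a subset of the groups is present:

```python
import pandas as pd

# Hypothetical "group" coords as they were stored at fit time.
fit_coords = {"group": ["a", "b", "c", "d"]}

# Predict data contains only a subset of the fit-time groups.
predict_groups = pd.Series(["c", "a", "c"])

# Map each label to the index it had in the fit-time coord,
# so it lines up with the learned group-level parameters.
group_idx = pd.Categorical(
    predict_groups, categories=fit_coords["group"]
).codes
print(list(group_idx))  # → [2, 0, 2]
```

Without the stored coords, the subset would be re-indexed from zero and no longer match the fitted parameters.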

michaelraczycki commented 1 year ago

@AuguB, the example I've included in our discussion (DelayedSaturatedMMM) is a hierarchical model using coordinates. In my opinion, defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them. I'm consulting your concerns regarding fit_data with other PyMC devs. I'm not sure, though, whether the case you're describing is a very frequent one, and in a scenario where data privacy could be an issue you can simply erase the fit_data by calling .idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure the details specifying the model were properly preserved to allow replication, doesn't include fit_data, so the privacy issues are addressed with no consequences for the model or its ability to load.

michaelraczycki commented 1 year ago

Also not that closely related to the topic, but pymc-marketing is already integrated with this API, so any change made here would require quite an effort there to adjust (the models themselves, tests, example notebooks, tutorials). So the separation of generating data based on the input dataset from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.

AuguB commented 1 year ago

> @AuguB, the example I've included in our discussion (DelayedSaturatedMMM) is a hierarchical model using coordinates. In my opinion, defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them. I'm consulting your concerns regarding fit_data with other PyMC devs. I'm not sure, though, whether the case you're describing is a very frequent one, and in a scenario where data privacy could be an issue you can simply erase the fit_data by calling .idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure the details specifying the model were properly preserved to allow replication, doesn't include fit_data, so the privacy issues are addressed with no consequences for the model or its ability to load.

I don't see how omitting the fit_data from the idata would have no consequences for the model's ability to load. I see a clear dependence on this data in the example here.

        # load() rebuilds the model from the stored fit_data:
        dataset = idata.fit_data.to_dataframe()
        X = dataset.drop(columns=[model.output_var])
        y = dataset[model.output_var].values
        model.build_model(X, y)

But besides that, to be honest I think that shipping the full training dataset with a model just to make sure the computation graph is correctly constructed feels a bit cumbersome.

michaelraczycki commented 1 year ago

You're right, I completely missed that while writing the response. @ricardoV94, @twiecki, do you see this as an issue that will impact a lot of usual business cases? If data privacy is an edge case, I don't think it should be addressed here, but rather by overloading the load function in a child class defined for that specific customer.

AuguB commented 1 year ago

> Also not that closely related to the topic, but pymc-marketing is already integrated with this API, so any change made here would require quite an effort there to adjust (the models themselves, tests, example notebooks, tutorials). So the separation of generating data based on the input dataset from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.

Yes, I understand that this is an issue. I would love to see this pattern used in a later version; for now I will just use my fork.

Thanks for all the helpful comments.

ricardoV94 commented 1 year ago

@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.

Anyway, about having coords stuff by default, I think that is totally fine. We promote coords as part of a good workflow.

About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility".

I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).

ricardoV94 commented 1 year ago

RE: Model coords. Was it the case that loaded models would lose coords information?

michaelraczycki commented 1 year ago

> RE: Model coords. Was it the case that loaded models would lose coords information?

Nope. I just left the implementation of coords and their preservation to child classes, as part of generate_and_preprocess_data (coords being one of the generation parts). I did it on purpose, because if half of the models won't be hierarchical, we shouldn't force them to implement or inherit methods that are purely there to take care of hierarchy.

michaelraczycki commented 1 year ago

Once again, for hierarchical models I'm treating DelayedSaturatedMMM as the example implementation.

ricardoV94 commented 1 year ago

I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model

michaelraczycki commented 1 year ago

Okay, in that case we can move the parameter definition to ModelBuilder, but the setter function should be an abstract one here

ricardoV94 commented 1 year ago

> @AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.

Apologies, I see you started a discussion (not issue) on that. Ignore that comment :)

ricardoV94 commented 1 year ago

> Okay, in that case we can move the parameter definition to ModelBuilder, but the setter function should be an abstract one here

By abstract you mean an abc.abstractmethod (which must be implemented by every subclass)?

ricardoV94 commented 1 year ago

Aren't the coords saved in the fitted idata though? Why is something extra needed?

AuguB commented 1 year ago

> About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility". I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).

Yes, the data is stored by default, but removing the data from the trace would break the modelbuilder.load as it is now.

> RE: Model coords. Was it the case that loaded models would lose coords information?

No, the coords information is still there. I agree that it is suboptimal to force the user to inherit methods that aren't used (and I don't think we should force the user to implement it by making it abstract). However, right now, a user with privacy-sensitive data whose coords may not overlap completely (like me) is forced to overload fit, predict, save, and load. That, to me, seems like a lot just to avoid a save_coords method.

> Aren't the coords saved in the fitted idata though? Why is something extra needed?

Because the coords dict will be needed in the build_model called by fit, and I thought it would be useful to store the dict explicitly along with the model config, so that it is loaded and re-used in the build_model called by predict, without forcing the user to write boilerplate code that extracts them from the idata.
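For illustration, a minimal sketch of that idea (file name and dict contents are hypothetical, not the actual ModelBuilder format): the fit-time coords are persisted next to the model config, so a predict-time rebuild doesn't need the original data:

```python
import json
import os
import tempfile

# Hypothetical fit-time coords to persist alongside the model config.
model_coords = {"group": ["a", "b", "c"], "obs_id": [0, 1, 2, 3]}

path = os.path.join(tempfile.mkdtemp(), "model_coords.json")
with open(path, "w") as f:
    json.dump(model_coords, f)

# At predict time, load the coords back and pass them to build_model.
with open(path) as f:
    loaded_coords = json.load(f)

print(loaded_coords == model_coords)  # → True
```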

michaelraczycki commented 1 year ago

I may be wrong again (I mostly deal with the implementation, not the actual usage), but you shouldn't build the model again if you want to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored in the model's variables.

AuguB commented 1 year ago

> I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model

You're right. The point is that the predict_data may not have instances of all the coords that were in the fit_data, so to correctly load the model it needs access to the coords of the train data. In the current version of ModelBuilder the model is loaded using the train data itself, but that is not possible in all applications (I work in the medical domain). In my opinion, the overhead of saving the coords independently of the data is acceptable to enable this use case.

AuguB commented 1 year ago

> I may be wrong again (I mostly deal with the implementation, not the actual usage), but you shouldn't build the model again if you want to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored in the model's variables.

Not always, but I do fit -> save -> load -> predict, and then I have to build the model again in load

michaelraczycki commented 1 year ago

Okay, now I get it. Regarding the setter function, in my opinion it's easy: either it's common enough that we should force every class to implement it, or it's not commonly used enough to justify making it abstract, in which case it doesn't belong here. The third SOLID principle (Liskov substitution) states that you should be able to replace any method defined by the parent class with a child's without causing problems:

Derived classes must be substitutable for their base classes. A subclass should enhance (or at least not reduce) the behavior of the parent class. It's not just about method signatures; the behavior of the subclass should align with the expectations set by the parent class.

So in this case the setter method here (without a concrete implementation) would not fulfil those requirements, as the replacement from a subclass would change the outcome.

I'll try to get out the PR making coords a property of ModelBuilder this week, and try to address the load issue, to make it possible to load a model whose data has been wiped. Any thoughts on that @ricardoV94? I could include something more; we'd probably make a release afterwards.

AuguB commented 1 year ago

Okay, great.

One related thing that has not been addressed, and which guided my thinking on this issue, is the following: I want to apply the same preprocessing to the predict_data as I did to the fit_data. So the steps for the two methods (assuming the model has been saved and loaded in between) are:

for fit: preprocess -> save coords -> build model -> sample
for predict: preprocess -> build model -> sample

Deciding against save_coords, the user would have to write two cases in preprocess: one in which the coords are extracted from the fit_data, and one in which they are extracted from the idata. Saving the model_coords would remove the case distinction.

michaelraczycki commented 1 year ago

In that case please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even with the other PyMC models we've integrated ModelBuilder with so far, each needed some functions overloaded, simply because it's almost impossible to make a template class that addresses every single scenario unless you want to enforce a lot of stuff down the road.

ricardoV94 commented 1 year ago

My only vague issue with saving coords separately is that there would now be two sources for the same information. This can eventually become a subtle source of bugs if/when they diverge (e.g., different results in arviz plots and summary statistics). Usually this is bad practice, although generalities don't always apply.

Can't we have a default self.extract_coords that retrieves them from self.idata by default? A user can then override this method if they don't have all the coords in the idata for whatever reason (and that user-defined method can of course retrieve them from the config). Otherwise I would by default use InferenceData to carry this information, because that's in a sense the standard vehicle.

For example, if arviz changes how coords are stored internally we don't have to do anything in ModelBuilder. But if we are now responsible for encoding/decoding coords we are duplicating this work and potentially diverging from arviz.

ricardoV94 commented 1 year ago

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.

This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

AuguB commented 1 year ago

> In that case please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even with the other PyMC models we've integrated ModelBuilder with so far, each needed some functions overloaded, simply because it's almost impossible to make a template class that addresses every single scenario unless you want to enforce a lot of stuff down the road.

But wouldn't that mean that we have the same preprocessing logic in preprocess_data and _data_setter?

> Can't we have a default self.extract_coords that retrieves them from self.idata by default? A user can then override this method if they don't have all the coords in the idata for whatever reason (and that user-defined method can of course retrieve them from the config). Otherwise I would by default use InferenceData to carry this information, because that's in a sense the standard vehicle.

I agree that it would be cleaner to read the coords from the idata, so I think extract_coords is a good idea. But I also believe we will need to separate preprocess from save_coords. We could cover both cases with something like this:

def extract_coords(self, X, y):
    if self.is_fitted:
        # extract the coords from the stored idata
        self.model_coords = ...
    else:
        self.model_coords = self.extract_coords_from_data(X)

where extract_coords_from_data is abstract.

Then for both fit and predict, we would have preprocess -> extract coords -> build model -> sample

ricardoV94 commented 1 year ago

Sounds reasonable at first glance. I would probably pass both X and y but that's a minor detail at this point.

> where extract_coords_from_data is abstract.

I would make it raise NotImplementedError rather than abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)
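For illustration (class names hypothetical), the practical difference between the two options: an abc.abstractmethod fails at instantiation time for every subclass that doesn't override it, while a NotImplementedError default only fails if and when the method is actually called:

```python
from abc import ABC, abstractmethod


class AbstractBase(ABC):
    @abstractmethod
    def extract_coords_from_data(self, X):
        ...


class DefaultBase:
    def extract_coords_from_data(self, X):
        raise NotImplementedError("Override this if coords come from data")


class A(AbstractBase):
    pass


class B(DefaultBase):
    pass


try:
    A()  # fails immediately: abstract method not implemented
except TypeError as e:
    print("TypeError:", e)

b = B()  # fine; only calling the missing method raises
try:
    b.extract_coords_from_data(None)
except NotImplementedError as e:
    print("NotImplementedError:", e)
```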

Before we go down that road, it would be good to check that this actually solves your problem. See my message above about sample_posterior_predictive and missing/modified mutable data/coords in the idata.

AuguB commented 1 year ago

> Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.
>
> This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?

ricardoV94 commented 1 year ago

> Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset. This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.
>
> I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?

If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.

import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", 0)
    a = pm.Normal("a", x)
    b = pm.Normal("b", a, observed=[100] * 10)

    idata = pm.sample()
    pm.sample_posterior_predictive(idata)  # Sampling: [b]
    del idata.constant_data["x"]
    pm.sample_posterior_predictive(idata)  # Sampling: [a, b]

AuguB commented 1 year ago

> I would make it raise NotImplementedError rather than abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)

Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it and having the same effect as making it abstract? I think it wasn't needed so far because the coords were extracted in generate_and_preprocess_model_data.

ricardoV94 commented 1 year ago

> Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it and having the same effect as making it abstract? I think it wasn't needed so far because the coords were extracted in generate_and_preprocess_model_data.

Ah okay, I missed that

AuguB commented 1 year ago

> If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.

Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0

ricardoV94 commented 1 year ago

> Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0

When you rebuild the model, X must also be set to zero (it has to match, or the same thing will happen). This is an edge case; most models don't have mutable data/coords on the inferred parameters.

Anyway, setting everything to zero (or nan) may be a good way to get around the privacy issues without having to remove things.
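A minimal sketch of that scrubbing idea (the dataset here is made up), using xarray directly since idata groups are xarray Datasets: the variable stays present with its dims and coords, and only the sensitive values are replaced:

```python
import numpy as np
import xarray as xr

# Stand-in for idata.constant_data, with hypothetical sensitive values.
constant_data = xr.Dataset(
    {"x": ("obs", np.array([3.1, 2.7, 9.9]))},
    coords={"obs": [0, 1, 2]},
)

# Zero the values in place; shape, dims, and coords are preserved,
# so downstream code that expects the variable still finds it.
constant_data["x"] = xr.zeros_like(constant_data["x"])

print(constant_data["x"].values)
```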