Open AuguB opened 1 year ago
@AuguB, the example I've included in our discussion (DelayedSaturatedMMM) is a hierarchical model using coordinates. In my opinion, defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them.

I'm consulting other PyMC devs about your concerns regarding the fit_data. I'm not sure, though, that the case you're talking about is a very frequent one, and in a scenario where data privacy could be an issue you can simply erase the fit_data by calling .idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure that the details specifying the model were properly preserved to allow replication, doesn't include fit_data, so the privacy issues are addressed with no consequences for the model or its ability to load.

Also not that much related to the topic, but pymc-marketing is already integrated with this API, so any change done here would result in quite an effort there to adjust (the models themselves, tests, example notebooks, tutorials). The separation of generating data based on the input dataset from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.
I don't see how omitting the fit_data from the idata would have no consequences for the model's ability to load. I see a clear dependence on this data in the example here.
```python
dataset = idata.fit_data.to_dataframe()
X = dataset.drop(columns=[model.output_var])
y = dataset[model.output_var].values
model.build_model(X, y)
```
But besides that, to be honest, I think that shipping the full training dataset with a model just to make sure the computation graph is correctly constructed feels a bit cumbersome.
You're right, I completely missed it while writing the response. @ricardoV94, @twiecki, do you see this as an issue that will impact a lot of usual business cases? If data privacy is an edge case, I don't think it should be addressed here, but rather by overloading the load function in the child class that would be defined for that specific customer.
> Also not that much related to the topic, but pymc-marketing is already integrated with this API, so any change done here would result in quite an effort there to adjust (models themselves, tests, example notebooks, tutorials), so probably the separation of generating data based on the input dataset, and preprocessing will be addressed in the future, but this PR shouldn't be merged, unless those models would be adjusted.
Yes, I understand that this is an issue. I would love to see this pattern be used in a later version, for now I will just use my fork.
Thanks for all the helpful comments.
@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.
Anyway, about having coords stuff by default, I think that is totally fine. We promote coords as part of a good workflow.
About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample, by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility".
I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).
RE: Model coords. Was it the case that loaded models would lose coords information?
> RE: Model coords. Was it the case that loaded models would lose coords information?
Nope, I just left the implementation of coords and their preservation to the child classes, as part of generate_and_preprocess_data (coords being one of the generation parts). I kind of did it on purpose, because I believe that if half of the models won't be hierarchical, we shouldn't force them to implement, or inherit, methods that are purely there to take care of that.
Once again, for hierarchical models I'm treating DelayedSaturatedMMM as an example implementation.
I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model
Okay, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here.
> @AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.
Apologies, I see you started a discussion (not issue) on that. Ignore that comment :)
> Okay, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here
By abstract you mean an abc.abstractmethod (which must be implemented by every subclass)?
Aren't the coords saved in the fitted idata though? Why is something extra needed?
> About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility". I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).
Yes, the data is stored by default, but removing the data from the trace would break the modelbuilder.load as it is now.
> RE: Model coords. Was it the case that loaded models would lose coords information?
No, the coords information is still there. I agree that it is suboptimal to force the user to inherit methods that aren't used (and I don't think we should force the user to implement it by making it abstract). However, right now, if the user has privacy-sensitive data with coords that may not overlap completely (like me), they are forced to overload fit, predict, save, and load. That, to me, seems like a bit much just to avoid the save_coords method.
> Aren't the coords saved in the fitted idata though? Why is something extra needed?
Because the coords dict will be needed in build_model (called by fit), and I thought it would be useful to store the dict explicitly along with the model config, so that it is loaded and re-used in build_model (called by predict), without forcing the user to write the boilerplate code that extracts them from the idata.
I may be wrong again (I'm mostly dealing with the implementation, not the actual usage), but you shouldn't build the model again if you want to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored within the model variables.
> I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model
You're right. The point is that the predict_data may not have instances of all the coords that were in the fit_data, so to correctly load the model it will need access to the coords that were in the training data. In the current version of ModelBuilder, the model is loaded using the training data itself, but that is not possible in all applications (I work in the medical domain). In my opinion, the overhead of saving the coords independently of the data is acceptable to enable this use case.
> I may be wrong again (mostly dealing with implementation, not the actual usage) but you shouldn't build model again if you want to make predictions. _data_setter method should be implemented using pymc.set_data function, which replaces the data stored within the model variables
Not always, but I do fit -> save -> load -> predict, and then I have to build the model again in load.
Okay, now I get it. Regarding the setter function, in my opinion it's easy: either it's common enough that we should force every class to implement it, or it's not commonly used enough to justify making it abstract, in which case it doesn't belong here. The third SOLID principle (Liskov substitution) states that you should be able to replace any method defined by the parent class with a child's without causing problems: derived classes must be substitutable for their base classes. A subclass should enhance (or at least not reduce) the behavior of the parent class. It's not just about method signatures; the behavior of the subclass should align with the expectations set by the parent class.
So in this case a setter method here (without a concrete implementation) would not fulfil those requirements, as the replacement from a subclass would change the outcome.
I'll try to get the PR making coords a property of ModelBuilder out this week, and try to address the load issue, to make it possible to load a model whose data has been wiped. Any thoughts on that, @ricardoV94? I can include something more; we'd probably make a release afterwards.
Okay, great.
One related thing that has not been addressed, and which guided my thinking in this issue, is the following: I want to apply the same preprocessing to the predict_data as I did to the fit_data. So the steps for the two methods (assuming the model has been saved and loaded in between) are:

for fit: preprocess -> save coords -> build model -> sample
for predict: preprocess -> build model -> sample
Deciding against save_coords, the user would have to write two cases in preprocess: one in which the coords are extracted from the fit_data, and one in which the coords are extracted from the idata. Saving the model_coords would take away the case distinction.
In that case, please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even with the other pymc models we've integrated MB with so far, each of them needed some functions overloaded, simply because it's almost impossible to make a template class that addresses every single scenario, unless you want to enforce a lot of stuff down the road.
My only vague issue with saving coords separately is that there are now two sources for the same information? This can eventually become a subtle source of bugs if/when they diverge (e.g., different results in arviz plots and summary statistics). Usually this is a bad practice, although generalities don't always apply.
Can't we have a default self.extract_coords that retrieves them from self.idata by default? A user can then override this method if they don't have all the coords in idata for whatever reason (and that user-defined method can retrieve them from the config of course). Otherwise I would by default use InferenceData to carry this information because that's in a sense the standard vehicle.
For example, if arviz changes how coords are stored internally we don't have to do anything in ModelBuilder. But if we are now responsible for encoding/decoding coords we are duplicating this work and potentially diverging from arviz.
Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.
This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.
> in that case please put the preprocessing logic in _data_setter, it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about 2 different edge cases that can't all be addressed here. Even with other pymc models we've integrated MB with so far each of them needed overloading some functions, simply because it's almost impossible to make a template class that will address every single scenario, unless you want to enforce a lot of stuff down the road.
But wouldn't that mean that we have the same preprocessing logic in preprocess_data and _data_setter?
> Can't we have a default self.extract_coords that retrieves them from self.idata by default? A user can then override this method if they don't have all the coords in idata for whatever reason (and that user-defined method can retrieve them from the config of course). Otherwise I would by default use InferenceData to carry this information because that's in a sense the standard vehicle.
I agree that it would be cleaner to read the coords from the idata, so I think extract_coords is a good idea. But I also believe we will need to separate preprocess from save_coords. We could cover both cases with something like this:
```python
def extract_coords(self, X, y):
    if self.is_fitted:
        self.model_coords = ...  # extract coords from idata
    else:
        self.model_coords = self.extract_coords_from_data(X)
```

where extract_coords_from_data is abstract.
Then for both fit and predict, we would have preprocess -> extract coords -> build model -> sample.
Sounds reasonable at first glance. I would probably pass both X and y, but that's a minor detail at this point.
> where extract_coords_from_data is abstract.
I would make it raise NotImplementedError, not abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it).
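The difference, sketched with toy classes (the class names are made up for illustration):

```python
from abc import ABC, abstractmethod

class AbstractVariant(ABC):
    @abstractmethod
    def extract_coords_from_data(self, X):
        """Every subclass must implement this, even models without coords."""

class DefaultVariant:
    def extract_coords_from_data(self, X):
        # only fails if a workflow actually calls it
        raise NotImplementedError("override this if your model needs custom coords")

class NoCoordsModel(AbstractVariant):
    pass  # forgot (or doesn't need) the override

DefaultVariant()     # fine: the error is deferred to call time
try:
    NoCoordsModel()  # refuses to instantiate at all
except TypeError as e:
    print("abstract variant blocked:", e)
```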
Before we go down that road, it would be good to check this actually solves your problem? See my message above about sample_posterior_predictive and missing/modified mutable data/coords from the idata.
> Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset. This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.
I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?
> I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?
If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.
```python
import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", 0)
    a = pm.Normal("a", x)
    b = pm.Normal("b", a, observed=[100] * 10)
    idata = pm.sample()
    pm.sample_posterior_predictive(idata)  # Sampling: [b]
    del idata.constant_data["x"]
    pm.sample_posterior_predictive(idata)  # Sampling: [a, b]
```
> I would make it raise NotImplementedError, not abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)
Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it, having the same effect as making it abstract? I think so far it wasn't needed because the coords were extracted in generate_and_preprocess_model_data.
> Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement, having the same effect as making it abstract? I think so far it wasn't needed because the coords were extracted in generate_and_preprocess_model_data.
Ah okay, I missed that
> If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.
Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0
> Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0
When you rebuild the model, X must also be set to zero (it has to match, or the same thing will happen). This is an edge case; most models don't have mutable data/coords on the inferred parameters.
Anyway, setting everything to zero (or nan) may be a good way to get around the privacy issues without having to remove things.
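A small illustration of the zeroing idea (a toy InferenceData built with arviz, not ModelBuilder output):

```python
import numpy as np
import arviz as az

idata = az.from_dict(
    posterior={"a": np.random.randn(1, 20)},
    constant_data={"x": np.array([3.0, 1.0, 4.0])},
)

# zero the values in place: the group and variable survive, the data does not
idata.constant_data["x"] = idata.constant_data["x"] * 0
print(float(idata.constant_data["x"].sum()))  # 0.0
```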
A suggestion for changing the way modelbuilder saves and loads the model.
The two issues this addresses are: