stefanradev93 / BayesFlow

A Python library for amortized Bayesian workflows using generative neural networks.
https://bayesflow.org/
MIT License

Amortized point estimators #121

Open marvinschmitt opened 6 months ago

marvinschmitt commented 6 months ago

The paper "Likelihood-free parameter estimation with neural Bayes estimators" (Sainsbury-Dale, Zammit-Mangion, & Huser, 2023) describes neural amortized point estimation, which is generally faster than fully Bayesian neural inference with conditional generative models (the current standard in BayesFlow).

Implementing amortized point estimators in BayesFlow would help with fast prototyping and sanity checks (and surely other tasks, too!).

Some BayesFlow pointers/proposals:

Some links:

zengjice1991 commented 6 months ago

Great proposals! I read the paper, but I am not sure whether the method can generate posterior samples, which would be helpful for detecting multi-modal distributions.

stefanradev93 commented 6 months ago

Yes, the method will not deal with multimodality out of the box. I think its main use case will be as a fast and frugal heuristic to obtain estimates very quickly.

paul-buerkner commented 6 months ago

Idea: can we train the summary nets based on point inference nets and then reuse the trained summary nets in combination with full posterior inference nets?

andrewzm commented 6 months ago

Multi-modality on the marginals should be detectable through the posterior quantiles. The main "compromise" is that there is no easy way to get information on posterior joint dependence generally. But as Stefan said, this is more of an "exploratory" step or an alternative for when the full NF framework is proving difficult to fit.

I like Paul's idea. Possible issue: Summaries that are good for estimating individual parameters might not be relevant/suitable for extracting information on posterior correlations (additional summaries might be needed when doing full posterior inference).

stefanradev93 commented 6 months ago

I like the idea too! We could use the point summary network or the point estimates themselves as summaries, with an auxiliary, smaller summary stream for capturing additional information relevant for the joint.

andrewzm commented 6 months ago

I think using the point estimates as summary statistics would be similar to using expert summary statistics, which we know can work well (e.g., the Jiang et al. Statistica Sinica paper), but what you have at the moment is probably even better. To start off, from what I know of BayesFlow, I think we should keep the statistics network as is and add an option for the user to choose "point estimation" instead of "full inference" -- technically, all that is required is to swap out the invertible NN for a much simpler dense neural network with input dimension equal to the summary statistic dimension and output dimension equal to the parameter dimension. There could even be a sensible default for the architecture of this NN. The user also needs to be able to specify a loss function when choosing "point estimation", but we could start off with the squared-error loss just to get it running.
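As a rough sketch of what that swap could look like (plain Keras with assumed dimensions; not BayesFlow's actual internals):

import tensorflow as tf

# Hypothetical dense point-estimation head that replaces the invertible network:
# input = learned summary statistics, output = one point estimate per parameter.
summary_dim, param_dim = 32, 4  # assumed dimensions for illustration
point_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(summary_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(param_dim),  # unconstrained point estimates
])
point_net.compile(optimizer="adam", loss="mse")  # squared-error loss to start with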

paul-buerkner commented 6 months ago

Yes, I fully agree. My idea was going in a different direction, though; it would be more like research, which we can do once the implementation is done.

paul-buerkner commented 6 months ago

and I did not mean using the point estimates as summary statistics. I agree that would not be very useful in general.

ukoethe commented 6 months ago

I also think we need to do it in two steps:

Thus, only inference will potentially be faster, not training. I suspect that attempts to train summary statistics directly via point estimates might result in poor summaries, because it will be harder for the summary network to capture the uncertainty.

ukoethe commented 6 months ago

A luxury method to train point estimates would enhance the point estimator network with additional outputs for the (co)variance of the point estimate and the fraction of the full posterior explained by the MAP mode.

These outputs could be trained by distillation from the SBI posterior: (1) Find the MAP mode in the NF posterior output => target value of the MAP point estimate. (2) Fit a Gaussian to the MAP mode (= Laplace approximation) to get the (co)variance and the fraction of probability mass explained by the MAP mode.

The distillation approach has the additional benefit that the point estimator would return the true MAP estimates (subject to the accuracy of the normalizing flow) instead of the ground-truth parameterizations of the training runs.
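To make the distillation targets concrete, here is a crude, sample-based sketch of steps (1) and (2) computed from normalizing-flow posterior draws; the function name, the KDE-based MAP search, and the neighborhood rule for the mode are all assumptions for illustration, not an agreed design:

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical distillation targets from NF posterior samples of shape (n_samples, n_params).
def distillation_targets(samples, mode_radius=2.0):
    kde = gaussian_kde(samples.T)                      # density estimate over the draws
    map_est = samples[np.argmax(kde(samples.T))]       # highest-density draw as a MAP proxy
    # Gaussian (Laplace-style) fit restricted to a neighborhood of the MAP mode
    scaled_dists = np.linalg.norm((samples - map_est) / samples.std(axis=0), axis=1)
    local = samples[scaled_dists < mode_radius]
    cov = np.cov(local, rowvar=False)                  # (co)variance of the MAP mode
    mass_fraction = local.shape[0] / samples.shape[0]  # rough share of mass in the mode
    return map_est, cov, mass_fraction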

stefanradev93 commented 6 months ago

Summarizing the ideas so far:

  1. Train point estimators instead of generative networks - simply attaches a configurable MLP to the summary network and uses some norm (L1, L2, L-Infinity, https://www.tensorflow.org/api_docs/python/tf/norm) as a loss.

    • Advantage: fast and pleasant to iterate over for larger problems.
    • Disadvantage: no full Bayesian inference (by design).
  2. Train point estimators and use the pre-trained summary network in tandem with a full generative network - uses a frozen summary network, possibly in combination with another, smaller and trainable summary network, to train a full workflow (a rough sketch of this frozen-summary setup follows below).

    • Advantage: the first summary network is trained fast and already captures some important information. May be equivalent to good hand-crafted summary statistics.
    • Disadvantage: summary statistics may not be good, speed gains diminished due to two-step approach.
  3. Train the default approach and distil it using a point estimator - trains the full workflow (i.e., summary and generative net) and attaches a configurable MLP to the pretrained summary network, possibly learning a summary of the approximate posterior (e.g., MAP).

    • Advantage: the summary statistics will already be good for learning the full posterior (assuming everything has converged just fine), so they will also be good for the point estimator. The point estimator can then be trained hyper-efficiently.
    • Disadvantage: speed gains diminished due to two-step approach, the utility of the point estimates is unclear if full workflow is already good.

Crucially, all three are enabled once the functionality is there. Does anyone want to implement the blueprints Marvin suggested in the first post? Otherwise, I would self-assign in a week or two.
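As a rough, hypothetical sketch of the frozen-summary idea from option 2 (plain Keras with assumed dimensions; the real summary networks in BayesFlow handle structured simulation data, so this only illustrates the wiring):

import tensorflow as tf

# Hypothetical sketch: reuse a summary network pre-trained with a point estimator,
# freeze it, and add a smaller trainable summary stream alongside it.
data_dim, frozen_dim, aux_dim = 50, 32, 8  # assumed dimensions for illustration
pretrained_summary = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(frozen_dim),
])
pretrained_summary.trainable = False  # keep the point-estimation summaries fixed

aux_summary = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(aux_dim),
])

inputs = tf.keras.Input(shape=(data_dim,))
summaries = tf.keras.layers.Concatenate()([pretrained_summary(inputs), aux_summary(inputs)])
combined_summary_net = tf.keras.Model(inputs, summaries)  # conditions the generative net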

andrewzm commented 6 months ago

Looks good. I think 1. is where we should start, as that will help establish the core benefits, and then move on from there. Matt and I will be happy to provide comments and check changes/provide suggestions as they come through; we can also contribute code, but probably someone who knows the internals should drive it.

stefanradev93 commented 5 months ago

Hi all, I have implemented and tested a first working version of an amortized point estimator, which can be found in the Development branch.

An example use case is:

import bayesflow as bf
from bayesflow.benchmarks import Benchmark

# Trivial Gaussian benchmark or a real model
benchmark = Benchmark('gaussian_linear', mode='posterior')

# An easy-to-use MLP with residual connections
inference_net = bf.helper_networks.ConfigurableMLP(input_dim=10, dropout_rate=0.05)

# Can be any norm in [1, 2, np.inf] and can also use a summary network
amortizer = bf.amortizers.AmortizedPointEstimator(inference_net, norm_ord=2)

# Training can happen as usual
trainer = bf.trainers.Trainer(amortizer=amortizer, configurator=benchmark.configurator)
data = benchmark.generative_model(5000)
h = trainer.train_offline(data, epochs=50, batch_size=32)

# Quick point estimates can be obtained by simply calling the .estimate() method
test_data = benchmark.configurator(benchmark.generative_model(100))
estimates = amortizer.estimate(test_data)

Let me know what you think.

marvinschmitt commented 5 months ago

Great, thanks! Just adding two pointers to Stefan's post:

paul-buerkner commented 5 months ago

The workflow looks good to me. How would you generalize the interface to different loss functions for different point estimates? The norm_ord is perhaps a bit too restrictive an interface for that(?)

stefanradev93 commented 5 months ago

The workflow looks good to me. How would you generalize the interface to different loss functions for different point estimates? The norm_ord is perhaps a bit too restrictive an interface for that(?)

The current AmortizedPointEstimator supports a loss_fun argument, which takes precedence over norm_ord when provided and allows for generic loss functions. We may want to allow some string arguments that default to useful estimators not captured by the simple norm_ord, or provide a more semantic interface (e.g., loss_fun="mean").
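For illustration, a minimal sketch of a custom loss_fun under the current interface, where the function receives the difference between the network output and the parameters (see the discussion further down); the pseudo-Huber form here is just an example, not part of the implemented API:

import tensorflow as tf

# Hypothetical custom loss on the difference (net_output - parameters),
# assuming the current one-argument loss_fun interface.
def pseudo_huber_loss(difference, delta=1.0):
    # Smoothly interpolates between squared error (near zero) and absolute error (in the tails)
    return tf.reduce_mean(delta**2 * (tf.sqrt(1.0 + (difference / delta) ** 2) - 1.0))

# amortizer = bf.amortizers.AmortizedPointEstimator(inference_net, loss_fun=pseudo_huber_loss)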

zengjice1991 commented 5 months ago

Great job! Does the current workflow support uncertainty quantification for the estimated parameters? Do we need to apply the bootstrap for UQ, just like the reference paper did?

stefanradev93 commented 5 months ago

You would currently need bootstrap or Monte Carlo dropout for UQ, unless a custom loss_fun already entails UQ.

vpratz commented 5 months ago

The interface looks good to me as well. @han-ol and I implemented the illustrative example from the paper (see #145) and the results look sensible. Feel free to take a look and to improve the notebook, maybe it can serve as the foundation of a more detailed tutorial notebook on the method.

zengjice1991 commented 5 months ago

The interface looks good to me as well. @han-ol and I implemented the illustrative example from the paper (see #145) and the results look sensible. Feel free to take a look and to improve the notebook, maybe it can serve as the foundation of a more detailed tutorial notebook on the method.

Can I ask for the link to this notebook? I can look into it and learn from it. Thanks!

marvinschmitt commented 5 months ago

The interface looks good to me as well. @han-ol and I implemented the illustrative example from the paper (see #145) and the results look sensible. Feel free to take a look and to improve the notebook, maybe it can serve as the foundation of a more detailed tutorial notebook on the method.

Can I ask for the link to this notebook? I can look into it and learn from it.

Thanks!

https://github.com/stefanradev93/BayesFlow/blob/Development/examples/Amortized_Point_Estimation.ipynb

han-ol commented 5 months ago

Hi all, the current interface supports loss functions of the form $\mathcal{L} = f(\hat\theta - \theta)$. This covers many of the interesting loss functions. However, my understanding is that some useful losses cannot be written in this form.

To support losses that are not simply functions of the difference $\hat\theta - \theta$, I suggest we change the current signature of the custom loss_fun from loss_fun(net_output - parameters) to loss_fun(net_output, parameters).

The implementation replaces tf.norm with a wrapper that takes two arguments, so the convenient functionality of just choosing norm_ord stays untouched and the illustrative notebook runs just as before. I am not sure whether we want this wrapper and whether it is well placed in losses.py.
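For concreteness, a minimal sketch of what such a two-argument wrapper could look like (the name norm_loss and its exact form are assumptions, not the actual implementation):

import tensorflow as tf

# Hypothetical wrapper: exposes the old norm_ord behavior through the
# proposed two-argument signature loss_fun(net_output, parameters).
def norm_loss(norm_ord=2):
    def loss_fun(net_output, parameters):
        return tf.reduce_mean(tf.norm(net_output - parameters, ord=norm_ord, axis=-1))
    return loss_fun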

Let me know what you think!

andrewzm commented 5 months ago

Sorry for the delay -- this is really nice, thanks @stefanradev93 for the implementation and @vpratz @han-ol for the easy-to-follow notebook, very cool!

I think this is already great as is. From a user point of view, I agree with @han-ol that it would be helpful to have a generic loss interface and possibly also some "template" losses one could use. One possibility is to provide some template losses (e.g., normed_difference, quantile_loss, etc.) that take in net_output and parameters and then have additional arguments specific to that loss (e.g., the power of the normed difference, the quantile to target, etc.). Users can of course specify their own losses, but the templates would cover 95% of use cases.

The last thing which, as a user, would be really handy is a function like amortizer.bootstrap() that does the bootstrap for you. The easiest would be the parametric bootstrap, where amortizer.estimate() is run on the observed data to give you the estimate, the simulator is run at that estimate to give you an N-sized bootstrap sample (assuming the simulator is available and the training data are not fixed beforehand), and then amortizer.estimate() is run again on that sample. This would be useful for all loss functions... e.g., one might want to get a bootstrap sample of 95% credible intervals.
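A rough sketch of what such a parametric-bootstrap helper could look like, assuming an amortizer with an .estimate() method, a simulator callable, and a configurator as in the example above (the helper's name and the exact call signatures are assumptions):

import numpy as np

# Hypothetical parametric bootstrap around a fitted point estimator.
# obs_data: configured observed data; simulator(theta, n) -> raw simulated data sets;
# configurator: maps raw simulator output to network inputs.
def parametric_bootstrap(amortizer, simulator, configurator, obs_data, n_boot=200):
    theta_hat = amortizer.estimate(obs_data)     # point estimate from the observed data
    boot_data = simulator(theta_hat, n_boot)     # simulate n_boot data sets at theta_hat
    boot_estimates = amortizer.estimate(configurator(boot_data))
    return np.asarray(boot_estimates)            # e.g., take percentiles for intervals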

Anyway, really nice stuff, very happy to see it implemented in BayesFlow!

marvinschmitt commented 5 months ago

Agreed, great job everyone!

@andrewzm

and possibly also some "template" losses one could use. [...] the templates would cover 95% of use cases.

Based on your expertise and experience with amortized point estimators, would you mind compiling a list of template losses that might cover the vast majority of users' needs?

andrewzm commented 5 months ago

I would say some "experience" rather than "expertise", but I'm fairly confident that the vast majority of cases will be covered by one of the following losses:

  1. Normed difference loss (with a user-chosen value for the exponent, default to 2).
  2. Quantile loss (where the user has to specify the quantile $q$) -- we tested this out in a neural-network setting in Sainsbury-Dale et al. (2024) and it works very well.

Other useful losses include:

  1. Weighted squared-error loss: $L(\theta, \hat\theta(Z); w(\cdot)) = w(\theta)(\hat\theta(Z) - \theta)^2$ where $w(\theta)$ is a user-defined weighting function of $\theta$, e.g., $w(\theta) = \frac{1}{\theta^2}$ (Robert, 2007, Corollary 2.5.2).
  2. The loss function $L(\theta, \hat\theta(Z); \tau) = (\hat\theta(Z) - \theta^\tau)^2$. Such a loss function would be useful for deriving higher-order moments of the posterior distribution. For example, setting $\tau = 2$ and combining this with an estimator for the posterior mean, one can compute the posterior variance $$\textrm{var}(\theta \mid Z) = \textrm{E}(\theta^2 \mid Z) - \textrm{E}(\theta \mid Z)^2.$$ This requires training two networks, though: one for the posterior expectation of the parameter and one for the posterior expectation of its square... and estimating the variance in this way can lead to negative values. If the loss can be made to ingest outputs from another neural network, then one could use the more "robust" formulation of Fan & Yao (1998), who propose the loss $$L(\theta, \hat\theta(Z)) = \big((\theta - \textrm{E}(\theta \mid Z))^2 - \hat\theta(Z)\big)^2,$$ for which the Bayes estimator is $\hat\theta(Z) = \textrm{var}(\theta \mid Z)$ (so positivity can easily be enforced through the neural network architecture).

There are other losses one might consider, e.g., Stein's loss for covariance matrices, which might be useful (one would need to go through the Cholesky factor to ensure positive definiteness of $\hat\Sigma$ in the NN architecture), and Huber losses, although I don't know without further reading what Bayes estimators these would lead to. For starters, I would put in the normed difference, the quantile loss, and the weighted squared-error loss.
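For illustration, minimal sketches of these template losses under the proposed two-argument signature (the names, defaults, and reductions are assumptions, not an agreed interface):

import tensorflow as tf

# Hypothetical template losses with the proposed signature loss_fun(net_output, parameters).
def normed_difference(net_output, parameters, ord=2):
    # Normed-difference loss on the estimation error (e.g., ord=1 or ord=2)
    return tf.reduce_mean(tf.norm(net_output - parameters, ord=ord, axis=-1))

def quantile_loss(net_output, parameters, q=0.5):
    # Pinball loss; its Bayes estimator is the elementwise posterior q-quantile
    error = parameters - net_output
    return tf.reduce_mean(tf.maximum(q * error, (q - 1.0) * error))

def weighted_squared_error(net_output, parameters, weight_fn=lambda theta: 1.0 / theta**2):
    # Weighted squared-error loss with a user-defined weight function w(theta)
    return tf.reduce_mean(weight_fn(parameters) * (net_output - parameters) ** 2)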

Fan, J., & Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3), 645-660.

Robert, C. P. (2007). The Bayesian Choice. New York: Springer.

Sainsbury-Dale, M., Richards, J., Zammit-Mangion, A., & Huser, R. (2023). Neural Bayes estimators for irregular spatial data using graph neural networks. arXiv preprint arXiv:2310.02600.