Decoupling inference and model

reubenharry commented 5 months ago

OK, so I now have a slightly better understanding of the resampling buffer, and so I thought it would be useful to turn the discussion about the particle filter where the model and inference are on different rates into an issue.

So far, I'll just state my understanding of the problem, to make sure we have common ground, and going forward we can discuss ideas building on top of it.

To make things concrete, here's some code:

mainRhineMultiRate =
  modelRhine
        >-- someBuffer -->
          inferenceRhine

(Here, I've removed the dependence on temperature, and the visualization, just because they are orthogonal to the discussion).

Now it seems reasonable to want inferenceRhine to tick at a significantly lower rate than modelRhine, since it is more computationally intensive (it basically has to run n versions of modelRhine in parallel). What then happens is that modelRhine produces m observations (i.e. sensor readings) in the time it takes inferenceRhine to tick once. The question is how the resampling buffer should combine these into 1 observation. Options include:

the resampling buffer actually doesn't do anything (i.e. collect), and passes on all of the observations with their times. It is then the job of the model to receive all of them: I think this just pushes the problem down the road
the resampling buffer performs some sort of average. However, at least the naive version of this isn't really sensible, because inferenceRhine believes the observations all come from (what it believes the current position of the latent variable is, and that cannot be in most of the positions that the latent on the high rate clock of modelRhine suggests)

Abstractly, I think what we want is for an interpolation based on the time of the observations. That is, say that the m observations are collected between times t1 and t2 of the clock of inferenceRhine. We want to view an given observation as a time-weighted average of observations drawn from the position of the latent at t1 and t2.

reubenharry commented 5 months ago

OK, I thought a little bit more, and I have a conjecture about a sensible thing to do here. Rather than describe implementation details, I'll just describe the concept.

As above, suppose observations are produced by a fast clock stochastic process (with observation noise, also fast clocked). For concreteness, let's imagine this is a random walk on the plane, and the noise is Gaussian. Let's say the clock ticks every millisecond.

Let's say the inference is a particle filter running at a tick per second.

Let's suppose that the resampling buffer collects 1000 observations per second, and passes them on to the inference procedure, along with their timestamp (can we do this?). The question is then: how should the collection of particles in the particle filter use these 1000 observations to do reweighting?

Each particle represents a path of the random walk (discretized to one jump per second), but we have access to the past. So for each particle, and a given observation (a tuple of timestamp and a position), at a time $t$ (by the clock of the slow process) we can calculate a weight as follows: first take the time of the observation, and map it onto a value $p \in [0,1]$, where $p=1$ if the timestamp is at $t$ and $0$ if it's at $t_{prev}$ (i.e. the time at the previous tick of the slow clock). The draw from bernoulli p. If the result is True, set the weight to be normalPdf x std obs, where x is the current position, and otherwise, set it to be normalPdf xprev std obs, where xprev is the position at the previous tick.

This way, observations that happen closer in time to the previous tick are more likely to reweight the previous position (smoothing) and ones nearer the current tick reweight the present position.

Conceptually, the idea is that, from the perspective of the slower process, the observations' timestamp is uncertain. The true "latent" timestamp must be one of the ticks of the slow process, but some noise has altered the time.

turion commented 4 months ago

Now it seems reasonable to want inferenceRhine to tick at a significantly lower rate than modelRhine, since it is more computationally intensive

I doubt this assumption. Let me try to explain.

A model of a purely physical system is best described as a StochasticProcess, which implies that it has no clock. I believe that a clock in a model only makes sense when either

Some part of the system can be modelled/simplified as a clock, e.g. describing a decaying nucleus or some discrete system (queueing, decision processes, ...) with a Poisson clock (#238).
Some part of the system is also a Rhine program, e.g. if we have agents whose controlling program is (or can be approximated by) a Rhine program.

In the case of the random walk, this is not the case. There is no intrinsic property of the system we describe that forces us to simulate the model at a particular rate. It's an implementation detail that depends on the machine we're running it on, and on our accuracy requirements. The same goes for the inference. We tune the number of particles and the inference rate such that we use all of the available computing power. In practice, this means simulating just a bit faster than 25 fps so that the discreteness of the simulation is not visible when rendered on the screen, and using all the remaining power for inference.

I don't see how feeding the inference multiple values from different past time steps can help. After all, to process each of these values, we need to do a computation, so I believe handling 1000 ticks from the last second in one inference step will not be much faster as running the inference step a 1000 times and handle only one value each time. If I understand your angle correctly, you're saying that if I handle 1000 ticks at once, I only need to do one simulation step, but I still need to do a 1000 calculations of the PDF for weighting. So there is still the same scaling, it's just that I don't need to do both simulation and weighting. Maybe you're on a good track here and in fact simulation will always/usually be the bottle neck. But I don't see why it should.

The idea with different weights for older values sounds plausible in principle, but I find it ad hoc. You're basically saying "the latent position has diffused since the older measurement, so the older value describes it with less precision, hence we give it a weaker weight". This makes a lot of sense qualitatively, but quantitatively it should depend on the particular stochastic process. For example if the temperature is very low, the position will not move much, and then a measurement from a second ago is nearly as good as one from right now. Also, there is a vague hidden assumption that the process is something like a Levy process with increments with expected value 0, i.e. older values have no systematically different expected value than newer values. But this is already not the case for our example: The particle has a velocity, so if the particle moves e.g. to the right then older values will systematically estimate a value too far to the left.

So, as a bottom line, I believe that if there is no necessity to introduce a clock into the model, one shouldn't, and instead run everything as fast as possible such that buffering won't be needed.

reubenharry commented 4 months ago

Thanks for writing this!

I agree with you that (except for the situations you mentioned) it is natural for the model to be a behavior (i.e. unclocked). However, what I'm wondering about is this:

In practice, our particle filter has to run at some rate, which is limited by speed concerns (especially if the number of particles is large). Let's say that for a given problem, it ticks as fast as it can, which is at 10 Hertz. But suppose that observations arrive at a much higher rate (e.g. around 1000 Hertz). How should the inference procedure deal with the 100 samples (all arriving at different times) that it receives since its last tick?

Perhaps this is a non-problem for many common settings, since we can just run the inference fast enough.

Re the adhoc-ness, those are good points, I agree.

turion commented 4 months ago

In such a situation, when we really don't have the computational power to process all observations, I guess we have to throw some away. This is what's done with the keepLast buffer. Basically, putting the inference on the Busy clock and using keepLast will deal with this situation by dropping all older observations. We don't really have any good clock-specific back pressure mechanism, i.e. slowing the inference clock down when the computation isn't done yet. The Busy clock will go on ticking, and only be slowed down if the whole Rhine slows down.

reubenharry commented 4 months ago

OK, I think we're more or less in agreement except for 2 points:

It seems to me that in cases where inference isn't easy, computational power really is a bottleneck, because accuracy increases with the size of the population. So my thinking is: if there is a clever way to save compute by running inference slowly (e.g. by integrating an ODE or SDE with a large step size), that is desirable. Here, the costly compute is not incorporating the observations, but rather actually evolving the state of the system forward (imagine a high dimensional differential equation, with costly gradient computations, for example - I have such an example I'm working with presently).
While there may not necessary be a universal solution, it seems possible that there are things one can do, either in the resampling buffer or after. For instance, in cases where continuity holds in the trajectory of the latent variable, we'd expect that there are things we can do.

turion commented 4 months ago

I can definitely see the motivation. Still, when I try to incorporate a measurement from a past timestamp, then it is about conditioning the state of that time. The straightforward solution is to make sure that the simulation does a step at that timestamp, hence stepping for each input.

If somehow want to bypass that I think there cannot be a good general solution, but there may be solutions that use a property of a particular process. Maybe I can't simulate the whole process, but that particular likelihood simplifies.

E.g. for the brownian motion I know that the variance grows linearly with time, which gives me the factor by which I need to weight each observation: it will be something like 1 / (sigma^2 + t * temperature) where sigma is the error of the sensor. But this is a property of that particular process that I need to derive externally, and I can't do that in general. I also don't know what kind of properties one would generally need to know about the process. Unless this is clear, I don't know how to proceed.

turion / rhine

Decoupling inference and model #281