tpapp / LogDensityProblems.jl

A common framework for implementing and using log densities for inference.
MIT License

[proposal] A unifying interface for likelihood subsampling #106

Closed Red-Portal closed 1 year ago

Red-Portal commented 1 year ago

Hi,

In the near future, I think the Turing ecosystem will start looking into stochastic gradients, for example for doubly stochastic variational inference and stochastic gradient MCMC. Unfortunately, there isn't currently an elegant way to interact with Turing models for subsampling the likelihood. A good middle ground would therefore be for LogDensityProblems to provide an interface for data subsampling, which Turing could implement when the time comes. I'm thinking of an interface like the following:

  idx_range = LogDensityProblems.datapoint_index(prob)
  idx_batch = StatsBase.sample(idx_range, batch_size; replace=true)
  prob      = LogDensityProblems.update_batch(prob, idx_batch)
  LogDensityProblems.minibatch_logdensity(prob, theta)

The reason I think the index range of the datapoints must be exposed is that some recent inference algorithms need to control the contents of the batch. For example, stochastic gradient methods with reshuffling are known to converge faster, and some stochastic gradient MCMC methods occasionally need to evaluate the full batch.

A separate function for the minibatch, such as minibatch_logdensity, would be needed since the likelihood contribution has to be rescaled to compensate for the subsampling.
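For concreteness, the rescaling could look like the following sketch. All names here (`SubsampledProblem`, `minibatch_logdensity`, the `idx_batch` field, the toy `loglik`) are hypothetical, not part of any existing API; the point is that the minibatch sum is scaled by N/B so its expectation matches the full-data log-likelihood:

```julia
# Hypothetical sketch: an unbiased minibatch estimate of the full
# log-likelihood scales the subsampled sum by N/B (total datapoints
# over batch size).
struct SubsampledProblem{T}
    data::Vector{T}        # all N datapoints
    idx_batch::Vector{Int} # indices of the current minibatch
end

# toy per-datapoint log-likelihood term (unnormalized Gaussian)
loglik(x, theta) = -abs2(x - theta) / 2

function minibatch_logdensity(prob::SubsampledProblem, theta)
    N, B = length(prob.data), length(prob.idx_batch)
    # rescale so the estimate is unbiased for the full-data sum
    (N / B) * sum(loglik(prob.data[i], theta) for i in prob.idx_batch)
end
```

With `idx_batch` covering all N points the estimate coincides with the full-batch log-likelihood, which is what some SGMCMC methods occasionally need.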

Any comments or concerns would be much appreciated!

devmotion commented 1 year ago

Can you explain why this has to be handled by LogDensityProblems? In my experience so far, LogDensityProblems is not concerned with how you implement and define your problem - you just have to make sure that LogDensityProblems.logdensity and the other required traits are defined.
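As a sketch of what "defining the required traits" means in practice, here is a minimal problem type (the `ToyNormal` struct and its unnormalized standard-normal density are mine, for illustration only):

```julia
using LogDensityProblems

# A toy problem: an unnormalized standard normal in `dim` dimensions.
struct ToyNormal
    dim::Int
end

# The log density itself.
LogDensityProblems.logdensity(p::ToyNormal, x) = -sum(abs2, x) / 2

# Required traits: dimension of the argument, and what derivative
# orders are supported (0 = plain log density only).
LogDensityProblems.dimension(p::ToyNormal) = p.dim
LogDensityProblems.capabilities(::Type{ToyNormal}) =
    LogDensityProblems.LogDensityOrder{0}()
```

Nothing in this contract says how `x`-independent state (such as data) is stored or produced, which is the sense in which subsampling sits outside it.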

Couldn't downstream packages just define a

struct MyProb
    ....
end

subsample(rng, ::MyProb) = MyProb(...)

and then call

LogDensityProblems.logdensity(subsample(....), theta)

instead of calling LogDensityProblems.logdensity(::MyProb, theta) directly? Data can be provided in many different ways, using different types, as iterators or arrays, etc., and generally such design choices seem somewhat orthogonal to what I perceive to be LogDensityProblems' goals.
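Spelled out, the pattern above might look like this (everything here, including `subsample` and the fields of `MyProb`, is a hypothetical downstream definition, not part of LogDensityProblems):

```julia
using Random

# A downstream problem type that carries its data and the indices
# of the minibatch currently in use.
struct MyProb{T}
    data::Vector{T}
    batch::Vector{Int}
end

# `subsample` lives in the downstream package and is defined for its
# own type, so no type piracy occurs. It returns a new problem whose
# batch is a fresh random draw (with replacement).
function subsample(rng::Random.AbstractRNG, prob::MyProb, batch_size::Int)
    MyProb(prob.data, rand(rng, eachindex(prob.data), batch_size))
end

# The existing entry point is then called on the subsampled problem:
# LogDensityProblems.logdensity(subsample(rng, prob, 32), theta)
```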

But maybe @tpapp has a different opinion?

Red-Portal commented 1 year ago

I think the current direction of Turing is to use LogDensityProblems as an intermediate representation of probabilistic models, such that Turing provides a log density problem and various inference libraries receive a log density problem. Wouldn't the use of a separate subsample routine break this abstraction? I also think that whether a model can be subsampled is a kind of trait. In that view, wouldn't it be aligned with what LogDensityProblems is trying to achieve?

torfjelde commented 1 year ago

intermediate representation of probabilistic models

I don't think it's meant to be an abstraction for probabilistic models; I think it's just meant to be an abstraction for (potentially unnormalized) densities. That does not necessarily involve any data.

Wouldn't the use of a separate subsample routine break this abstraction?

I'm with @devmotion on this. If, say, subsample is part of AdvancedVI, you just tell the user "you have to make sure your type implements subsample" :shrug: As long as that method is defined in "your" package, no type piracy occurs.

tpapp commented 1 year ago

I concur with @devmotion and @torfjelde on this issue. From the proposal it is not clear to me why the API for subsampling needs to be in this package. A type can implement both this interface and any other interface necessary, e.g. the one that involves subsample.

Red-Portal commented 1 year ago

Okay, I understand your points. Thanks for the discussion!