guilgautier / DPPy

Python toolbox for sampling Determinantal Point Processes
https://dppy.readthedocs.io
MIT License

Update exact_sampling.rst #40

Closed · feynmanliang closed 5 years ago

feynmanliang commented 5 years ago

Clarify chain rule is for correlation functions (not marginal probabilities)



guilgautier commented 5 years ago

Hi @feynmanliang,

Thanks for reaching out! I'd be happy to modify the docs to make things clearer :) Can you detail what you found wrong or unclear in the chain rule for projection DPPs?

Below is an attempt to clarify the reason for the = instead of the ⊆. More generally, I should take some time to refresh the documentation with a simpler flow...


For a projection DPP(K), samples have (a.s.) identical size r = rank(K). To describe the chain rule, one must express the joint distribution of a vector of points (x_1, \dots, x_r) forming a potential sample, i.e., a set S = {x_1, \dots, x_r}.

Applying Proposition 19 of HKPV06 to the finite case yields the following: if one can sample (x_1, \dots, x_r) ~ 1/r! det K_S, then one can forget the order in which the points x_1, x_2, \dots, x_r were drawn and consider S as a valid sample of DPP(K).

Now, going somewhat backwards: since the set S has cardinality r (the size of any sample X of this DPP), one can write det K_S = P[X ⊇ S] = P[X = S]. And since det K_S is invariant under permutation of the points, the joint density of (x_1, \dots, x_r) is indeed 1/r! det K_S.
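
To make the chain rule concrete, here is a minimal NumPy sketch of it for a projection kernel (the function name and the random kernel are illustrative only, not DPPy's API; DPPy's exact samplers are the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random projection kernel K = U U^T, with U having r orthonormal columns.
N, r = 10, 3
U, _ = np.linalg.qr(rng.standard_normal((N, r)))
K = U @ U.T

def sample_projection_dpp(K, r, rng):
    """Draw r points sequentially; at each step, the normalized Schur
    complement of the current subset S is the conditional distribution."""
    S = []
    for _ in range(r):
        if S:
            K_S_inv = np.linalg.inv(K[np.ix_(S, S)])
            cond = np.diag(K) - np.einsum('ij,jk,ki->i', K[:, S], K_S_inv, K[S, :])
        else:
            cond = np.diag(K).copy()
        cond[S] = 0.0                # already-drawn points have Schur complement 0
        cond = cond.clip(min=0.0)    # guard against tiny negative round-off
        probs = cond / cond.sum()    # the normalizer is exactly r - len(S)
        S.append(int(rng.choice(len(K), p=probs)))
    return set(S)

print(sample_projection_dpp(K, r, rng))
```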

feynmanliang commented 5 years ago

Thanks for explaining :). Yes, I agree P[X ⊇ S] = P[X = S] because of equation (9), modulo indicator functions. So nothing is incorrect; my suggestion is just to clarify, in a way that is clearer (for me at least), where the unique property of projection processes (the rank-K correlation function equals the marginal density) shows up.

The chain rule is in terms of marginal probabilities P({x_1, ..., x_r} ⊆ S) rather than joint probabilities P(X = S). More specifically, the Schur complement formula gives the probability P({x_1, ..., x_{r-1}, x_r} ⊆ S | {x_1, ..., x_{r-1}} ⊆ S), since the other conditional is trivial: P({x_1, ..., x_{r-1}, x_r} = S | {x_1, ..., x_{r-1}} = S) = 0. This chain rule holds for all finite DPPs; it's just that for projection DPPs the rank-K joint density equals the rank-K correlation function, which justifies iterative bottom-up sampling that terminates after rank(K) iterations.
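
On a small ground set this can be checked by brute force. A sketch, assuming only that inclusion probabilities are given by det K_A, plus Möbius inversion (all names are mine):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# A generic (non-projection) correlation kernel: eigenvalues strictly in (0, 1).
N = 4
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
K = V @ np.diag(rng.uniform(0.1, 0.9, size=N)) @ V.T

subsets = [frozenset(c) for k in range(N + 1) for c in combinations(range(N), k)]
det_K = {T: float(np.linalg.det(K[np.ix_(sorted(T), sorted(T))])) if T else 1.0
         for T in subsets}

# Moebius inversion of P[T subset of the sample] = det K_T gives the full law:
# P[sample = T] = sum over S containing T of (-1)^(|S| - |T|) det K_S.
P = {T: sum((-1) ** (len(S) - len(T)) * det_K[S] for S in subsets if S >= T)
     for T in subsets}
assert abs(sum(P.values()) - 1.0) < 1e-10

# The chain rule on *inclusion* events holds for this generic K:
# P[{0,1} included | {0} included] = K_11 - K_10 K_01 / K_00.
incl = lambda A: sum(P[T] for T in subsets if T >= A)
lhs = incl(frozenset({0, 1})) / incl(frozenset({0}))
rhs = K[1, 1] - K[1, 0] * K[0, 1] / K[0, 0]
assert abs(lhs - rhs) < 1e-10
```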

It happened recently (https://arxiv.org/pdf/1903.03571.pdf Appendix D, citing https://arxiv.org/pdf/1609.06840.pdf equation (2), citing HKPV06) that the projection DPP requirement for this formula was omitted. I suspect the notation (using P(x_1, ..., x_r) to denote P({x_1, ..., x_r} ⊆ S), which equals P({x_1, ..., x_r} = S) only for projection DPPs) contributed to this, and that using P({x_1, ..., x_r} ⊆ S) in the formulas could have helped clarify things there.

guilgautier commented 5 years ago

> Thanks for explaining :). Yes, I agree P[X ⊇ S] = P[X = S] because of equation (9), modulo indicator functions. So nothing is incorrect; my suggestion is just to clarify, in a way that is clearer (for me at least), where the unique property of projection processes (the rank-K correlation function equals the marginal density) shows up.

Agreed, thank you again for pointing this out :) I am eager to make things clear, to avoid any confusion and make sure the DPP model is fully preserved.

Please read what follows and let me know what you think would be better to write :)

> 1. The chain rule is in terms of marginal probabilities P({x_1, ..., x_r} ⊆ S) rather than joint probabilities P(X = S). More specifically, the Schur complement formula gives the probability P({x_1, ..., x_{r-1}, x_r} ⊆ S | {x_1, ..., x_{r-1}} ⊆ S)
> 2. since the other conditional is trivial: P({x_1, ..., x_{r-1}, x_r} = S | {x_1, ..., x_{r-1}} = S) = 0.
> 3. This chain rule holds for all finite DPPs; it's just that for projection DPPs the rank-K joint density equals the rank-K correlation function, which justifies iterative bottom-up sampling that terminates after rank(K) iterations.
  1. To me, the chain rule as described by [HKPV06] is on joint probabilities for a projection correlation kernel K. A sequential draw yields a vector; there is an order. To be allowed to forget the order and consider the points you've sampled as a set with the appropriate distribution, one must show that the distribution of the vector is exchangeable, i.e., invariant under permutation. Otherwise one may fall into the same trap as the HeGa16 paper you mention below.

  2. I'm sorry, I don't understand what you mean.

  3. Could you clarify what you mean by "applies to all DPPs" and by "bottom up sampling"? To me,

    • applying the chain rule blindly to any (valid) correlation kernel K is wrong in general, and this is the pitfall into which HeGa16 fell
    • bottom up sampling corresponds exactly to what Pou19 does (efficiently) by exploring the binary tree of possibilities (see the sketch just after this list): (i) take item 1? Yes/No with probability P[1 \in X] = K_11; (ii) take item 2, given 1 was selected? Yes/No with probability P[2 \in X | 1 \in X] = K_22 - K_12 K_21 / K_11, cf. Proposition 3; given 1 was not selected? Yes/No with probability P[2 \in X | 1 \notin X] = K_22 + K_12 K_21 / (1 - K_11), cf. Proposition 4; (...); take item N, given ...
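
Here is a minimal NumPy sketch of that tree walk, conditioning through rank-one Schur updates at each Yes/No branch (an illustration of the idea only, neither Pou19's actual algorithm nor DPPy code):

```python
import numpy as np

def sample_dpp_sequential(K, rng):
    """Decide item by item whether it enters the sample, replacing the kernel
    by the Schur complement matching the Yes/No branch that was taken."""
    K = np.array(K, dtype=float)     # work on a copy, K is overwritten below
    sample = []
    for i in range(len(K)):
        p = float(np.clip(K[i, i], 0.0, 1.0))  # P[i in X | previous decisions]
        if rng.random() < p:
            sample.append(i)
            K -= np.outer(K[:, i], K[i, :]) / K[i, i]          # condition on i in X
        else:
            K -= np.outer(K[:, i], K[i, :]) / (K[i, i] - 1.0)  # condition on i not in X
    return sample

rng = np.random.default_rng(2)
N = 6
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
K = V @ np.diag(rng.uniform(0.2, 0.8, size=N)) @ V.T
print(sample_dpp_sequential(K, rng))
```

Note that every one of the N items gets a Yes/No decision, so this walk takes N steps and the sample size is random, in contrast with the projection chain rule, which stops after exactly r = rank(K) steps.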

> It happened recently (https://arxiv.org/pdf/1903.03571.pdf Appendix D, citing https://arxiv.org/pdf/1609.06840.pdf equation (2), citing HKPV06) that the projection DPP requirement for this formula was omitted. I suspect the notation (using P(x_1, ..., x_r) to denote P({x_1, ..., x_r} ⊆ S), which equals P({x_1, ..., x_r} = S) only for projection DPPs) contributed to this, and that using P({x_1, ..., x_r} ⊆ S) in the formulas could have helped clarify things there.

I agree, totally! @rbardenet and I noticed some time ago that HeGa16 was wrong. See also GhRe18-arxiv or the ICML19 version GhRe19. Distortions of the DPP model by the ML community were one motivation for us to set up this toolbox along with the documentation, where we do our best to make accurate statements and to provide some research code. In particular, here is an attempt to explain why the chain rule only works for projection DPPs (and projection k-DPPs); this was written way before https://github.com/guilgautier/DPPy/commit/3da6d17a4875fa996d469acf8d06cc6d26ea798b#diff-3e642918a95c6c1b52ac0098e1fad542

FYI, the first version of [BuRaWi19] has been revised after some discussions with the authors at ICML19

HeGa16 provide only a heuristic that seems to perform well in practice (this was confirmed by the first version of [BuRaWi19]), but blindly applying the chain rule yields neither a valid DPP sample nor a valid k-DPP sample. Note that the paper doesn't even appear on the first author's webpage.

feynmanliang commented 5 years ago
> 1. To me, the chain rule as described by [HKPV06] is on joint probabilities for a projection correlation kernel K. A sequential draw yields a vector; there is an order. To be allowed to forget the order and consider the points you've sampled as a set with the appropriate distribution, one must show that the distribution of the vector is exchangeable, i.e., invariant under permutation. Otherwise one may fall into the same trap as the HeGa16 paper you mention below.

Yes, I agree. I now understand there are two separate things at play here:

  • Exchangeability is what allows us to state that the r! possible sampled sequences corresponding to the same set (from applying the chain rule sequentially) all have the same probability. This is due to telescoping in the product of conditional probabilities (i.e., the ratio of Schur complement terms, cf. the proof of Theorem 2.2 in DW18), which holds for any DPP.

> 2. I'm sorry, I don't understand what you mean.

The events S = {x_1, ..., x_{r-1}, x_r} and S = {x_1, ..., x_{r-1}} are mutually exclusive, so conditioning on the second makes the first have probability zero. I brought this up to make the point that the chain rule is for the r-point correlation function and not for the joint probability mass function P(S = {x_1, ..., x_r}).

> Could you clarify what you mean by "applies to all DPPs" and by "bottom up sampling"?

"applies to all DPPs": you can always write the decomposition P({x_1,...,x_r} C S) = P({x_1} C S) P({x_1, x_2} C S | {x_1} C S) ... P({x_1, ..., x_r} C S | {x1, ..., x{r-1}} C S) which for a DPP becomes a product of telescoping Schur complement terms.

"bottom up sampling": sorry, this is bad word choice from me. This refers to starting with an empty set and adding points (as opposed to top-down / reverse iterative sampling in DW18). I hadn't seen the Pou19 reference, but it looks paths in their tree correspond to both inclusions and exclusions (i.e. bit-vector set representations) and require O(n) steps until terminating whereas the chain rule iterative sampling requires exactly r steps.

> FYI, the first version of [BuRaWi19] has been revised after some discussions with the authors at ICML19

Thank you. I will let my colleagues know.

BTW, I'm happy to do an initial pass at revising this part of the documentation if you are happy with it and would like help :)

guilgautier commented 5 years ago
> • Exchangeability is what allows us to state that the r! possible sampled sequences corresponding to the same set (from applying the chain rule sequentially) all have the same probability. This is due to telescoping in the product of conditional probabilities (i.e., the ratio of Schur complement terms, cf. the proof of Theorem 2.2 in DW18), which holds for any DPP.

That's only the tip of the iceberg! For a projection kernel K, if you write the n-th conditional that appears in the chain rule as (see also (14))

P[x_n | S_{n-1}] = (1 / (r - (n-1))) det K_{S_{n-1} + x_n} / det K_{S_{n-1}}

  1. you are mentioning that the product of the ratios det K_{S_{n-1} + x_n} / det K_{S_{n-1}} telescopes to det K_S, which is invariant under permutation
  2. but more importantly, cf. the Caution section, the normalizing constant of P[x_n | S_{n-1}] is equal to r - (n-1), which is independent of the previously sampled points S_{n-1}. So the product of these normalizations equals 1/r!, which is in turn independent of the final set of points S

Finally, combining 1. AND 2. proves exchangeability of the distribution of (x_1, ..., x_r). But for a generic K, one must show that 2. holds, and a priori this is not the case, cf. the Caution section.
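
To illustrate point 2., here is a small numerical check (a sketch; the kernels and the subset S, playing the role of S_{n-1}, are arbitrary choices of mine): for a projection kernel the total mass of the next conditional is exactly r - |S|, while for a generic kernel it depends on S.

```python
import numpy as np

rng = np.random.default_rng(4)
N, r, S = 8, 3, [1, 5]   # S plays the role of S_{n-1}

def next_normalizer(K, S):
    # sum over all x of ( K_xx - K_{x,S} K_S^{-1} K_{S,x} ),
    # i.e. the normalizing constant of the next conditional P[x_n | S_{n-1}]
    return float(np.trace(K)
                 - np.trace(np.linalg.inv(K[np.ix_(S, S)]) @ K[S, :] @ K[:, S]))

# Projection kernel of rank r: the normalizer is r - |S| = 1, whatever S is.
U, _ = np.linalg.qr(rng.standard_normal((N, r)))
print(next_normalizer(U @ U.T, S))   # ~ 1.0

# Generic kernel: the normalizer varies with S, so the product of the
# normalizations is no longer the constant 1/r!, and 2. fails a priori.
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
K_gen = V @ np.diag(rng.uniform(0.1, 0.9, size=N)) @ V.T
print(next_normalizer(K_gen, S))     # generally != r - |S|
```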

> > 2. I'm sorry, I don't understand what you mean.

> The events S = {x_1, ..., x_{r-1}, x_r} and S = {x_1, ..., x_{r-1}} are mutually exclusive, so conditioning on the second makes the first have probability zero. I brought this up to make the point that the chain rule is for the r-point correlation function and not for the joint probability mass function P(S = {x_1, ..., x_r}).

This is still fuzzy to me, especially because the event S = {x_1, ..., x_{r-1}} has probability 0, and the r-point correlation coincides with the joint density for projection kernels.

> Thank you. I will let my colleagues know.

See also the email I sent you yesterday ;)

> BTW, I'm happy to do an initial pass at revising this part of the documentation if you are happy with it and would like help :)

I'm glad you're giving me feedback, and I appreciate your help a lot! Feel free to contribute and propose revisions of the docs :) I am actually quite busy writing unit tests...

Btw, I haven't worked enough with PRs to know whether you are able to amend this one, or whether it's better that I accept or reject it and you open a new one?

guilgautier commented 5 years ago

Hi @feynmanliang 🙂 Any news about this PR?

feynmanliang commented 5 years ago

Hi Guillaume, sorry I haven't had a chance to work on it yet. We can close this, and I will resubmit a completed set of changes when I find the time.


guilgautier commented 5 years ago

Hi @feynmanliang,

I've spent some time reshaping the exact sampling section entirely. In particular, I've written a specific Caution section on why the chain rule cannot be applied blindly to sample from k-DPP(L).

Feedback is welcome 😃

Enjoy the rest of the summer 🌴