probml / pml2-book

Probabilistic Machine Learning: Advanced Topics
MIT License
1.37k stars 118 forks source link

Word repetitions (PDF Version: 2023-01-02) #201

Closed gdemelo closed 1 year ago

gdemelo commented 1 year ago

Found a bunch of repeated words, many of which appear to be erroneous (e.g. "An event is an an element"). I haven't checked all of them though.

Repeated token 'an' at: discuss in Section 2.1.1.4. 2.1.1.2 Discrete random variables The simplest setting is where the outcomes of the experiment constitute a countable set. For example, consider throwing a 3-sided die, where the faces are labeled “A”, “B” and “C”. (We choose 3 sides instead of 6 for brevity.) The sample space is Ω = {A, B, C}, which represents all the possible outcomes of the “experiment”. The event space is the set of all possible subsets of the sample space, so F = {∅, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}}. An event is an an

Repeated token 'we' at: with broad 28 predictions, and then become more precise in our forecasts as we see more data, which makes intuitive 29 sense. For example, given D = {16}, there are many hypotheses with non-negligible posterior mass, 30 so the predicted support over the integers is broad. However, when we see D = {16, 8, 2, 64}, the 31 posterior concentrates its mass on one or two specific hypotheses, so the overall predicted support 32 becomes more focused. By contrast, the MLE picks the minimal consistent hypothesis, and predicts 33 the future using that single model. For example, if we we

Repeated token 'that' at: consider an example from [Ber85, p7]. Suppose x ∼ N (θ, 1). We observe 7 that x = 5 and we want to estimate θ. The MLE is of course θˆ = 5, which seems reasonable. The 8 posterior mean under a uniform prior is also θ = 5. But now suppose we know that the prior median 9 is 0, and that there is 25% probability that θ lies in any of the intervals (−∞, −1), (−1, 0), (0, 1), 10 (1, ∞). Let us also assume the prior is smooth and unimodal. 11 One can show that that

Repeated token 'be' at: 45 This is the same as the Jeffreys prior for a Gaussian with fixed mean µ and unknown scale σ. This 46 follows since F (σ) = 2/σ2, from Equation (2.282), and hence p(σ) ∝ 1/σ. 47 Draft of “Probabilistic Machine Learning: Advanced Topics”. January 2, 2023 3.8. HIERARCHICAL PRIORS 1 2 3.7.3.3 Learning invariant priors 3 Whenever we have knowledge of some kind of invariance we want our model to satisfy, we can use 4 this to encode a corresponding prior. Sometimes this is done analytically (see e.g., [Rob07, Ch.9]). 5 When this is intractable, it may be be

Repeated token 'we' at: to the data (by virtue 11 of the prior being weak). 12 A better approach is to use the conditional log marginal likelihood, which is defined as follows 13 [Lot+22]: 14 15 N 16 CLML(m) = log p(Dn|D1:n−1, m) 17 n=K (3.227) 18 19 where K ∈ {1, . . . , N } is a parameter of the algorithm. This evaluates the LML of the last N − K 20 datapoints, under the posterior given by the first K datapoints. We can reduce the dependence on 21 the ordering of the datapoints by averaging over orders; if we we

Repeated token 'use' at:

0.5|D) = p(θ|D)dθ 0.5 21 22 We can compute this quantity using numerical integration or analytically [Coo05]. (3.262) 23 24 3.12.3.2 Two sample test of relative proportions (χ2 test) 25 26 27 Now consider the setting where we have J groups, and in each group j we observe yj successes in Nj trials. We denote the success rate by θj, and we are interested in testing the hypothesis that θj is the same for all the groups. In frequentist statistics, we can use a χ2 test. Here we present a Bayesian 28 alternative. 29 30 We will use use

Repeated token 'are' at: the graph is directed, undirected, or some combination of directed and undirected, as we discuss in the sections below. More details on graphical models can be found in e.g., [KF09a]. 4.2 Directed graphical models (Bayes nets) In this section, we discuss directed probabilistic graphical models, or DPGM, which are based on directed acyclic graphs or DAGs (graphs that do not have any directed cycles). PGMs based on a DAG are often called Bayesian networks or Bayes nets for short; however, there is nothing inherently “Bayesian” about Bayesian networks: they are just a way of defining probability distributions. They are are

Repeated token 'the' at: 4.2.2, and in Section 4.2.4, we discuss how to read 31 off other conditional independence properties from the graph. 32 33 4.2.2 Examples 34 In this section, we give several examples of models that can be usefully represented as DPGM’s. 35 36 4.2.2.1 Markov chains 37 38 We can represent the conditional independence assumptions of a first-order Markov model using the 39 chain-structured DPGM shown in Figure 4.1(a). Consider a variable at a single time step t, which we 40 call the “present”. From the diagram, we see that information cannot flow from the past, x1:t−1, to 41 the the

Repeated token 'that' at:

2.5). (d) Conditional joint distribution, 29 p(x, y|z > 2.5). Adapted from [Clo20]. Generated by berksons_gaussian.ipynb. 30 31 32 As a simple example (from [PM18b, p198]), consider tossing two coins 100 times. Suppose you 33 only record the outcome of the experiment if at least one coin shows up heads. You should expect 34 to record about 75 entries. You will see that every time coin 1 is recorded as tails, coin 2 will be 35 recorded as heads. If we ignore the way in which the data was collected, we might infer from the fact 36 that that

Repeated token 'of' at: free grammar in Chomsky normal form. 12 13 14 The feature vectoreffiΨc(ixen, tyc)ocmopuunttastitohne onfutmhebderiscorfimtiimnaenstefaucnhctpiornod, wuchtiicohninruthlee wcaasse uofse(d1,.5a)nisdgiisveunsed to define the energy of a particbuylar tree structure, E(y|x) = −wTΨ(x, y). The probability distribution over trees is given by p(y|x) ∝ exp(−E(y|x)). From Figure 5.2 of [AHT07]. Used with kind permission of Yasemin Altun. 15 lx lx−1 16 F (x, y; w) = wol, Φ(xt) ⊗ Λc(yt) + η wll, Λc(yt) ⊗ Λc(yt+1) , yt=1 1 yt=1 2 17 (1.6) 18 19 where w = wol As indicated in wll is the concatenation Propositionx11, the inner of weights product of of

Repeated token 'a' at: RecCPD is the conditional probability distribution (CPD) for the recommendation node. If 23 represented as a conditional probability table (CPT), this has 2 × 5 × 5 = 50 rows, each with 5 24 entries. This table can encode our assumptions about what kind of ratings a book receives based on 25 the quality of the book, but also properties of the reviewer, such as their honest and kindness. (More 26 sophisticated models of human raters in the context of crowd-sourced data collection can be found in 27 e.g., [LRC19].) 28 We can convert the above formulae into a a

Repeated token 'same' at: = p(y|µ, σ) and pθ = p(y|µ , σ ). The (squared) Euclidean distance between the parameter vectors decomposes as ||θ − θ ||2 = (µ − µ )2 + (σ − σ )2. However, the predictive distribution has the form exp(− 1 2σ2 (y − µ)2), so changes in µ need to be 34 measured relative to σ. This is illustrated in Figure 6.2(a-b), which shows two univariate Gaussian 35 36 distributions (dotted and solid lines) whose means differ by δ. small variance σ2, whereas in Figure 6.2(b), they share the In Figure 6.2(a), they share the same same

Repeated token 'by' at: N } be the data we condition on. The Bayesian posterior can be written as 9 10 11 12 p(θ|D) = Z 1 (D) π0(θ) N n=1 p(yn|fθ (xn)) (6.173) 13 This can be derived by minimizing the BLR, since 14 N 15 L(q) = −Eq(θ) log p(yn|fθ(xn)) + DKL (q(θ) π0(θ)) 16  n=1 17  18 19 = Eq(θ) log π0(θ) Z (D) q(θ) N n=1 p(yn|fθ (xn))  − log Z (D) 20 21 = DKL (q(θ) p(θ|D)) − log Z(D) (6.174) (6.175) (6.176) 22 Since Z(D) is a constant, we can minimize the loss by by

Repeated token 'also' at: Machine Learning: Advanced Topics”. January 2, 2023 6.7. THE BAYESIAN LEARNING RULE 1 2 This is also faster to compute. 3 Putting all this together gives rise to the online Gauss-Newton or OGN method of [Osa+19a]. 4 If we drop the delta approximation, and work with expectations, we get the variational pnline 5 Gauss-Newton or VOGN method of [Kha+18]. We can approximate the expectations by sampling. 6 In particular, VOGN uses the following weight perturbation method 7 8 Eqt ∇ˆ θ (θ) ≈ ∇ˆ θ (θt + t) 9 (6.213) 10 where t ∼ N (0, diag(st)). It also also

Repeated token 'an' at: distribution, i.e., it finds one of the most likely states, 34 or lowest energy states. This approach can be used for both discrete and continuous optimization. 35 See Section 12.9.1 for more details. 36 37 6.9.3 Evolutionary algorithms 38 39 Stochastic local search (SLS) maintains a single “best guess” at each step, xt. If we run this for 40 T steps, and restart K times, the total cost is T K. A natural alternative is to maintain a set or 41 population of K good candidates, St, which we try to improve at each step. This is called an an

Repeated token 'use' at: mutation 37 rules. There are many possible choices for these heuristics. We briefly mention a few of them below. 38 39 • In a genetic algorithm (GA) [Gol89; Hol92], we use mutation and a particular recombination 40 method based on crossover. To implement crossover, we assume each individual is represented 41 as a vector of integers or binary numbers, by analogy to chromosomes. We pick a split point 42 along the chromosome for each of the two chosen parents, and then swap the strings, as illustrated 43 in Figure 6.13. 44 45 • In genetic programming [Koz92], we use use

Repeated token 'that' at: subset of joint probability distributions 35 P(X × X ) with marginals µ and ν, namely 36 Π(µ, ν) {π ∈ P(X 2) : ∀A ⊂ X , π(A × X ) = µ(A) and π(X × A) = ν(A)}. 37 (6.242) 38 Note that Π(µ, ν) is not empty since it always contains the product measure µ ⊗ ν. With this 39 definition, the continuous formulation of (6.239) can be obtained as 40 41 42 OTc(µ, ν) inf c dπ . π∈Π(µ,ν) X 2 43 (6.243) 44 Notice that (6.243) subsumes directly (6.239), since one can check that that

Repeated token 'the' at: systems, discussed in 5 Section 2.2.6. 6 First we derive the prediction step. From Equation (2.100), the joint predictive distribution for 7 states is given by 8 9 p(zt−1, zt|y1:t−1) = p(zt|zt−1)p(zt−1|y1:t−1) 10 11 = N (zt|Ftzt−1, Qt)N (zt−1|µt−1|t−1, Σt−1|t−1) 12 13 =N zt−1 zt |µ , Σ (8.37) (8.38) (8.39) 14 where 15 16 17 µ= µt−1|t−1 Ftµt−1|t−1 ,Σ = Σt−1|t−1 FtΣt−1|t−1 Σt−1|t−1FTt FtΣt−1|t−1FTt + Qt 18 Hence the marginal predictive distribution for states is given by 19 (8.40) 20 p(zt|y1:t−1) = N (zt|Ftµt−1|t−1, FtΣt−1|t−1FTt + Qt) = N (zt|µt|t−1, Σt|t−1) 21 (8.41) 22 Now we derive the the

Repeated token 'are' at: (8.69). 31 32 8.2.3.3 Two-filter smoothing 33 34 Note that the backwards pass of the Kalman smoother does not need access to the observations, y1:T , 35 but does need access to the filtered belief states from the forwards pass, p(zt|y1:t) = N (zt|µt|t, Σt|t). 36 There is an alternative version of the algorithm, known as two-filter smoothing [FP69; Kit04], in 37 which we compute the forwards pass as usual, and then separately compute backwards messages 38 p(yt+1:T |zt) ∝ N (zt|µbt|t, Σtb|t), similar to the backwards filtering algorithm in HMMs (Section 9.2.3). 39 However, these backwards messages are are

Repeated token 'the' at: been linearized. 19 The other setting where the EKF works poorly is when the function is highly nonlinear near the 20 current mean (see Figure 8.5a). 21 A more accurate approach is to use a second-order Taylor series approximation, known as the 22 second order EKF. The resulting updates can still be computed in closed form (see [SS23, Sec 7.3] 23 for details). We can further improve performance by repeatedly re-linearizing the equations around 24 µt instead of µt|t−1; this is called the iterated EKF (see Section 8.3.2.2). In Section 8.4.2, we will 25 discuss an algorithm called the the

Repeated token 'the' at: and measurement functions, 18 and then passing a Gaussian distribtution through the linearized functions, we instead approximate 19 the joint distributions p(zt−1, zt|y1:t−1) and p(zt, yt|y1:t−1) by Gaussians, where the moments are 20 computed using numerical integration; we can then compute the marginal and conditional of these 21 distributions to perform the time and measurement updates. 22 There are many methods to compute the Gaussian moments, as we discuss in Section 8.5.1. Here 23 we use a method based on the unscented transform (see Section 8.4.1). Using the unscented transform 24 for the transition and observation models gives the the

Repeated token 'where' at: state:ex2te5nd IMM to the smoothing case, unlike GPB2, a smoothing version of which isTdhisecpursosbedabiinliStyecotfioansa6m.1p.3le. path, P (xi1:t|y1:t), can be computed recursively using Bayes rule. Typically we will want 26 5.3.2 Viterbi approximation the proposal distribution to be recursive also, i.e., q(x1:t|y1:t) = q(xt|x1:t−1, y1:t)q(x1:t−1|y1:t−1). In this case we have b p(z , m = i|y ) = π N (z |µ , Σ )If2th7ere are a i ltar−ge1number of by28GPB2 and IMM. Instead, dtis−cr1ete one can ii evnauritma−belre1ast,eitthmeadyisbcer1etot:eto−vsal1oluwestoinpearpfortiro−mri1Mo|rt2d−eorr1oefvpenroMbabKti−lFity1u.p(dCatoetms−,pau1st|irtne−gqut1hireeidwr ti posterior probability is as expensive as an exact update step.) This makes sense for a DBN for fault diagnosis, where where

Repeated token 'the' at: of each other, with q(θ) held fixed, and then optimizing q(θ) with the qn held fixed. This is known as variational Bayes EM [BG06]. It is similar to regular EM, except in the E step, 15 16 we infer an approximate posterior for zn averaging out the parameters (instead of plugging in a point estimate), and in the M step, we update the parameter posterior parameters using the expected 17 sufficient statistics. 18 Now suppose we approximate q(θ) by a delta function, q(θ) = δ(θ − θˆ). The Bayesian LVM ELBO 19 objective from Equation (10.68) simplifies to the the

Repeated token 'the' at: expected 17 sufficient statistics. 18 Now suppose we approximate q(θ) by a delta function, q(θ) = δ(θ − θˆ). The Bayesian LVM ELBO 19 objective from Equation (10.68) simplifies to the the “LVM ELBO”: 20 21 22 Ł(ψ1:N , θ|D) = Eq(z1:N |ψ1:N ) [log p(θ, D, z1:N ) − log q(z1:N |ψ1:N )] (10.69) 23 We can optimize this using the variational EM algorithm, which is a CAVI algorithm which updates 24 the ψn in parallel in the variational E step, and then updates θ in the M step. 25 VEM is simpler than VBEM since in the the

Repeated token 'called' at: perform a full pass over the data every τ steps. An 18 alternative approach, called SAGA-LD [Dub+16; Cha+18] (which stands for stochastic averaged 19 gradient acceleration), avoids this by storing all N gradient vectors, and then doing incremental 20 updates. Unfortunately the memory requirements of this algorithm usually make it impractical. 21 22 23 12.7.4 SG-HMC 24 We discussed Hamiltonian Monte Carlo (HMC) in Section 12.5, which uses auxiliary momentum 25 variables to improve performance over Langevin MC. In this section, we discuss a way to speed it up 26 by approximating the gradients using minibatches. This is called called

Repeated token 'see' at: weight — will give the same result in expectation as the original 34 weighted estimate. However, to reduce the variance of the method, we need to pick the resampling 35 method carefully, as we discuss below. 36 37 13.2.4.1 Inverse cdf 38 39 Most of the common resampling methods work as follows. First we form the cumulative distribution 40 from the weights W 1:N , as illustrated by the staircase in Figure 13.4. (We drop the t index for 41 brevity.) Then, given a set of N uniform random variables, U i ∼ Unif(0, 1), we check to see see

Repeated token 'the' at: 6 600 7 8 400 9 10 200 11 0 12 13 0 5 10 t 15 20 25 14 Figure 13.5: Effective sample size at each step for the bootstrap particle filter and a guided particle fil- 15 ter for a Gaussian SSM with Poisson likelihood. Adapted from Figure 10.4 of [CP20b]. Generated by 16 pf_guided_neural_decoding.ipynb. 17 18 19 20 where Ht∗ is the Hessian of the log-likelihood at the mode. We now compute p(zt|zti−1, yt) using the update step of the Kalman filter, using the same equations as in Section 13.3.2. This combination is 21 called the the

Repeated token 'the' at: object that has the following motion model: 20 21 p(zt|zt−1, mt = k) = N (zt|Fzt−1 + bk, Q) 22 (13.65) 23 where zt = (x1t, x˙ 1t, x2t, x˙ 2t) contains the 2d position and velocity. We define the observaton matrix 24 by H = I and the observation covariance by R = 10 diag(2, 1, 2, 1). We define the dynamics matrix 25 by 26   27 1 ∆ 0 0 28 29 F = 00 1 0 0 1 ∆0  30 0 0 0 1 (13.66) 31 where ∆ = 0.1,. We set the the

Repeated token 'to' at: often denote the latent variable z by θ, we 29 define γ˜0(θ) ∝ π0(θ) as the prior, and γ˜(z) = π0(θ)p(D|θ) as the posterior. We can then define the 30 intermediate distributions to be 31 32 γ˜t(θ) = π0(θ)1−λt π0(θ)λt p(D|θ)λt = π0(θ)1−λt exp[−λtE (θ)] (13.76) 33 34 where E(θ) = − log p(D, θ) is the energy (potential) function. The incremental weights are given by 35 36 37 αt(θ) = π0(θ)1−λt exp[−λtE(θ)] π0(θ)1−λt exp[−λt−1E (θ)] = exp[−δt E (θ)] (13.77) 38 where λt = λt−1 + δt. 39 For this method to work well, it is important to to

Repeated token 'the' at: E j ) = P (Ynj,t+1|Xnj ,t, E j ) 41 t=T (14.33) 42 43 qjn = Q(ynj |xjn, DTj ) = Q(ynj |xnj , θ)Q(θ|DTj )dθ (14.34) 44 45 ≈ 1 M M T +τ −1 Q(Ynj,t+1|Xnj ,t, θmj ) 46 m=1 t=T (14.35) 47 Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 578 1 2 where θmj ∼ Q(θ|DTj ) is a sample from the agent’s posterior over the environment. 3 The above assumes that P (Y |X) is known; this will be the case if we use a synthetic data generator, 4 as in the the

Repeated token 'for' at: 4 150 5 125 height height height 200 175 150 125 180 160 140 120 100 6 100 100 80 7 8 75 75 60 50 6.2 20.9 we3ig5h.6t 50.3 65.0 50 6.2 20.9 we3ig5h.6t 50.3 65.0 6.2 20.9 we3ig5h.6t 50.3 65.0 9 10 (a) (b) (c) 11 12 Figure 15.2: Linear regression for predicting height given weight for the full dataset (including children) using 13 polynomial regression. (a) Posterior fit for linear model with log-Gaussian prior for β1. (b) Posterior fit quadratic model with log-Gaussian prior for β2. (c) Posterior fit for quadratic model with Gaussian prior for for

Repeated token 'a' at: (a) Representing lasso using a Gaussian scale mixture prior. (b) Graphical model for group lasso 20 with 2 groups, the first has size G1 = 2, the second has size G2 = 3. 21 22 23 where λ = log π 1−π controls the sparsity of the model, and ||w||0 = D d=1 I (wd = 0) is the 0 norm 24 of the weights. Thus MAP estimation with a spike and slab prior is equivalent 0 regularization; 25 this penalizes the number of non-zero coefficients. Interestingly, posterior samples will also be sparse. 26 By contrast, consider using a a

Repeated token 'as' at: . . , C} is the class label, W is a C × D weight matrix, b 43 is C-dimensional bias vector, and softmax() is the softmax function, defined as 44 45 softmax(a) 46 47 ea1 C c =1 eac ,..., eaC C c =1 eac (15.131) Draft of “Probabilistic Machine Learning: Advanced Topics”. January 2, 2023 15.3. LOGISTIC REGRESSION 1 2 If we define the logits as ηn = Wxn + b, the probabilities as µn = softmax(ηn), and let yn be 3 the one-hot encoding of the label yn, then the log likelihood can be written as as

Repeated token 'admit' at: 89% credible interval for α1 is [−0.29, 0.16] and for α2 is [−0.91, 0.75].2 42 The corresponding distribution for the difference in probability, σ(α1) − σ(α2), is [0.12, 0.16], with a 43 mean of 0.14. So it seems that Berkeley is biased in favor of men. 44 45 2. McElreath uses 89% interval instead of 95% to emphasize the arbitrary nature of these values. The difference is 46 insignificant. 47 Draft of “Probabilistic Machine Learning: Advanced Topics”. January 2, 2023 15.3. LOGISTIC REGRESSION 1 2 1.0 3 1.0 4 0.8 A 5B 6 0.6 0.8 A 0.6 B admit admit

Repeated token 'the' at: 54 45 72 8 *8 565478 5 56 75 74 97 28 1 Filter 2 577921 9 5 57 87 59 32 81 4 585384 10 =5 8 5 3 8 4 1 0 -1 1 0 -1 0 1 0 0 0 -1 1 0 -1 1 1 1 0 1 -1 11 6x6x3 1 0 -1 -1 -1 -1 4x4x2 12 3x3x3 4x4 13 14 Figure 16.3: A 2d convolutional layer with 3 input channels and 2 output channels. The kernel has size 3 × 3 15 and we use stride 1 with 0 padding, so the the

Repeated token 'an' at: 2d, this becomes 17 18 (xTx )2 = (x1x1 + x2x2)2 = (x1x1)2 + (x2x2)2 + 2(x1x1)(x2x2) (18.24) 19 20 We can generalize this to contain all terms up to degree M by using the inhomogeneous 21 polynomial kernel 22 23 K(x, x ) = (xTx + c)M (18.25) 24 25 For example, if M = 2 and the inputs are 2d, we have 26 27 (xTx + 1)2 = (x1x1)2 + (x1x1)(x2x2) + (x1x1) 28 + (x2x2)(x1x1) + (x2x2)2 + (x2x2) 29 + (x1x1) + (x2x2) + 1 30 (18.26) 31 18.2.2.2 Gibbs kernel 32 33 Consider an an

Repeated token 'term' at: compute the likelihood of these parameters as follows: 20 21 p(y|X, θ) = p(D|θ) = p(y|fX , θ)p(fX |X, θ)dfX 22 (18.73) 23 Since we are integrating out the function f , we often call θ hyperparameters, and the quantity p(D|θ) 24 the marginal likelihood. 25 Since f is a GP, we can compute the above integral using the marginal likelihood for the corre- 26 sponding Gaussian. This gives 27 28 29 log p(D|θ) = − 1 2 (y − µX )TKσ−1(y − µX ) − 1 2 log |Kσ| − N 2 log(2π) (18.74) 30 The first term term

Repeated token 'to' at: efficiently using Kronecker structure, as used by KISS. Additionally, 27 we can represent m efficiently using the tensor train decomposition [Ose11] in combination with 28 SKI [WN15]. The resulting TT-GP method can scale efficiently to billions of inducing points, as 29 explained in [INK18]. 30 31 32 18.5.6 Converting a GP to an SSM 33 Consider a function defined on a 1d scalar input, such as a time index. For many stationary 34 1d kernels, the corresponding GP can be modeled using a linear time invariant (LTI) stochastic 35 differential equation (SDE)3; this SDE can then be converted to to

Repeated token 'some' at: to adapt the model to the target distribution. If we have some 33 labeled data from the target distribution, we can use transfer learning, as we discuss in Section 19.5.1. 34 However, getting labeled data from the target distribution is often not an option. Therefore, in the 35 other sections, we discuss techniques that just rely on unlabeled data from the target distribution. 36 37 19.5.1 Supervised adaptation using transfer learning 38 39 40 Suppose we have labeled training data from a source distribution, Ds = {(xn, yn) ∼ p : n = 1 : Ns}, and also some some

Repeated token 'use' at: 47 Draft of “Probabilistic Machine Learning: Advanced Topics”. January 2, 2023 20.3. GOALS OF GENERATIVE MODELING 1 2 the objective is usually non-convex, so we can only reach a local optimum. For other models, we 3 cannot tractably compute the likelihood. In the case of VAEs, we maximize a lower bound on the 4 likelihood; in the case of EBMs and UGMs, we maximize an approximation to the likelihood. For 5 GANs we have to use min-max training, which can be unstable, and there is no clear objective 6 function to monitor. 7 • Latents: does the model use use

Repeated token 'loge' at: p(z) − log q(z|x)] (20.4) (20.5) 37 Thus if we model the density of z ∼ q(z|x), which is a dequantized version of x, we will get a lower 38 bound on p(x). 39 40 41 20.4.1.2 Challenges with using the likelihood 42 Unfortunately, there are several challenges with using the likelihood to evaluate generative models, 43 some of which we discuss below. 44 45 4. In some applications, we report bits per dimension, which is the NLL using log base 2, divided by the dimensionality 46 of x. (To compute this metric, recall that log2 L = loge loge

Repeated token 'we' at: ∼ qφ(z|x). We explain this in detail in Section 6.5.4, but we summarize 46 the basic idea here. 47 Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 784 1 2 The key trick is to rewrite the random variable z ∼ qφ(z|x) as some differentiable (and invertible) 3 transformation r of another random variable ∼ p( ), which does not depend on φ, i.e., we assume 4 we can write 5 z = r( , φ, x) 6 7 For example, (21.29) 8 z ∼ N (µ, diag(σ)) ⇐⇒ z = µ + 9 10 Using this, we we

Repeated token 'favorite' at: τ τyNmax-1 yNmax S1 h i were to buy any groceries . 19 he was silent for a moment . horses are to buy any groceries . 20 21 it was quiet for a moment . h iF t was dark and cold .Forward h F Forward h F Forward there was a pause .0 Encoder 1 Encoder 2 Encoder RNN RNN RNN Forward Encoder RNN hF Ns-1 Forward Encoder RNN h→ μ hz orses tanh horses are toh 0 Decoder buhy 1 aDnecyoderanih 2malDeco.der the RfNaN vorite RaNNny animaRNlN . Decoder RNN h Ns-1 Decoder RNN horses the favorite favorite

Repeated token 'does' at: and last line). Note that the intermediate sentences are grammatical, and semantically related to 27 their neighbors. From Table 8 of [Bow+16b]. (b) Same as (a), but now using a deterministic autoencoder 28 (with the same RNN encoder and decoder). From Table 1 of [Bow+16b]. Used with kind permission of Sam 29 Bowman. 30 31 32 21.3.6.2 Applications 33 34 In this section, we discuss some applications of VAEs to sequence data. 35 36 Text 37 38 In [Bow+16b], they apply the VAE-RNN model to natural language sentences. (See also [MB16; 39 SSB17] for related work.) Although this does does

Repeated token 'as' at: qθ(x) satisfies the above requirement and leads to a consistent 34 estimator [Vaa00], but this choice of s is not available for implicit generative models. 35 This motivates the search for other approaches of integrating the method of moments for implicit 36 models. The MMD can be seen as a moment matching criteria, by matching the means of the 37 two distributions after lifting the data into the feature space of an RHKS. But moment matching 38 can go beyond integral probability metrics: Ravuri et al. [Rav+18] show that one can learn useful 39 moments by using s as as

Repeated token 'well' at: using discriminator and generator 33 regularization, as well as more complex optimization methods. 34 35 36 26.3.4 Improving GAN optimization 37 38 Hyperparameter choices such as the choice of momentum can be crucial when training GANs, with 39 lower momentum values being preferred compared to the usual high momentum used in supervised 40 learning. Algorithms such as Adam [KB14a] provide a great boost in performance [RMC16a]. Many 41 other optimization methods have been successfully applied to GANs, such as those which target 42 variance reduction [Cha+19b]; those which backpropagate through gradient steps, thus ensuring that 43 generator does well well

Repeated token 'are' at: or number of words). Due to 12 the discrete nature of text, GAN models trained on text are explicit, since they explicitly model the 13 probability distribution of the output, rather than modeling the sampling path. This is unlike most 14 GAN models of continuous data such as images that we have discussed in the chapter so far, though 15 explicit GANs of continuous data do exist [Die+19b]. 16 The discrete nature of text is why maximum likelihood is one of the most common methods of 17 learning generative models of text. However, models trained with maximum likelihood are are

Repeated token 'to' at: = πce˜lc Z˜ 14 (28.30) 15 16 17 where L = maxc lc, ˜lc = lc − L, and Z˜ = c πce˜lc . This lets us combine the class prior probability πc with the scaled class conditional log likelihood ˜lc to get the class posterior in a stable way. (We can also compute the log normalization constant, log p(x) = log Z = log(Z˜) + L.) 18 19 20 To compute a single Gaussian log density, ck = and Σc−k1. To make this efficient, we can use the log N (x|µck, Σck), we need matrix determinant lemma to to

Repeated token 'the' at: 11 (a) (b) 12 13 Figure 28.15: Gaussian latent factor models for paired data. (a) Supervised PCA. (b) Partial least squares. 14 15 16 image: 17 18 19 xˆ = Wk∗ zˆ + µk∗ (28.96) 20 We then use the estimate x = [x1, xˆ2], so the observed pixels are not changed. This is an example of 21 image imputation, and is illustrated in Figure 28.14. Note that we can condition on an arbitrary 22 subset of pixels, and fill in the rest, whereas some other models (e.g., autoregressive models) can only 23 predict the bottom right given the the

Repeated token 'magnetic' at: aids infection viruses bacteria bacterial host resistance parasite gene disease mutations families mutation patients disease treatment drugs clinical cells proteins researchers protein found neurons brain memory subjects left stimulus motor visual cortical synapses ltp task surface glutamate proteins protein binding domain domains rna dna rna polymerase cleavage site computer problem information computers problems tip image sample device laser materials organic polymer polymers molecules synaptic neurons physicists particles enzyme enzymes sequence sequences genome dna sequencing surface liquid surfaces fluid model optical light electrons quantum reaction physics particle experiment stars astronomers iron active site reduction plants plant gene genes arabidopsis magnetic magnetic

Repeated token 'carbon' at: surface liquid surfaces fluid model optical light electrons quantum reaction physics particle experiment stars astronomers iron active site reduction plants plant gene genes arabidopsis magnetic magnetic field spin superconductivity superconducting reactions molecule molecules transition state pressure high pressure mantle crust universe galaxies galaxy sun pressures upper mantle solar wind fossil record core meteorites earth development embryos drosophila genes expression genetic population populations differences variation birds fossils dinosaurs fossil inner core species forest forests populations ecosystems ancient found impact million years ago africa volcanic deposits magma eruption ratios planets planet earthquake earthquakes fault images data climate ocean ice co2 carbon carbon

Repeated token 'represent' at: are the first t − 1 examples. (For brevity, we drop the conditioning on σ2.) We see that the previous posterior, p(w|D1:t−1), becomes the 9 current prior, which gets updated by Dt to become the new posterior, p(w|D1:t). This is an example 10 of sequential Bayesian updating or online Bayesian inference. In the case of linear regression, this 11 process is known as the recursive least squares or RLS algorithm. 12 We can implement this method by using a linear Gaussian state-space model(Section 29.6). The 13 14 basic idea is observation to let the model Ht hidden state represent represent

Repeated token 'the' at: this model by using EM, where in the E step we approximate p(yt|zt) using a Laplace 28 approximation, after which we can use the Kalman smoother to compute p(z1:T |y1:T ). In the M step, 29 we optimize the expected complete data log likelihood, similar to Section 29.8.1. We show the result 30 in Figure 29.31, where we compare the parameters A and the posterior trajectory E [zt|y1:T ] using 31 the true model and the estimated model. We see good agreement. 32 33 29.11.2 Example: stochastic volatility models 34 In 35 finance, it is common to model the the

Repeated token 'who' at: [SSF16], they tackle this using a log-concave multi-stage likelihood, 35 in which yt = 0 is emitted with probability σ(dt0); otherwise yt = 1 is emitted with probability σ(d1t ); 36 otherwise yt = is emitted with probability Poi(dt2). This generalizes the scheme in [SOB12]. 37 38 29.12.5.4 Example: hierarchical SSM for electoral panel data 39 40 41 42 Suppose we perform a survey vote at time t in state j, and Ntj − ytj vote Republican.) It for let the US presidential elections. Let ytj be the number of those people Ntj be the number of people who who

Repeated token 'have' at: representative examples [Yeh+18]. These examples might be 13 chosen to be somehow characteristic of, or providing coverage of, a class (e.g., [AA18]), to draw 14 attention to decision boundaries (e.g., [Zhu+18]), or to identify inputs particularly influential in 15 training the model. 16 Unless a model is inherently interpretable, it is still important to remember that a global explanation 17 is still a partial view. To make a complex model accessible to the user, the global explanation will 18 need to leave some things out. 19 Decision: determining what properties the context needs. Different forms of explanations 20 have have

Repeated token 'the' at: be the agent’s policy at time t. Then the per-step regret at t is defined as 5 6 lt Ep(st) [R(st, π∗(st))] − Eπt(at|st)p(st) [R(st, at)] (34.58) 7 If we only care about the final performance of the best discovered arm, as in most optimization 8 problems, it is enough to look at the simple regret at the last step, namely lT . Optimizing simple 9 regret results in a problem known as pure exploration [BMS11], since there is no need to exploit 10 the information during the learning process. However, it is more common to focus on the the

Repeated token 'these' at: this case, the policy is optimal. policy πk is Since there greedy with are at most respect to its own value function |A||S| deterministic policies, and every iteration strictly improves the policy, the algorithm must converge after finite iterations. 23 In PI, we alternate between policy evaluation (which involves multiple iterations, until convergence 24 of 25 Vπ ), and policy improvement. In VI, we alternate between one iteration of policy evaluation followed by one iteration of policy improvement (the “max” operator in the update rule). In generalized 26 policy improvement, we are free to intermix any number of these these

Repeated token 'the' at: (1 − γ) normalizes d to be a valid distribution, so that it sums to unity. With this 14 interpretation of d, the objective in Equation (34.81) is just the average per-step reward under the 15 normalized occupancy distribution. Once an optimal solution d∗ is found, an optimal policy can be 16 immediately obtained by π∗(a|s) = d∗(s, a)/ a d∗(s, a ). 17 A challenge in solving the primal or dual LPs for MDPs is the large number of constraints and 18 variables. Approximations are needed, where the variables are parameterized (either linearly or 19 nonlinearly), and the the

murphyk commented 1 year ago

thanks! How did you find these errors? DId you run OCR on the PDF? It seems there are some errors (see below). Also it's interesting that I am not the only author to make this common mistake (several invited chapters have the same stuttering error - seems to be hard to notice....).

Repeated token 'loge' at:
p(z) − log q(z|x)] (20.4) (20.5) 37 Thus if we model the density of z ∼ q(z|x), which is a dequantized version of x, we will get a lower 38 bound on p(x). 39 40 41 20.4.1.2 Challenges with using the likelihood 42 Unfortunately, there are several challenges with using the likelihood to evaluate generative models, 43 some of which we discuss below. 44 45 4. In some applications, we report bits per dimension, which is the NLL using log base 2, divided by the dimensionality 46 of x. (To compute this metric, recall that log2 L = loge loge
gdemelo commented 1 year ago

It happens to me as well, which is why I try to check for this. Tools such as pdftotext from the Poppler utilities can extract the embedded plaintext from the PDF without OCR. Sorry for any errors in the process – there are indeed some false positives and there may also have been a few false negatives. Thanks for all this impressive work – looking forward to the Dagstuhl workshop!