
Multi-Modal VAE (THIS IS AN OLD VERSION)

A very simple experiment with the 3D-Face dataset

1) Setup (brief)

2) Models (Competing)

2-a) MuMo-VAE model

2-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

2-c) MMPOE-VAE v1: induce q(zI, zT, zS | xI, xT) from Product-of-Experts

2-d) MMPOE-VAE v2: induce q(zI, zT, zS | xI, xT) from Product-of-Experts

2-e) WG-VAE v1: no private variables; induce q(z | xI, xT) from Product-of-Experts

2-f) WG-VAE v2: no private variables; induce q(z | xI, xT) from Product-of-Experts
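
For reference, the Product-of-Experts posteriors in 2-c) through 2-f) combine the per-modality diagonal-Gaussian encoders by precision weighting (optionally with the N(0,I) prior acting as an extra expert). Below is a minimal PyTorch-style sketch; the function name and tensor shapes are illustrative and not this repository's actual API.

```python
import torch

def poe_gaussian(mus, logvars):
    # mus, logvars: (num_experts, batch, z_dim) diagonal-Gaussian parameters,
    # e.g. one expert per modality plus the N(0, I) prior (mu=0, logvar=0).
    precisions = torch.exp(-logvars)                   # 1 / sigma^2 per expert
    poe_var = 1.0 / precisions.sum(dim=0)              # combined variance
    poe_mu = poe_var * (mus * precisions).sum(dim=0)   # precision-weighted mean
    return poe_mu, torch.log(poe_var)                  # mean, log-variance of the product
```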


+ Latent traversal: encode (xI,xT) -> z or (zI,zS,zT), then traverse along each latent axis and decode -> (xI',xT') (a minimal sketch is given below)

(at iter 300K)
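
The traversal procedure can be sketched as follows; the encoder/decoder interfaces are assumed for illustration only (for the models with (zI,zS,zT), think of the flat z below as the concatenation of the three parts).

```python
import torch

@torch.no_grad()
def latent_traversal(encoder, decoder_I, decoder_T, xI, xT,
                     dims=range(10), values=torch.linspace(-3, 3, 7)):
    # Assumed interfaces: encoder(xI, xT) -> (mu, logvar); decoders map z -> image.
    mu, _ = encoder(xI, xT)              # use the posterior mean as the base code
    rows_I, rows_T = [], []
    for d in dims:                       # one row of the traversal grid per latent axis
        imgs_I, imgs_T = [], []
        for v in values:
            z = mu.clone()
            z[:, d] = v                  # change only dimension d, keep the rest fixed
            imgs_I.append(decoder_I(z))
            imgs_T.append(decoder_T(z))
        rows_I.append(torch.cat(imgs_I, dim=-1))
        rows_T.append(torch.cat(imgs_T, dim=-1))
    return rows_I, rows_T
```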

Trv-a) MuMo-VAE model

3 instances, each:
True xI | xI w/ zI(1) change | xI w/ zI(2) | xI w/ zS(1) | xI w/ zS(2) | ... | xI w/ zT(1) | xI w/ zT(2)
True xT | xT w/ zI(1) change | xT w/ zI(2) | xT w/ zS(1) | xT w/ zS(2) | ... | xT w/ zT(1) | xT w/ zT(2)

fixed3
fixed2
fixed1

(note: the private and shared factors are identified quite accurately, but there is a computational issue in having a dyadic inference network)

Trv-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

3 instances, each:
True xI | xI w/ z(1) change | xI w/ z(2) | ... | xI w/ z(10)
True xT | xT w/ z(1) change | xT w/ z(2) | ... | xT w/ z(10)

fixed3
fixed2
fixed1

(note: varying z(4) or z(7), neither of which is an explicitly shared factor, changes both xI and xT)

Trv-c) MMPOE-VAE v1

3 instances, each:
True xI | xI w/ zI(1) change | xI w/ zI(2) | xI w/ zS(1) | xI w/ zS(2) | ... | xI w/ zT(1) | xI w/ zT(2)
True xT | xT w/ zI(1) change | xT w/ zI(2) | xT w/ zS(1) | xT w/ zS(2) | ... | xT w/ zT(1) | xT w/ zT(2)

fixed3 fixed2 fixed1

(note: problematic! e.g., zS(1) learns the elevation factor, which should instead be a private factor in zI)

Trv-d) MMPOE-VAE v2

3 instances, each:
True xI | xI w/ zI(1) change | xI w/ zI(2) | xI w/ zS(1) | xI w/ zS(2) | ... | xI w/ zT(1) | xI w/ zT(2)
True xT | xT w/ zI(1) change | xT w/ zI(2) | xT w/ zS(1) | xT w/ zS(2) | ... | xT w/ zT(1) | xT w/ zT(2)

fixed3 fixed2 fixed1

(note: the private and shared factors are identified/discerned better, which suggests that the loss terms for the marginal data, i.e., {xI} and {xT}, are necessary?)

Trv-e) WG-VAE v1

3 instances, each:
True xI | xI w/ z(1) change | xI w/ z(2) | ... | xI w/ z(10)
True xT | xT w/ z(1) change | xT w/ z(2) | ... | xT w/ z(10)

fixed3 fixed2 fixed1

Trv-f) WG-VAE v2

3 instances, each:
True xI | xI w/ z(1) change | xI w/ z(2) | ... | xI w/ z(10)
True xT | xT w/ z(1) change | xT w/ z(2) | ... | xT w/ z(10)

fixed3 fixed2 fixed1


+ Pure synthesis: sample z or (zI,zS,zT) ~ N(0,I) and decode -> (xI,xT) (a minimal sketch is given below)

(at iter 300K)
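
A minimal sketch of pure synthesis; the decoder interfaces and the total latent dimensionality are assumptions for illustration (for the partitioned models, the decoders would slice out their own (zI,zS) or (zS,zT) parts).

```python
import torch

@torch.no_grad()
def pure_synthesis(decoder_I, decoder_T, num_samples=64, z_dim=10):
    z = torch.randn(num_samples, z_dim)   # z ~ N(0, I), the prior
    return decoder_I(z), decoder_T(z)     # decode both modalities from the same z
```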

PureSynth-a) MuMo-VAE model

[xI, xT]
synth_pure_300000

PureSynth-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

[xI, xT]
synth_300000

PureSynth-c) MMPOE-VAE v1

[xI, xT]
synth_pure_300000

(note: the quality of the generated images is not satisfactory, especially when compared to the v2 model below)

PureSynth-d) MMPOE-VAE v2

[xI, xT]
synth_pure_300000

PureSynth-e) WG-VAE v1

[xI, xT]
synth_pure_300000

PureSynth-f) WG-VAE v2

[xI, xT]
synth_pure_300000


+ Cross-modal synthesis: given xI, infer zS from xI, sample zT ~ N(0,I), and decode -> xT (and likewise with the roles of I and T swapped; a minimal sketch is given below)

(at iter 300K)
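
A minimal sketch of the I -> T direction (T -> I is symmetric); the encoder/decoder interfaces and the size of zT are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def cross_modal_I2T(encoder_I, decoder_T, xI, zT_dim=2, num_samples=3):
    # Assumed interface: encoder_I(xI) -> (zI_mu, zS_mu); decoder_T maps (zS, zT) -> xT.
    _, zS_mu = encoder_I(xI)                     # keep the shared factors inferred from xI
    samples = []
    for _ in range(num_samples):
        zT = torch.randn(xI.size(0), zT_dim)     # sample the T-private factors from N(0, I)
        samples.append(decoder_T(torch.cat([zS_mu, zT], dim=1)))
    return samples
```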

CMSynth-a) MuMo-VAE model

XI -> XT [XI | three randomly synthesized XT images] synth_cross_modal_I2T_300000

(note: in the synthesized XT images, illumination (private-T) can vary, but elevation (private-I) should be neutral, and (azimuth, id) should be identical to those of XI)

XT -> XI [XT | three randomly synthesized XI images] synth_cross_modal_T2I_300000

(note: in the synthesized XI images, elevation (private-I) can vary, but illumination (private-T) should be neutral, and (azimuth, id) should be identical to those of XT)

CMSynth-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

N/A: this model treats (xI,xT) as a single concatenated observation, so it cannot condition on one modality alone.

CMSynth-c) MMPOE-VAE v1

XI -> XT [XI | three randomly synthesized XT images] synth_cross_modal_I2T_300000

XT -> XI [XT | three randomly synthesized XI images] synth_cross_modal_T2I_300000

(note: again, v1 suffers from poor quality of the synthesized images; it appears necessary to take the marginal data {xI} and {xT} into account during training)

CMSynth-d) MMPOE-VAE v2

XI -> XT [XI | three randomly synthesized XT images] synth_cross_modal_I2T_300000

XT -> XI [XT | three randomly synthesized XI images] synth_cross_modal_T2I_300000

CMSynth-e) WG-VAE v1

XI -> XT [XI | a synthesized XT image] synth_cross_modal_I2T_300000

XT -> XI [XT | a synthesized XI image] synth_cross_modal_T2I_300000

CMSynth-f) WG-VAE v2

XI -> XT [XI | a synthesized XT image] synth_cross_modal_I2T_300000

XT -> XI [XT | a synthesized XI image] synth_cross_modal_T2I_300000