
Multi-Modal VAE (THIS IS AN OLD VERSION)

A very simple experiment with the 3D-Face dataset

1) Setup (brief)

2) Models (Competing)

2-a) MuMo-VAE model

2-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

2-c) MMPOE-VAE v1: induce q(zI, zT, zS | xI, xT) from Product-of-Experts

2-d) MMPOE-VAE v2: induce q(zI, zT, zS | xI, xT) from Product-of-Experts

2-e) WG-VAE v1: no private variables; induce q(z | xI, xT) from Product-of-Experts

2-f) WG-VAE v2: no private variables; induce q(z | xI, xT) from Product-of-Experts
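
For reference, the Product-of-Experts posteriors in 2-c) through 2-f) combine the per-modality diagonal-Gaussian encoders by precision weighting (optionally with the N(0,I) prior acting as an extra expert). Below is a minimal PyTorch-style sketch; the function name and tensor shapes are illustrative and not this repository's actual API.

```python
import torch

def poe_gaussian(mus, logvars):
    # mus, logvars: (num_experts, batch, z_dim) diagonal-Gaussian parameters,
    # e.g. one expert per modality plus the N(0, I) prior (mu=0, logvar=0).
    precisions = torch.exp(-logvars)                   # 1 / sigma^2 per expert
    poe_var = 1.0 / precisions.sum(dim=0)              # combined variance
    poe_mu = poe_var * (mus * precisions).sum(dim=0)   # precision-weighted mean
    return poe_mu, torch.log(poe_var)                  # mean, log-variance of the product
```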


+ Latent traversal: encode (xI,xT) -> z or (zI,zS,zT), then traverse along each latent axis and decode -> (xI',xT') (a minimal sketch is given below)

(at iter 300K)
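
The traversal procedure can be sketched as follows; the encoder/decoder interfaces are assumed for illustration only (for the models with (zI,zS,zT), think of the flat z below as the concatenation of the three parts).

```python
import torch

@torch.no_grad()
def latent_traversal(encoder, decoder_I, decoder_T, xI, xT,
                     dims=range(10), values=torch.linspace(-3, 3, 7)):
    # Assumed interfaces: encoder(xI, xT) -> (mu, logvar); decoders map z -> image.
    mu, _ = encoder(xI, xT)              # use the posterior mean as the base code
    rows_I, rows_T = [], []
    for d in dims:                       # one row of the traversal grid per latent axis
        imgs_I, imgs_T = [], []
        for v in values:
            z = mu.clone()
            z[:, d] = v                  # change only dimension d, keep the rest fixed
            imgs_I.append(decoder_I(z))
            imgs_T.append(decoder_T(z))
        rows_I.append(torch.cat(imgs_I, dim=-1))
        rows_T.append(torch.cat(imgs_T, dim=-1))
    return rows_I, rows_T
```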

Trv-a) MuMo-VAE model

3 instances, each:
True xI | xI w/ zI(1) change | xI w/ zI(2) | xI w/ zS(1) | xI w/ zS(2) | ... | xI w/ zT(1) | xI w/ zT(2)
True xT | xT w/ zI(1) change | xT w/ zI(2) | xT w/ zS(1) | xT w/ zS(2) | ... | xT w/ zT(1) | xT w/ zT(2)

fixed3
fixed2
fixed1

(note: the private and shared factors are identified quite accurately, but there is a computational issue in having a dyadic inference network)

Trv-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

3 instances, each:
True xI | xI w/ z(1) change | xI w/ z(2) | ... | xI w/ z(10)
True xT | xT w/ z(1) change | xT w/ z(2) | ... | xT w/ z(10)

fixed3
fixed2
fixed1

(note: varying z(4) or z(7), neither of which is an explicitly shared factor, changes both xI and xT)

Trv-c) MMPOE-VAE v1

3 instances, each:
True xI | xI w/ zI(1) change | xI w/ zI(2) | xI w/ zS(1) | xI w/ zS(2) | ... | xI w/ zT(1) | xI w/ zT(2)
True xT | xT w/ zI(1) change | xT w/ zI(2) | xT w/ zS(1) | xT w/ zS(2) | ... | xT w/ zT(1) | xT w/ zT(2)

fixed3 fixed2 fixed1

(note: problematic! e.g., zS(1) learns the elevation factor, which should instead be a private factor in zI)

Trv-d) MMPOE-VAE v2

3 instances, each:
True xI | xI w/ zI(1) change | xI w/ zI(2) | xI w/ zS(1) | xI w/ zS(2) | ... | xI w/ zT(1) | xI w/ zT(2)
True xT | xT w/ zI(1) change | xT w/ zI(2) | xT w/ zS(1) | xT w/ zS(2) | ... | xT w/ zT(1) | xT w/ zT(2)

fixed3 fixed2 fixed1

(note: the private and shared factors are identified/discerned better, which suggests that the loss terms for the marginal data, i.e., {xI} and {xT}, are necessary?)

Trv-e) WG-VAE v1

3 instances, each:
True xI | xI w/ z(1) change | xI w/ z(2) | ... | xI w/ z(10)
True xT | xT w/ z(1) change | xT w/ z(2) | ... | xT w/ z(10)

fixed3 fixed2 fixed1

Trv-f) WG-VAE v2

3 instances, each:
True xI | xI w/ z(1) change | xI w/ z(2) | ... | xI w/ z(10)
True xT | xT w/ z(1) change | xT w/ z(2) | ... | xT w/ z(10)

fixed3 fixed2 fixed1


+ Pure synthesis: sample z or (zI,zS,zT) ~ N(0,I) and decode -> (xI,xT) (a minimal sketch is given below)

(at iter 300K)
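
A minimal sketch of pure synthesis; the decoder interfaces and the total latent dimensionality are assumptions for illustration (for the partitioned models, the decoders would slice out their own (zI,zS) or (zS,zT) parts).

```python
import torch

@torch.no_grad()
def pure_synthesis(decoder_I, decoder_T, num_samples=64, z_dim=10):
    z = torch.randn(num_samples, z_dim)   # z ~ N(0, I), the prior
    return decoder_I(z), decoder_T(z)     # decode both modalities from the same z
```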

PureSynth-a) MuMo-VAE model

[xI, xT]
synth_pure_300000

PureSynth-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

[xI, xT]
synth_300000

PureSynth-c) MMPOE-VAE v1

[xI, xT]
synth_pure_300000

(note: the quality of the generated images is not satisfactory, especially when compared to the v2 model below)

PureSynth-d) MMPOE-VAE v2

[xI, xT]
synth_pure_300000

PureSynth-e) WG-VAE v1

[xI, xT]
synth_pure_300000

PureSynth-f) WG-VAE v2

[xI, xT]
synth_pure_300000


+ Cross-modal synthesis: given xI, infer zS from xI, sample zT ~ N(0,I), and decode -> xT (and likewise with the roles of I and T swapped; a minimal sketch is given below)

(at iter 300K)
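
A minimal sketch of the I -> T direction (T -> I is symmetric); the encoder/decoder interfaces and the size of zT are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def cross_modal_I2T(encoder_I, decoder_T, xI, zT_dim=2, num_samples=3):
    # Assumed interface: encoder_I(xI) -> (zI_mu, zS_mu); decoder_T maps (zS, zT) -> xT.
    _, zS_mu = encoder_I(xI)                     # keep the shared factors inferred from xI
    samples = []
    for _ in range(num_samples):
        zT = torch.randn(xI.size(0), zT_dim)     # sample the T-private factors from N(0, I)
        samples.append(decoder_T(torch.cat([zS_mu, zT], dim=1)))
    return samples
```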

CMSynth-a) MuMo-VAE model

XI -> XT [XI | three randomly synthesized XT images] synth_cross_modal_I2T_300000

(note: in the synthesized XT images, illumination (private-T) can vary, but elevation (private-I) should be neutral, and (azimuth, id) should be identical to those of XI)

XT -> XI [XT | three randomly synthesized XI images] synth_cross_modal_T2I_300000

(note: in the synthesized XI images, elevation (private-I) can vary, but illumination (private-T) should be neutral, and (azimuth, id) should be identical to those of XT)

CMSynth-b) Vanilla VAE treating (xI,xT) as a single (concatenated) observation

N/A: this model treats (xI,xT) as a single concatenated observation, so it cannot condition on one modality alone.

CMSynth-c) MMPOE-VAE v1

XI -> XT [XI | three randomly synthesized XT images] synth_cross_modal_I2T_300000

XT -> XI [XT | three randomly synthesized XI images] synth_cross_modal_T2I_300000

(note: again, v1 suffers from poor quality of the synthesized images; it appears necessary to take the marginal data {xI} and {xT} into account during training)

CMSynth-d) MMPOE-VAE v2

XI -> XT [XI | three randomly synthesized XT images] synth_cross_modal_I2T_300000

XT -> XI [XT | three randomly synthesized XI images] synth_cross_modal_T2I_300000

CMSynth-e) WG-VAE v1

XI -> XT [XI | a synthesized XT image] synth_cross_modal_I2T_300000

XT -> XI [XT | a synthesized XI image] synth_cross_modal_T2I_300000

CMSynth-f) WG-VAE v2

XI -> XT [XI | a synthesized XT image] synth_cross_modal_I2T_300000

XT -> XI [XT | a synthesized XI image] synth_cross_modal_T2I_300000