sign-language-processing / sign-language-processing.github.io

Documentation and background of sign language processing

adding "Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment" #37

Closed: cleong110 closed this 1 month ago

cleong110 commented 1 month ago

Making a PR to add http://arxiv.org/abs/2312.15645, one of the recent approaches (with code!) listed on https://github.com/ZechengLi19/Awesome-Sign-Language.

I only intended to include f435dcadd968e357b75425fec5902a57dfb96180 in this PR. Unfortunately, I made the previous PR from my fork's master branch, so this PR now includes those commits as well. In the future, making a new branch on my fork for each PR should prevent this. And I believe once the previous PR is merged, this one should be easily mergeable too.

cleong110 commented 1 month ago

Also: a lot of the statistical/mathematical language in this paper was difficult for me, so the potential for misunderstandings is higher.

AmitMY commented 1 month ago

Please pull and merge master into this branch. That should remove any changes from other PRs from the changelist. If not, please create a separate PR with only the relevant changes. (You should always branch out from master.)

cleong110 commented 1 month ago

> Please pull and merge master into this branch. That should remove any changes from other PRs from the changelist. If not, please create a separate PR with only the relevant changes. (You should always branch out from master.)

Can do, I'll give it a try

cleong110 commented 1 month ago

That did it!

cleong110 commented 1 month ago

Thanks for all the helpful suggestions! Working on rewrites.

Re: "please pull first", I think it is in the correct state now, cleong110:paper/CV-SLT now shows as being synced up with sign-language-processing:master, is there more to do on that front?

cleong110 commented 1 month ago

Asked GPT-4o to help summarize the PDF of the paper for me, as advised. The conversation and prompt are at https://chat.openai.com/share/c720d444-a505-491e-b131-78e03e10e700. It actually did quite well, I think!

The explanation aligns with my understanding and helps clarify things. It also seems to line up with Figure 2 of the paper.

OK, so it seems the encoder is supposed to generate distributions of possible embeddings, and the decoder is supposed to generate distributions of possible text translations. And the KL divergences are there to pull the with-text and without-text distributions closer together.
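
As a side note (my own addition, not from the paper): if those distributions are Gaussian, which is what I assume the Gaussian Network outputs, the KL divergence has a closed form that is zero exactly when the means and variances match, which is why minimizing it pulls the two distributions together:

$$D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$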

So I guess the training process goes something like this:

  1. Give the model visual input only.
  2. The encoder (some kind of vision transformer) takes that visual input and does the usual encoding, producing hidden states/embeddings. Those are the gray boxes in Figure 2.
  3. The encoder then passes those on to a Gaussian Network (I don't know the architecture of this one), and the GN generates a distribution over embeddings, the green boxes in Figure 2.
  4. The decoder takes the hidden states and the distributions from the encoder and outputs a distribution over possible translations, the orange boxes in Figure 2.
  5. Then you do that whole thing (steps 2 through 4) again, but this time the model gets to include both the input video and the input text. Again you generate distributions for the embeddings and distributions for the text translations.
  6. Now you've got your with-text and without-text distributions. KL divergences let you calculate how "far" apart those distributions are. You want to minimize these numbers. So, presumably, you use the KL divergences as loss terms for backprop? Yeah, looks like that's correct, judging by the losses in the paper (see the sketch after this list).
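
To check my understanding of the two-path setup, here is a minimal PyTorch-style sketch of that training step. All module names, shapes, and the way the text gets mixed in are my own guesses for illustration, not the authors' actual implementation:

```python
# Minimal sketch of the two-path training step as I understand it.
# Module names, shapes, and the text-fusion step are guesses, not the paper's code.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianNetwork(nn.Module):
    """Maps hidden states to the parameters of a Gaussian over latent variables."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)

    def forward(self, hidden):
        return Normal(self.mu(hidden), torch.exp(0.5 * self.log_var(hidden)))

dim = 16
visual_encoder = nn.Linear(dim, dim)   # stand-in for the visual encoder (step 2)
text_encoder = nn.Linear(dim, dim)     # stand-in for the text encoder (step 5)
prior_net = GaussianNetwork(dim)       # video-only path
posterior_net = GaussianNetwork(dim)   # video + text path

video_feats = torch.randn(2, dim)      # dummy batch of video features
text_feats = torch.randn(2, dim)       # dummy batch of text features

hidden = visual_encoder(video_feats)                                 # gray boxes
prior_dist = prior_net(hidden)                                       # green boxes, video only
posterior_dist = posterior_net(hidden + text_encoder(text_feats))    # video + text

# Step 6: the KL divergence measures how far the video-only latent distribution
# is from the video+text one; minimizing it pulls the two paths together.
kl_loss = kl_divergence(posterior_dist, prior_dist).mean()
kl_loss.backward()
print(kl_loss.item())
```

This only sketches the encoder-side consistency term; the decoder-side KL and the reconstruction loss discussed below would come on top of it.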

So you've got three loss terms, which presumably get summed into one overall objective (written out after the list below): one overall reconstruction loss, one for the encoder to help it align the with-text and without-text distributions, and one for the decoder to help with consistency of outputs.

  1. $L_{CVAE}$ has a KL term in it; it is presumably the one that works on consistency of embeddings/latent variables, as it deals with things like $q(z|x,y)$, which I read as "the posterior distribution of latent variables given we have videos ($x$) and text ($y$)". So this would be the "make the ENCODER consistent with and without text" loss. This loss going down means that the "no text" encoder is generating latent variables more similar to the "with text" encoder's.
  2. $L_{SD}$ would then be the "keep the DECODER consistent with and without text" loss; there's a KL divergence in there too. This loss going down means that the full "no text" path is generating outputs more similar to the "with text" path.
  3. $L_{AEP}$, then, has to be the overall reconstruction loss for the prior path. This loss going down means that the prior path (without text) is getting better at generating the right translations.
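
If I'm reading this right, the overall training objective would just combine the three terms. An equal-weight sum is my assumption here; the paper may weight them differently:

$$L = L_{CVAE} + L_{AEP} + L_{SD}$$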

So over time, these three losses go down. The encoder gets more consistent at encoding/embedding to the same thing internally with or without text. The decoder, given those embeddings, gets more consistent about generating the same output either way. And the overall prior path gets better at producing the right text.

I think I'm ready to rewrite it now.

cleong110 commented 1 month ago

Wrote a new summary myself, then asked ChatGPT 3.5 to rewrite for conciseness.

A few minor edits to ChatGPT's version for style-guide purposes give us:

@zhaoConditionalVariationalAutoencoder2023 introduces CV-SLT, employing conditional variational autoencoders to address the modality gap between video and text. They assess the disparity using RWTH-PHOENIX-Weather-2014T data, correlating similar embeddings with improved BLEU scores. Their approach involves guiding the model to encode visual and textual data similarly through two paths: one with visual data alone and one with both modalities. Using KL divergences, they steer the model towards generating consistent embeddings and accurate outputs regardless of the path. Once the model achieves consistent performance across paths, it can be utilized for translation without gloss supervision. Evaluation on RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:huang2018video] datasets demonstrates its efficacy. They provide a code implementation based largely on @chenSimpleMultiModalityTransfer2022a.

Which I think ought to take care of all the issues.

AmitMY commented 1 month ago

Sorry, yes, I meant ask ChatGPT to help with your writing: you write a draft, and it gives tips to improve it.

cleong110 commented 1 month ago

How do you typically prompt ChatGPT for writing-improvement suggestions, btw? I just asked it to help me rewrite for conciseness, being sure to mention it was a summary of an academic paper. Wondering if you've got some particular prompting tricks that work well.