lilt / alignment-scripts

Scripts to preprocess training and test data and to run fast_align and giza
MIT License

Training Details about "Alignment Layer" and "Alignment Optimization" #6

Closed Jason-Young-AI closed 3 years ago

Jason-Young-AI commented 3 years ago

Hi,

I have read your papers "End-to-End Neural Word Alignment Outperforms GIZA++" and "Adding Interpretable Attention to Neural Translation Models Improves Word Alignment". The methods in your work are very interesting and the experimental results are very good.

I am currently doing research on machine translation, and my work needs good soft alignments, so I want to implement the "alignment layer" and "alignment optimization" to obtain soft alignments for each sentence pair.
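For concreteness, the soft alignments in question are attention distributions over source tokens: for every target position, the attention probabilities over the source positions form one row of a soft alignment matrix. A minimal, hypothetical sketch of that computation (the dimension 256 follows Table 5 of the paper; the random inputs and helper names are illustrative only, not the authors' code):

```python
# Hypothetical sketch: soft alignments as scaled dot-product
# attention probabilities over source tokens. Not the authors'
# implementation; random vectors stand in for real hidden states.
import math
import random

D_MODEL = 256  # embedding size from appendix A, Table 5

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def soft_alignments(tgt_states, src_states):
    """For each target position, a distribution over source positions.

    tgt_states: list of target-side hidden vectors (len = tgt_len)
    src_states: list of source-side hidden vectors (len = src_len)
    Returns a tgt_len x src_len matrix whose rows each sum to 1.
    """
    scale = 1.0 / math.sqrt(D_MODEL)
    align = []
    for q in tgt_states:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in src_states]
        align.append(softmax(scores))
    return align

random.seed(0)
tgt = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]
src = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(5)]
A = soft_alignments(tgt, src)  # 3 x 5 soft alignment matrix
```

Each row of `A` can be used directly as a soft alignment, or thresholded/argmaxed to extract hard alignment links.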

Since I can't find the open source code, I would like to confirm the following questions with you — or, if it is convenient for you, could you provide the source code?

  1. In appendix A, Table 5 shows that the embedding size is 256 and the hidden units are 512. Is there a linear layer (256×512) that converts the input embeddings (256) into hidden states (512), or do the hidden units just refer to the feedforward dimension? What exactly does "Hidden Units" mean?
  2. I only know that you trained the translation model for 90k updates. What optimizer and learning-rate scheduler configurations did you use when training the translation model? Are they the same as for the transformer base model?
  3. After training the translation model, did you reset the optimizer and learning-rate scheduler to train the alignment layer? Does it still use Adam, the same as for the translation model?
thomasZen commented 3 years ago

Thank you for your questions.

  1. In our implementation, "hidden units" refers to the feedforward dimension.
  2. Yes, the same configuration as people use for the transformer base model.
  3. Yes, I reset both to train the alignment layer and still use Adam.
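For reference on answer 2, the "transformer base" configuration usually means Adam with betas (0.9, 0.98) and eps 1e-9, plus an inverse-square-root learning-rate schedule with linear warmup. A hedged sketch of that schedule (the constants below are the common transformer base defaults, not values taken from this repository; note the paper's own model uses a 256-dim embedding, so d_model there would be 256):

```python
# Sketch of the standard "transformer base" learning-rate schedule:
# linear warmup followed by inverse-square-root decay. Constants are
# the usual defaults (assumed, not read from the repo).
D_MODEL = 512   # transformer base model dimension
WARMUP = 4000   # typical number of warmup steps

def transformer_lr(step):
    """Learning rate at a given (1-indexed) optimizer step."""
    return D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP ** -1.5)

peak = transformer_lr(WARMUP)  # maximum lr, reached at end of warmup
```

Resetting the optimizer for the alignment layer (answer 3) then simply means creating a fresh Adam instance and restarting this schedule from step 1.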

These parameters should not matter a great deal in my experience.

> Since I can't find the open source code, I would like to confirm the following questions with you — or, if it is convenient for you, could you provide the source code?

I'm working on an open source implementation; I hope to finish it within the first quarter of next year.

Jason-Young-AI commented 3 years ago

Thank you for your detailed reply! I'll reopen the issue if I have new problems.

alvations commented 2 months ago

@thomasZen are there still plans to release the open source implementation of the Bidir. Att. Opt. approach in the paper? https://arxiv.org/pdf/2004.14675

thomasZen commented 1 month ago

Hi @alvations, no, I don't currently have plans to implement that, sorry. If you are planning to add it to an open source repository, or to implement it from scratch, I'm happy to help with any questions and also to review some code.