lilt / alignment-scripts

Scripts to preprocess training and test data and to run fast_align and giza
MIT License

*question* How to extract alignments from the alignment layer? #2

Closed XiaoqingNLP closed 5 years ago

XiaoqingNLP commented 5 years ago

Thank you for your great work, "Adding Interpretable Attention to Neural Translation Models Improves Word Alignment". In the article, the output of the alignment layer is a target word, so how do I get the alignment results? Are the alignments generated from the attention weights A?

thomasZen commented 5 years ago

You can extract the attention activations during the forward pass of the network. If you use only one head in the multi-head attention over the source representations, and your source and target sentences have srcLen and tgtLen (sub)words, you get a matrix of shape (srcLen, tgtLen). For each target (sub)word, you can find the source subword with the maximal attention value (e.g. using np.argmax), which gives you an alignment from each target subword to one source subword.
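The argmax step described above can be sketched as follows. This is only an illustration, not code from the paper or this repository: the function names and the toy attention matrix are made up, the matrix is assumed to be laid out as (srcLen, tgtLen), and the output is assumed to use `source-target` index pairs. Plain Python lists are used so the snippet has no dependencies; with NumPy the same selection would be `np.argmax(attention, axis=0)`.

```python
def extract_alignments(attention):
    """Return one (source_index, target_index) pair per target subword.

    `attention[i][j]` is assumed to hold the attention weight of source
    subword i for target subword j, i.e. the matrix has shape
    (srcLen, tgtLen).
    """
    src_len = len(attention)
    tgt_len = len(attention[0])
    pairs = []
    for j in range(tgt_len):
        # Pick the source position with the largest weight in column j
        # (ties resolve to the first maximum, as np.argmax does).
        best_i = max(range(src_len), key=lambda i: attention[i][j])
        pairs.append((best_i, j))
    return pairs


def format_pairs(pairs):
    """Format pairs as 'src-tgt' tokens separated by spaces (assumed layout)."""
    return " ".join(f"{s}-{t}" for s, t in pairs)


# Toy matrix (hypothetical values): 4 source subwords, 3 target subwords.
attention = [
    [0.7, 0.1, 0.0],
    [0.2, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.0, 0.1, 0.7],
]

alignment_line = format_pairs(extract_alignments(attention))
print(alignment_line)
```

Each target subword gets exactly one link this way, so a source subword may end up with zero, one, or several links.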

XiaoqingNLP commented 5 years ago

Thank you for your reply.

XiaoqingNLP commented 5 years ago

Can you give me some details on how to obtain the alignment file (like .talp) and compute the AER? Once I have an alignment from each target subword to one source subword, I try to convert the subword alignments to word alignments, but the result is incompatible: the subword alignment indices are greater than both srclen and tgtlen.

thomasZen commented 5 years ago

You can use sentencepiece_to_word_alignments.py to convert subword alignments to word alignments (if you don't use sentencepiece, you have to adapt this script). Once you have created the word alignments, you can use aer.py to calculate the Alignment Error Rate.
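For orientation, here is a minimal sketch of the standard Alignment Error Rate formula, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the hypothesis, S the sure gold links, and P the possible gold links (with S contained in P). It is not the repository's aer.py: the function name and the example links are invented, and the gold alignments are assumed to be available already as sets of index pairs.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Compute AER from sets of (source_index, target_index) links.

    `possible` is treated as a superset of `sure`; the union is taken
    here so callers may pass only the additional possible links.
    """
    possible = possible | sure
    denominator = len(hypothesis) + len(sure)
    if denominator == 0:
        # No links on either side: treat as a perfect (empty) match.
        return 0.0
    matched = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - matched / denominator


# Hypothetical example links for one sentence pair.
hypothesis = {(0, 0), (2, 1), (3, 2)}
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}

aer_value = alignment_error_rate(hypothesis, sure, possible)
print(f"AER = {aer_value:.2f}")
```

The sketch scores a single sentence pair. AER is usually reported over a whole test set by summing the counts across sentences before dividing, so averaging per-sentence values would generally give a different number.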

SkyAndCloud commented 4 years ago

Sorry, I'm new to Python. How do I adapt this script to convert subword alignments to word alignments? Can you give an example?

thomasZen commented 4 years ago

@SkyAndCloud The script sentencepiece_to_word_alignments.py works out of the box to convert alignments between sentencepiece units to word alignments. Sentencepiece units use a special token to represent a space. Here's an example:

hello.source: ▁hel lo ▁align ments !

hello.target: ▁hallo ▁ausrich tungen!

subword_alignments.talp: subword_alignments.talp

With these files you can run ./scripts/sentencepiece_to_word_alignments.py hello.source hello.target < subword_alignments.talp and you will get word alignments as an output: 0-0 2-1 4-2
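To illustrate the general idea behind such a conversion, the sketch below maps each subword index to the index of the word it belongs to, using the `▁` prefix as the word-start marker, and then rewrites every subword link as a word link. It is a simplified illustration under those assumptions, not the logic of sentencepiece_to_word_alignments.py, so its output may differ from the script's; the function names and the example sentences are made up.

```python
WORD_START = "\u2581"  # the '▁' symbol that sentencepiece puts before a new word


def subword_to_word_ids(tokens):
    """Map each subword position to the index of the word containing it."""
    word_ids = []
    current = -1
    for token in tokens:
        # A token starting with '▁' opens a new word; the first token
        # always opens one, even if the marker is missing.
        if token.startswith(WORD_START) or current < 0:
            current += 1
        word_ids.append(current)
    return word_ids


def to_word_alignments(src_tokens, tgt_tokens, subword_pairs):
    """Convert (src_subword, tgt_subword) links into sorted, unique word links."""
    src_ids = subword_to_word_ids(src_tokens)
    tgt_ids = subword_to_word_ids(tgt_tokens)
    return sorted({(src_ids[s], tgt_ids[t]) for s, t in subword_pairs})


# Hypothetical example: 3 source subwords (2 words), 4 target subwords (2 words).
src_tokens = ["▁the", "▁trans", "lation"]
tgt_tokens = ["▁die", "▁Über", "setz", "ung"]
subword_pairs = [(0, 0), (1, 1), (2, 2), (2, 3)]

word_pairs = to_word_alignments(src_tokens, tgt_tokens, subword_pairs)
print(" ".join(f"{s}-{t}" for s, t in word_pairs))
```

Because several subword links can collapse onto the same word pair, the word-level result is deduplicated, and its indices never exceed the number of words in either sentence. For a different subword scheme, only the word-start test in `subword_to_word_ids` would need to change.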

Let me know if that works for you and if it does not, please provide a minimal example that illustrates at which point you're stuck.