neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License
319 stars 46 forks

Wrong mapping with non-matching sentences #9

Open mzeidhassan opened 3 years ago

mzeidhassan commented 3 years ago

Hi awesome-align team,

First, thanks for the great tool. It has really great potential.

I am following your Colab demo, and I tried to align English to Arabic. Here are the 2 sentences:

src = 'I will meet you there. It is a very cool weather today.'
tgt = 'سوف أقابلك هناك.'

The Arabic sentence matches the first English sentence in src, i.e. "I will meet you there". The second sentence in src "It is a very cool weather today." doesn't exist in Arabic.

When I run the code, I get a very strange result, and I am not sure where the culprit is.

This is what I get:

I===سوف
I===أقابلك
will===أقابلك
meet===أقابلك
you===أقابلك
there.===هناك.
today.===هناك.

For some reason, most of the second English sentence is not showing up, and there are now wrong mappings: for example, "today." is wrongly mapped to the Arabic word for "there.".

If I remove the second sentence in src, the result looks really good.

I want to use Awesome-Align to detect non-matching strings in a bilingual dataset, so I can exclude the wrong and non-aligned sentences.

Is there a way to add alignment scores, so that it is easy to filter out badly aligned sentences?

Also, is there a way to visualize the mapping? Something similar to SimAlign mapping.

After all, it could be that Awesome-Align is not designed for my purpose, but I hope you consider this idea in a future release.

Thanks in advance for your support, and thanks for the awesome tool :-)

zdou0830 commented 3 years ago

Hi,

Thanks for the interest!

In the demo, we only print the aligned word pairs, so most words in the second English sentence are not showing up because our model does not find any corresponding target words for them (which is a good thing). "I" appears twice because the model thinks it is aligned to two words in the target sentence, and "today." is mapped to "هناك." ("there.") because both tokens contain a "." (remember that the inputs should be tokenized).
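To illustrate the tokenization point, here is a minimal sketch that separates punctuation from words before the sentences are passed to the aligner. The helper name `simple_tokenize` and the regex are assumptions for illustration, not part of awesome-align; a proper language-specific tokenizer would likely handle Arabic diacritics and clitics better.

```python
import re


def simple_tokenize(text: str) -> str:
    """Crude whitespace/punctuation tokenizer (hypothetical helper).

    Returns the tokens joined by single spaces, so that "there." becomes
    "there ." and the full stop is a token of its own.
    """
    # \w+ matches runs of letters/digits (Unicode-aware in Python 3);
    # [^\w\s] matches each remaining non-space symbol individually.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tokens)


src = simple_tokenize("I will meet you there. It is a very cool weather today.")
tgt = simple_tokenize("سوف أقابلك هناك.")
print(src)
print(tgt)
```

With the full stops split off, "there" and "today" no longer share a character with the Arabic token, so the aligner has one less spurious cue to latch onto.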

I am not sure if awesome-align can be used for filtering non-parallel sentences. I guess one thing you can do is score each sentence pair by computing the number of extracted word pairs divided by the sentence length, then filter out sentence pairs whose scores are below a pre-defined threshold. Also, awesome-align does have a parallel sentence identification objective, and our models can be trained to detect non-parallel sentences. However, I haven't tested it in this scenario, and I don't know whether these strategies would be better than, or complementary to, existing techniques like dual conditional cross-entropy filtering.
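The scoring idea above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the choice to normalise by the longer of the two sentences is an assumption (the comment does not say which length to use), and the threshold has to be tuned on your own data.

```python
def pair_count_score(pairs, src_len, tgt_len):
    """Number of extracted word pairs divided by sentence length.

    pairs: iterable of (src_index, tgt_index) tuples from the aligner.
    Assumption: we divide by the longer sentence, so extra unaligned
    words on either side pull the score down.
    """
    longest = max(src_len, tgt_len)
    if longest == 0:
        return 0.0
    return len(list(pairs)) / longest


def filter_pairs(scored_examples, threshold):
    """Keep only the examples whose score reaches the threshold.

    scored_examples: iterable of (example, score) tuples.
    threshold: user-chosen cut-off (no recommended value is implied).
    """
    return [ex for ex, score in scored_examples if score >= threshold]


# The seven word pairs reported in this thread, as 0-based indices over
# the whitespace-split sentences (12 source tokens, 3 target tokens).
pairs = [(0, 0), (0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (11, 2)]
score = pair_count_score(pairs, src_len=12, tgt_len=3)
print(round(score, 3))
```

Note that one source word aligned to several target words counts several times here, so the score can exceed 1 for well-matched pairs.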

I did test a few ways to generate alignment scores, but it is still worth investigating whether the scores make sense or are well-calibrated.

The demo provides a (poor) visualization for the mappings and I will try to make it nicer :)

jinyiyang-jhu commented 3 years ago

I have a similar question: what if a src token is not aligned to any target token (or a tgt token is not aligned to any src token)? If so, how should we preprocess the gold alignment, and will such tokens be printed in the hypothesis? How will the AER be calculated? In the "example/" directory, both the reference and the hypothesis are i-j pairs, so I'm not sure how to process something like -j or i-; the aer.py script won't work with that.

zdou0830 commented 3 years ago

Hi @jinyiyang-jhu, the reference/outputs only contain aligned word pairs. If the i-th source word is not aligned to any target words, there would be no i-* in the reference/outputs.
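As a rough illustration of why unaligned words need no special entry, here is a minimal AER sketch over lines of `i-j` pairs. It is not the repository's aer.py: the function names are hypothetical, and it assumes every gold link is a sure link (so the sure and possible sets coincide), which reduces AER to 1 - 2|A ∩ S| / (|A| + |S|).

```python
def parse_alignment(line):
    """Turn a line such as "0-0 1-1" into a set of (i, j) index pairs."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def alignment_error_rate(gold_lines, hyp_lines):
    """AER over a corpus, assuming all gold links are sure links."""
    n_hyp = n_gold = n_common = 0
    for gold_line, hyp_line in zip(gold_lines, hyp_lines):
        gold = parse_alignment(gold_line)
        hyp = parse_alignment(hyp_line)
        n_hyp += len(hyp)
        n_gold += len(gold)
        n_common += len(gold & hyp)
    if n_hyp + n_gold == 0:
        return 0.0
    return 1.0 - 2.0 * n_common / (n_hyp + n_gold)


# Source words 5..10 are unaligned, so no "5-*" ... "10-*" pair appears
# on either side; they never need an "i-" style entry.
gold = ["0-0 1-1 2-1 3-1 4-2"]
hyp = ["0-0 0-1 1-1 2-1 3-1 4-2 11-2"]
print(round(alignment_error_rate(gold, hyp), 4))
```

An unaligned token contributes nothing to any of the three counts, which is why it can simply be left out of both files.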

juncaofish commented 2 years ago

I encapsulated the alignment score calculation in a separate method that uses the simple harmonic mean of the aligned-token rates on both sides. Comments on the implementation are welcome. @mzeidhassan https://github.com/juncaofish/awesome-align/blob/master/awesome_align/aligner.py#L125
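For readers who want a feel for this kind of score without opening the link, here is a minimal sketch of one way to read "harmonic mean of the aligned-token rates on both sides". It is an independent illustration, not the linked implementation; the function name and the zero-length handling are assumptions.

```python
def harmonic_alignment_score(pairs, src_len, tgt_len):
    """Harmonic mean of the aligned-token rates of source and target.

    pairs: iterable of (src_index, tgt_index) tuples.
    A token counts as aligned if it occurs in at least one pair.
    """
    if src_len == 0 or tgt_len == 0:
        return 0.0
    pairs = list(pairs)
    src_rate = len({i for i, _ in pairs}) / src_len
    tgt_rate = len({j for _, j in pairs}) / tgt_len
    if src_rate + tgt_rate == 0:
        return 0.0
    return 2 * src_rate * tgt_rate / (src_rate + tgt_rate)


# Word pairs from the first post: 6 of 12 source tokens and 3 of 3
# target tokens are aligned.
pairs = [(0, 0), (0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (11, 2)]
print(round(harmonic_alignment_score(pairs, src_len=12, tgt_len=3), 4))
```

Because the harmonic mean is dominated by the smaller rate, a sentence pair where one side is largely unaligned scores low even if the other side is fully covered.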

mzeidhassan commented 2 years ago

Thanks a million @juncaofish ! I will give it a try when I have a chance.