How to guide translation from the context of previous sentences.

ysig commented 4 years ago

Hi. I am new to marian-nmt, and as I am fascinated by its capabilities I was curious whether there is support for the following problem:

Let's say I want to build a translation system that translates documents.

In the language I want to use (Modern Greek), if you translate a document sentence by sentence, even a human may be unable to tell from a single sentence alone whether, for example, the word "drug" refers to pharmaceutical drugs or street drugs (which are translated differently in Greek: "φάρμακα" and "ναρκωτικά" respectively), or whether the word "them" in "for them" refers to people or to objects, which are also translated differently ("για αυτούς", "για αυτά").

So I was curious whether there is a way for the translation process, when translating a sentence in a document, to take into account the vector representations of the previous sentences that it inherently produces (say, when using GRUs), so that the translation of the current sentence becomes more accurate*.

In any case if you have anything to propose, I would really like to hear it.

Thanks in advance!! PS: Cheers for this amazing package :)

* I guess the problem there is how to handle a change of context.

emjotde commented 4 years ago

Hi, the problem here depends on how you have implemented your document-level system. With current Marian I would say there are two ways to achieve that out of the box, with little to no need to code anything:

1. Concatenating multiple parallel sentences into document-length sequences. Then you just pretend that your whole parallel document is a parallel sequence. This works pretty well; see our recent work on that topic: https://www.aclweb.org/anthology/W19-5321/
2. Using a multi-encoder architecture, where one encoder is used for the current sentence and one for the context. We used an architecture like that for automatic post-editing, but document-level context would work as well: https://www.aclweb.org/anthology/I17-1013/
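
For readers who want to try either layout, here is a minimal data-preparation sketch. It is only an illustration under stated assumptions, not Marian code: the `<SEP>` token, the whitespace token count, the `max_tokens` and `window` defaults, and both function names are hypothetical. The first function packs consecutive sentence pairs of one document into longer parallel sequences (the concatenation approach); the second produces a separate context stream and current-sentence stream, as a multi-encoder setup would consume.

```python
from typing import List, Tuple

SEP = "<SEP>"  # hypothetical separator token (assumption, not a Marian built-in)


def chunk_parallel_document(
    src: List[str], tgt: List[str], max_tokens: int = 100
) -> List[Tuple[str, str]]:
    """Pack consecutive sentence pairs of ONE document into longer sequences.

    Length is measured in whitespace tokens, a rough stand-in for subword counts.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of sentences")
    chunks: List[Tuple[str, str]] = []
    cur_src: List[str] = []
    cur_tgt: List[str] = []
    cur_len = 0
    for s, t in zip(src, tgt):
        n = max(len(s.split()), len(t.split()))
        # Flush the current chunk if adding this pair would exceed the budget.
        if cur_src and cur_len + n > max_tokens:
            chunks.append((f" {SEP} ".join(cur_src), f" {SEP} ".join(cur_tgt)))
            cur_src, cur_tgt, cur_len = [], [], 0
        cur_src.append(s)
        cur_tgt.append(t)
        cur_len += n
    if cur_src:
        chunks.append((f" {SEP} ".join(cur_src), f" {SEP} ".join(cur_tgt)))
    return chunks


def context_and_current(src: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Return (context, current) pairs: one stream per encoder.

    The context is the `window` preceding sentences of the same document,
    so it is empty for the first sentence and never crosses documents.
    """
    pairs = []
    for i, sent in enumerate(src):
        context = f" {SEP} ".join(src[max(0, i - window):i])
        pairs.append((context, sent))
    return pairs


# Toy document (placeholder tokens instead of real text).
toy_src = ["a b c", "d e", "f g h i"]
toy_tgt = ["x y", "z", "w"]
toy_chunks = chunk_parallel_document(toy_src, toy_tgt, max_tokens=5)
toy_pairs = context_and_current(toy_src, window=1)
```

Because both functions are called once per document, context is never carried over from one document to the next.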

The effect you wish for should appear automatically if you are lucky.

ysig commented 4 years ago

Hi,

thanks a lot for this really precise answer! I have two small follow-up questions:

In the first case, did you need to organise the training data in a document format?
In the second, is the algorithm robust to a change of context? If you give it training data as a raw stream of sentences (like europarl) and train it (with mini-batches), will it "understand" the change in context (at least empirically), or will it group drugs together with coffee? Is there also a way to avoid that?

Thanks a lot!

Additionally (for the future reader):

Is there a way to apply learning updates consistently at the document level, instead of at the sentence level?

If we think of the sentences in a document as being like pictures in a video (semantically continuous with abrupt changes, though a bit less continuous), then we would like to mark the start and end of a document, as we do with start of sentence and end of sentence. The same analogy can be applied to sequences and paragraphs.

Then we would like to apply an update for each sentence, or for a batch of sentences in a paragraph (not of a constant size!). We would also like to know where a paragraph starts and where it ends, as well as where a document begins and ends (this of course is not the case for europarl-type data).

If there is something in marian-nmt that approximates this logic, please let me know :)
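
For the future reader, the boundary logic described above can at least be approximated at the data-preparation level, outside of marian-nmt. Below is a minimal sketch, assuming the corpus is a stream of lines in which an empty line ends a paragraph and a `<doc>` line separates documents. Both conventions, the marker name, and the function name are hypothetical. It yields variable-size batches, one per paragraph, tagged with document and paragraph indices, and says nothing about how a trainer would consume them.

```python
from typing import Iterable, Iterator, List, Tuple

DOC_BREAK = "<doc>"  # hypothetical line that separates two documents


def paragraph_batches(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (doc_id, para_id, sentences) with one variable-size batch per paragraph."""
    doc_id, para_id = 0, 0
    batch: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line == "" or line == DOC_BREAK:
            # A boundary closes the current paragraph, if it has any sentences.
            if batch:
                yield doc_id, para_id, batch
                batch = []
                para_id += 1
            if line == DOC_BREAK:
                # New document: paragraph numbering restarts.
                doc_id += 1
                para_id = 0
        else:
            batch.append(line)
    if batch:
        yield doc_id, para_id, batch


# Toy stream: two documents, with s*/t* standing in for sentences.
toy_stream = ["s1", "s2", "", "s3", "<doc>", "t1", "", "", "t2", "t3", "t4"]
toy_batches = list(paragraph_batches(toy_stream))
```

Whenever `doc_id` changes, any context carried over from earlier sentences could be reset, which is one way to avoid mixing unrelated documents.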

marian-nmt / marian

How to guide translation from the context of previous sentences. #303