marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

How to guide translation from the context of previous sentences. #303

Open ysig opened 4 years ago

ysig commented 4 years ago

Hi. I am new to marian-nmt and, fascinated by its capabilities, I was curious whether there is support for the following problem:

Let's say I want to build a translation system that translates documents.

In the language I want to use (Modern Greek), if you translate a document sentence by sentence, even a human may be unable to tell from a single sentence alone whether, for example, the word "drug" refers to pharmaceutical drugs or street drugs (which are translated differently in Greek: "φάρμακα" and "ναρκωτικά" respectively), or whether "them" in "for them" refers to people or to objects (again translated differently: "για αυτούς" and "για αυτά").

So I was curious whether there is a way for the translation process, when translating a sentence in a document, to take into account the vector representations it inherently produces for the previous sentences (say, when using GRUs), so that the translation of the current sentence becomes more accurate*.

In any case if you have anything to propose, I would really like to hear it.

Thanks in advance!! PS: Cheers for this amazing package :)

* I guess the problem there is how to handle a change of context.

emjotde commented 4 years ago

Hi, the problem here depends on how you have implemented your document-level system. With current Marian, I would say there are two ways to achieve that out of the box with little to no need to code anything:

The effect you wish for should appear automatically if you are lucky.

ysig commented 4 years ago

Hi,

Thanks a lot for this really precise answer! I have two small follow-up questions:

Thanks a lot!


Additionally (for the future reader):

Is there a way to apply learning updates consistently at the document level, instead of at the sentence level?

If we think of the sentences in a document as being like frames in a video (semantically continuous with abrupt changes, though somewhat less continuous), then we would like to mark the start and end of a document, just as we do with start-of-sentence and end-of-sentence. The same analogy can be applied to sequences and paragraphs.

Then we would like to do an update for each sentence or for a batch of sentences in a paragraph (not of constant size!). We would also like to know where a paragraph starts and stops, and where a document begins and ends (which is of course not the case for Europarl-type data).
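To make the idea concrete, here is a minimal Python sketch of the data side only: it groups sentences into one variable-size batch per paragraph and records where each document starts and ends. It is not Marian code and does not use any Marian API. The marker strings (`<doc>`, `</doc>`, `<par>`, `</par>`) and the names `ParagraphBatch`, `paragraph_batches` and `to_marked_stream` are all hypothetical, chosen just for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Hypothetical boundary markers; NOT part of Marian's vocabulary or API.
DOC_START, DOC_END = "<doc>", "</doc>"
PAR_START, PAR_END = "<par>", "</par>"

# A document is a list of paragraphs; a paragraph is a list of sentences.
Document = List[List[str]]


@dataclass
class ParagraphBatch:
    sentences: List[str]    # all sentences of one paragraph (size varies)
    starts_document: bool   # True for the first paragraph of a document
    ends_document: bool     # True for the last paragraph of a document


def paragraph_batches(documents: List[Document]) -> Iterator[ParagraphBatch]:
    """Yield one variable-size batch per paragraph, with document boundaries."""
    for doc in documents:
        paragraphs = [p for p in doc if p]  # skip empty paragraphs
        last = len(paragraphs) - 1
        for i, paragraph in enumerate(paragraphs):
            yield ParagraphBatch(
                sentences=list(paragraph),
                starts_document=(i == 0),
                ends_document=(i == last),
            )


def to_marked_stream(documents: List[Document]) -> List[str]:
    """Flatten documents into one sequence with explicit boundary markers."""
    stream: List[str] = []
    for batch in paragraph_batches(documents):
        if batch.starts_document:
            stream.append(DOC_START)
        stream.append(PAR_START)
        stream.extend(batch.sentences)
        stream.append(PAR_END)
        if batch.ends_document:
            stream.append(DOC_END)
    return stream


if __name__ == "__main__":
    docs = [
        [["First sentence.", "Second sentence."], ["New paragraph."]],
        [["Another document."]],
    ]
    for batch in paragraph_batches(docs):
        print(len(batch.sentences), batch.starts_document, batch.ends_document)
```

A training loop could then do one update per `ParagraphBatch` and reset any carried-over context whenever `starts_document` is true. How (or whether) that maps onto Marian's own batching is exactly the open question here.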

If there is something in marian-nmt that approximates this logic, please let me know :)