Allow or disallow multiple target entries

transl8bzimport commented 13 years ago

Originally posted by Marce van Velden:

For some projects it is useful to disallow multiple target entries for the same sourcetext+sourcelang combination (to prevent cluttering of the TM with many alternatives for the same source text).

transl8bzimport commented 13 years ago

Originally posted by Marce van Velden:

Created attachment 760

This patch makes it possible to set this option (in settings.py):

MULTIPLE = TRUE -> behaviour as it was, i.e. allow multiple entries

MULTIPLE = FALSE -> disallow/delete multiple entries

Some things to consider:

using a unique index, for performance reasons
making this settable per source language rather than for the whole tmserver
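
As a rough illustration of the unique-index idea, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the index name and the functions make_db, add_unit and targets are all hypothetical and are not taken from the attached patch or from the tmserver code. With the unique index in place, the database itself replaces the old target on a conflict; without it, the same insert simply adds another row.

```python
import sqlite3


def make_db(multiple):
    """Create an in-memory TM table (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tm (source TEXT, source_lang TEXT, target TEXT)")
    if not multiple:
        # One entry per sourcetext+sourcelang combination.
        conn.execute(
            "CREATE UNIQUE INDEX tm_source_key ON tm (source, source_lang)"
        )
    return conn


def add_unit(conn, source, source_lang, target):
    """Insert a translation; a unique-index conflict replaces the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO tm (source, source_lang, target) VALUES (?, ?, ?)",
        (source, source_lang, target),
    )


def targets(conn, source, source_lang):
    """Return all stored targets for a source, oldest first."""
    rows = conn.execute(
        "SELECT target FROM tm WHERE source = ? AND source_lang = ? ORDER BY rowid",
        (source, source_lang),
    )
    return [row[0] for row in rows]
```

Calling make_db(multiple=False) and adding two targets for the same source leaves only the later one, while make_db(multiple=True) keeps both.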

friedelwolff commented 13 years ago

I would like to see a stronger argument made by means of a use case. As a counter-example: how do we handle noun/verb ambiguities? With this you can't store translations for ambiguous strings like

View - Aansig (noun)

View - Bekyk (verb)

or

Empty trash - (command)

Empty trash - (description)

Tying this to a source language makes no sense in general, of course, only if you use the source language as a way to group project-related information, as I understand you do.

So we'll need to flesh this out better, I think.

transl8bzimport commented 13 years ago

Originally posted by Marce van Velden:

The examples you give might be more suitable to (also) be stored in a term bank.

For longer segments I believe it is rare that you want to store different translations of the same segment. In most cases we only want to keep the best (which is hopefully the last correction of the sentence). This of course requires not allowing everyone to submit entries directly to the main TM (as bad translators might ruin perfectly good sentences).

Another option of course is to store everything but allow the client to request for max x suggestions of the same segment (ordered by submission date)

The use case I know is that most tm software in the localization industry supports this feature and that in our company we use the MULTIPLE = FALSE feature most of the time.

Another reason for this is that it helps in keeping your TMs clean, which is better not only when serving suggestions to the translator; it also helps when we want to train an MT engine from the TM(s).

friedelwolff commented 13 years ago

(In reply to BZ-IMPORT::comment #3)

The examples you give might be more suitable to (also) be stored in a term bank.

Sure, view/view can be stored in a term bank, but that is irrelevant here. It can be stored in TM as well, since we might translate something with such an ambiguity, and if we artificially keep one of them out, we are just skewing our results in some random direction (towards whatever happened to be the last one imported).

A more convincing example for ambiguities might be anything where the object/subject of a sentence is "it" that refers to something in a previous segment or other context. In a language with noun genders like French or German, the "it" would be translated differently, and this has nothing to do with terminology, but simply about "one-size-fits-all" not being good enough. Tell me if I'm not making sense :-)

For longer segments I believe it is rare that you want to store different translations of the same segment. In most cases we only want to keep the best (which is hopefully the last correction of the sentence). This of course requires not allowing everyone to submit entries directly to the main TM (as bad translators might ruin perfectly good sentences).

I agree that later translations are likely to be better sometimes if we have a workflow where improved translations are imported to replace older translations. Keeping different translations for the same source text is completely realistic, especially if we consider the initial design of amagama to combine translations from different projects (clients, text types, etc.) all together. The initial focus was maybe on doing post-processing or sorting after retrieval, whereas I think your current approach is to store and query things entirely separately. So I think that is at the centre of the divergence here, but tell me if you think I'm summarising it wrong. In that sense it is just a small mismatch between our respective expectations and the way we currently use it.

Another option of course is to store everything but allow the client to request for max x suggestions of the same segment (ordered by submission date)

In general this is closer to what I had in mind. We already allow the client to specify the number of suggestions, so we'll only need to associate a timestamp and use that to influence ranking. It would increase the size of the database and make queries a bit more expensive, but I think it solves this problem in a slightly (arguably) better way.
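
To make the "store everything, rank by timestamp" idea concrete, here is a minimal sketch, again using sqlite3. The schema, the integer submitted column and the function names are hypothetical; it only does exact matching on the source text, whereas the real lookup would also have to combine recency with match quality.

```python
import sqlite3


def make_db():
    """Create an in-memory TM table with a submission timestamp (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tm (source TEXT, target TEXT, submitted INTEGER)")
    return conn


def add_unit(conn, source, target, submitted):
    """Keep every translation; nothing is overwritten."""
    conn.execute("INSERT INTO tm VALUES (?, ?, ?)", (source, target, submitted))


def suggestions(conn, source, max_candidates):
    """Return at most max_candidates targets, most recent submission first."""
    rows = conn.execute(
        "SELECT target FROM tm WHERE source = ? "
        "ORDER BY submitted DESC LIMIT ?",
        (source, max_candidates),
    )
    return [row[0] for row in rows]
```

A client that only wants the latest translation would ask for max_candidates=1, while the database still holds the full history.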

The use case I know is that most tm software in the localization industry supports this feature and that in our company we use the MULTIPLE = FALSE feature most of the time.

Maybe I'm not clear on what I mean with a use case. What I meant is that I'd like to hear "stories" of a user that doesn't want all the suggestions, rather than a comparison with other tools. I understand that you implement it because you want it this way :-) I'm just trying to figure out if this is necessarily an important feature, a good default behaviour or even worth maintaining in general, although I can see at least some of the value. If it is only an issue of the value of cleaner data at the cost of fewer (valid) translation alternatives, I'd like to know what it is we are working with.

Do you maybe have some references to documentation of the other tools you refer to? I'd like to see if there is some kind of reasoning behind it that might be explained in their documentation. I might be thinking in the wrong direction.

Another reason for this is that it helps in keeping your TMs clean, which is better not only when serving suggestions to the translator; it also helps when we want to train an MT engine from the TM(s).

TM maintenance is an interesting argument. My initial idea was to always rebuild the amagama database from scratch, whereas I understand that you'd like to simply provide the improved files, redo the import over the same database and have old (hopefully wrong) entries disappear. However, I'm not convinced that training for MT is a good argument, since you want the full variety of valid translations to be in your training data, which is again an argument against removing things that might be there for valid reasons, like the ambiguity I mentioned.

Please don't get me wrong: I think this is an interesting idea, but I'd like to make sure what the end goals are before we just blindly put it in the code. Thanks for your perseverance :-)

transl8bzimport commented 13 years ago

Originally posted by Marce van Velden:

(In reply to BZ-IMPORT::comment #4)

In a language with noun genders like French or German, the "it" would be translated differently, and this has nothing to do with terminology, but simply about "one-size-fits-all" not being good enough. Tell me if I'm not making sense :-)

That makes sense, so therefore this option should be settable, depending on your use case. And if you do allow different translations for the same source text it should be possible to easily filter these out with a manage command afterwards, whenever required. I believe this should also be an option in tmx export.

A reason for not allowing this could be that we don't want to keep unreviewed translations in our TM (comparable to suggestions in Pootle). I.e. translators translate a given group of documents, which results in a TM. After this the translations are reviewed and corrected. We only want to store the corrections (and the segments which were already correct). Although I believe another idea would be to attach a user ID to the TM entry, which would allow for more management options (trust levels etc.).

Another option of course is to store everything but allow the client to request for max x suggestions of the same segment (ordered by submission date)

In general this is closer to what I had in mind. We already allow the client to specify the number of suggestions, so we'll only need to associate a timestamp and use that to influence ranking. It would increase the size of the database and make queries a bit more expensive, but I think it solves this problem in a slightly (arguably) better way.

I agree that it should be possible to keep a history of translations. But this requires more TM management work, which should be made easily available. Though I can assure you our project managers prefer not to have to do this management most of the time. General TM management options should be added: options to manage what gets exported, and options in the client request (to allow the client to request only the last segment, or the last segment checked by a trusted user, etc.). Still, to prevent the TM from exploding and to keep the TM consistent (if you don't do management), I believe it should be an option (not the default) to not allow multiple translations. The benefit of this is that it is easy and does not require all the management.

The use case I know is that most tm software in the localization industry supports this feature and that in our company we use the MULTIPLE = FALSE feature most of the time.

Maybe I'm not clear on what I mean with a use case. What I meant is that I'd like to hear "stories" of a user that doesn't want all the suggestions

Hmm, well, this actually is the case: our translators only care for the best option (there are multiple solutions for that). They don't want their screen to be cluttered with multiple alternatives.

Another reason (for us) is that in a translation job we will often create a TM just for the job and will import it into a broader memory after completion of the job. In this job TM we only want the final translations (and not the intermediate steps to get there).

So in summary, I believe this option should be available as a simple way to avoid unnecessary duplication in the translation memory. There are other management options that should be implemented to allow for more powerful TM segment management.

friedelwolff commented 13 years ago

Hi Marce. Sorry for getting behind on this. Real life caught up a bit with me. In the meantime I've thought of a possible solution for this to hopefully accommodate both our preferences. I'd like to hear what you think:

What about a setting to indicate the length of source strings up to which different target translations are allowed? If, say, the setting is 100 characters, amagama accepts all translations of texts shorter than 100 characters, but for longer segments it overwrites the targets. A setting of zero means no alternatives allowed (what you want) and a setting of -1 means any number of alternatives allowed (what I want).

What do you think about that?
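
A minimal sketch of how such a length threshold could behave, using a plain dict in place of the database. The names alternatives_allowed, add_unit and max_length are hypothetical, and treating a source of exactly max_length characters as "overwrite" is an assumption, since the proposal above only speaks of shorter and longer texts.

```python
def alternatives_allowed(source, max_length):
    """Decide whether several targets may be kept for this source text.

    max_length < 0 means any number of alternatives is allowed;
    max_length == 0 means alternatives are never allowed.
    """
    if max_length < 0:
        return True
    # Assumption: a source of exactly max_length characters is overwritten.
    return len(source) < max_length


def add_unit(tm, source, target, max_length=100):
    """Add a translation to tm, a dict mapping source text to a list of targets."""
    targets = tm.setdefault(source, [])
    if not alternatives_allowed(source, max_length):
        # Long segment (or alternatives disabled): overwrite earlier targets.
        targets.clear()
    if target not in targets:
        targets.append(target)
```

With the default of 100, a short string like "View" keeps both "Aansig" and "Bekyk", while a 150-character segment only keeps the most recently added target.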

translate / amagama

Allow or disallow multiple target entries #1966