php-gettext / Gettext

PHP library to collect and manipulate gettext (.po, .mo, .php, .json, etc)
MIT License
687 stars 134 forks source link

Duplicate translations with different XLIFF unit IDs are not maintained #224

Open asmecher opened 4 years ago

asmecher commented 4 years ago

This is a knock-on issue from https://github.com/oscarotero/Gettext/issues/220. (I hope you won't regret adding that already!)

An XLIFF file may contain several units with the same translation content... (silly example but you get the point)

<unit id="am.permanently">
  <segment>
    <source>am</source>
    <target>soy</target>
  </segment>
</unit>
<unit id="am.temporarily">
  <segment>
    <source>am</source>
    <target>estoy</target>
  </segment>
</unit>

However, when loading the translation from the XLIFF file, only one gets loaded into $translations. This is because the two will have the same ID, based only on the source text ("am"), without considering the unit ID.

Thus trying to save the loaded XLIFF again will drop one of the two entries.

asmecher commented 4 years ago

@oscarotero, I am happy to code up a solution, but as with https://github.com/oscarotero/Gettext/issues/220 I'd like a little guidance about the best option because I'm not familiar with how the translation IDs are used elsewhere in the library.

oscarotero commented 4 years ago

Mmm, I don't know how to solve this.

On the one hand, the unit id is not unique, on the other hand, the source is not unique either. I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

asmecher commented 4 years ago

I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

I don't think XLIFF has any requirements of uniqueness involving the source text. The XLIFF 2.0 spec does talk about uniqueness of IDs, but that's pretty much it. For example, by my read, each unit (=Translation class instance) can be uniquely identified by a combination of unit ID and file ID (see section 3.1).

I think it'll be necessary to inform the mechanism for ID generation (currently the static generateID function) based on the format of the file being sourced.

oscarotero commented 4 years ago

In gettext, we have the following structure:

Right now, the equivalent in Xliff would be:

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:

# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"
asmecher commented 4 years ago

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards? My understanding of .po files is that msgid should always be the source language, likewise source elements in XLIFF.

oscarotero commented 4 years ago

I was thinking in how to use these translations in the html templates. Using the unit-id as the msgid, you can get the translations by that id:

<h1><?= __('unit-id') ?></h1>

But, at the same time, if you scan this template searching for gettext entries, this translation will have "unit-id" string as the original untranslated string. So, finally, we can assume that the unit id and the source are the same thing (or at least, share the same purpose: to identify a translation).

I don't know, it's hard to fit the Xliff format into gettext without lossing data. I'm open to other possible workaround.

asmecher commented 4 years ago

@oscarotero, we already have large parts of an internationalization toolset in place, including a caching layer that'll go in front of Gettext and tools in our PHP code and Smarty templates for getting translations. What I'm hoping to use Gettext for is specifically its XLIFF load/save features -- i.e. just the XLIFF extractor and generator, but not the Translator class. My operations on the translations always deal with the full list of translations -- i.e. foreach($translations as $translation) {...}

We're introducing Gettext in an attempt to move from our own bespoke XML files to the XLIFF standard. We're using symbolic IDs instead of English-language text in our source code, hence the use of the unit ID in XLIFF. Our goal is to facilitate the use of commodity translation tools like POEdit and Weblate, thus we need the XLIFF content to be usable within those tools (and I think using msgid or <source> with the unit ID will break them).

I think the purpose of generateID etc. is to make it easy for the Translator class to fetch a translation quickly, correct? I can see two options:

oscarotero commented 4 years ago

Maybe the indexing/deduplication tools belong in Translator rather than Translations?

Translation has the getId() and the static generateId() methods, so the deduplication belongs to Translation. This value is important not only to search translations but also to merge different Translations instances.

The right solution could be add a setId() method, so you could choose another different way to define the ids/deduplication and use generateId internally as a fallback. The only problem is Translations::find that currently depends on this method to find the translations. Removing or changing this method is a breaking change, and I don't want to release a major version right now. And if you define a different id, you can't find a specific translation by context + original. Using a foreach is discarded for performance reasons.

Maybe we could add the Translation::setId() method with a warning that using this method makes imposible to search the translation in a Translations instance using the context + original.

And because Xliff generator use this method to generate ids, include the warning here too.

asmecher commented 4 years ago

Thanks, @oscarotero, I submitted this according to the suggestions you made: https://github.com/oscarotero/Gettext/pull/225

I think there is still one issue to resolve, as noted in the PR comment.

asmecher commented 4 years ago

From your earlier comment:

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:

# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

My reaction was...

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards?

...but it turns out many projects actually do use PO files written this way (with symbolic IDs in the msgid), and it's supported in translation tools like Weblate (known there as "Monolingual gettext").

Just documenting something I learned in the hopes it'll be useful to someone else :)

ShawnMilo commented 2 years ago

I'm maintaining XLIFF tools for a free site: https://xliff.tools/

Does the "ID Replacement" tool do what's required? If not, I'm interested in learning more about the issue and possibly implementing it.

Thanks!