pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

Replace bespoke translation toolset with more standards-based options #4779

Closed hsluoyz closed 4 years ago

hsluoyz commented 5 years ago

Currently, all translations are done in XML files, as mentioned in https://github.com/pkp/pkp-lib/issues/4029#issuecomment-417907420, which is very inefficient for translators: it is hard to translate, or to sync a few items across dozens of XML files.

Is there any chance of using a more advanced online translation platform like https://crowdin.com/? In Crowdin, translators only need to do the translation in the web browser, and there is no need to track which strings have not been translated yet. Translations are deployed automatically with a new git commit. Can we consider it?

asmecher commented 5 years ago

@hsluoyz, we have been hoping to replace our own translation tools with something else for some time, but unfortunately have not been able to make much progress on it. I was not aware of Crowdin, which does appear to have a free open source/academic plan:

Can I get an Open Source or a free Academic License?

Yes. If you want to use Crowdin for an Open Source project, sign up for a free account, set up your project and send us a request. Apply for an Academic License if your project has educational purposes. Each granted license will include an unlimited number of projects, strings, and members.

We're hesitant to use freemium services for necessary elements of the software, but some of the other translation options we've been considering are freemium as well.

(Tagging @mtub and @marcbria)

marcbria commented 5 years ago

As you know, I'm not a fan of proprietary software, so my vote will always be no... even less so when we have free alternatives (free as in freedom; I'm OK if it's a paid service) that cover all the requirements (GitHub/GitLab integration, import/export, translation memories, glossaries, multiple formats...). If somebody is interested, we made a comparison of the requirements.

In Heidelberg we started a Weblate instance that is still up and running. I can keep it in production for PKP if you need hosting... but we need somebody with time to set it all up correctly. OJS's native files need to be parsed and converted to XLIFF, and then mapped in the tool.

IMHO, it's a huge task at the beginning, but it will make the translation work a piece of cake and facilitate the integration of non-technical profiles into the translation team.

If we are not going to host our own tools (which IMHO is an error ;-)), SaaS could be an option, but... a) it needs to be something based on free software... b) and we need to be completely sure we can move everything out of the tool if in the future we don't like the service conditions.

Weblate's SaaS meets both of those conditions, while Crowdin doesn't.

You all know what happens when we trust proprietary tools that start out "free of charge" to attract enough people, and then move to a restrictive business model.

hsluoyz commented 5 years ago

Hi @asmecher ,

I was not aware of Crowdin, which does appear to have a free open source/academic plan:

In fact, I'm using Crowdin on the docs site of my own project: https://crowdin.com/project/xxx; the site is here: https://xxx.org/. You can see there's an "English" button at the top to switch the translation. Of course, there are many popular projects using it (see here), including Minecraft, Khan Academy, and GitLab. Other people have also recommended it to me, and currently it seems to be the most popular online translation platform (correct me if I'm wrong).

HI @marcbria ,

As you know I'm not a fan of a privative software, so my vote will be always no... and less when we have free alternatives (free as in freedom. I'm ok if it's a paid service) that covers all the requirements (github/lab integration, import/export, translation memories, glossaries, multiple formats...). If somebody is interested, we made a comparative of the requirements.

I think Crowdin covers these requirements (I haven't checked GitLab integration, as I'm using GitHub only).

a) it need to be something based on free software...

Using a self-hosted translation tool indeed gives us more control. But it also requires much more effort. The main focus of this project is academic journal publishing software, not translation software. We don't need to build or host every service on our own. GitHub is actually non-free software, but we still use it for free and open-source projects, right? GitLab is still not as popular as GitHub.

We can let professionals do their professional job. Currently, this project (OJS) is already short-handed, as many translations are incomplete (at least in Chinese, as far as I checked). We don't have the capacity to build and host a translation system.

b) and we need to be completely sure we can move everything outside the tool if in future we don't like the service conditions.

I understand your concern, but as I said above, much larger and more popular projects like Minecraft, Khan Academy, and GitLab are already using Crowdin. We are not the ones who would be hit first if the sky falls. Even if Crowdin shut down one day, we would still have all the translation files (which will be stored in our repository). It's no worse than the current situation. We have nothing to lose.

marcbria commented 5 years ago

Using a self-hosted translation tool indeed gives us more control.

Sounds like a good idea to me.

But it also requires much more effort.

Not necessarily. Weblate offers free hosting for free software projects. If not, I can offer my servers for free.

The main focus of this project is academic journal publishing software, not translation software.

Thanks for sharing your thoughts about the goal of OJS and PKP, but I think you are missing the whole picture. From my perspective, the PKP project is not only about tools... it's mainly about "Public Knowledge" and, as I said, when we have free alternatives, I have no doubt that's the way to go. We need to ensure we don't depend on proprietary initiatives, and supporting free software is also a way to empower the whole community.

We don't need to build or host every service on our own.

Of course we don't, but we can if we like. In the end, moving from our own translation tool to a community-built one is also a way to optimize our dev resources.

GitHub is actually non-free software, but we still use it for free and open-source projects, right?

And I think that is an error, but don't get me started... ;-)

hsluoyz commented 5 years ago

I'm also OK with Weblate. I didn't know it before and found it to be excellent after some googling. I hope this platform will be ready soon so we can get started translating.

marcbria commented 5 years ago

I'm also OK with Weblate. I didn't know it before and found it to be excellent after some googling.

Great. :+1:

I hope this platform will be ready soon so we can get started translating.

Me too, but there is a lack of hands. :-(

No matter what platform we use... in all cases, we need to convert our native XML to something standard (XLIFF sounds like a good plan), then set up the tool to define translation units and configure the git export, and after this, change OJS (or every OxS tool) to read XLIFF instead of our native XML... and right now I have my hands full.

If somebody is interested in doing the job, I'm pretty sure he/she will make Marco very happy. ;-)

Until then, I'm sorry, but editing the XMLs or using the native translation tool are the only ways the community has to contribute translations.

@mtub, sorry to annoy you with this, but... you are the boss. ;-P Is Weblate fine, or do you prefer something else? Is this something to address at the Pittsburgh or Barcelona sprint this year?

asmecher commented 5 years ago

Here are two draft PRs that alter OJS and pkp-lib to use XLIFF sources instead of the current PKP-specific XML files:

To use them...

  1. Pull in the above modifications to your installation

  2. Go into lib/pkp and update your composer dependencies (composer update)

  3. Convert your locale files from PKP XML into XLIFF:

    for name in `(find locale/*/*.xml && find lib/pkp/locale/*/*.xml) | sed -e "s/xml$//" | grep -v bic21 | grep -v countries | grep -v currencies | grep -v languages | grep -v emailTemplates`; do
        php lib/pkp/tools/xmlToXliff.php ${name}xml ${name}xliff
    done

    (This is equivalent to running php lib/pkp/tools/xmlToXliff.php path/to/source-locale-file.xml path/to/target-xliff-file.xliff for each translation file that is present, except plugins; see the sketch after these steps for the gist of what the conversion does.)

  4. Flush your file cache: rm -f cache/*.php
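
For the curious, here is a simplified, hypothetical sketch of what the conversion in step 3 amounts to. This is not the real lib/pkp/tools/xmlToXliff.php (which may differ in detail); it assumes the <message key="...">text</message> structure of the PKP locale XML files:

    <?php
    // Hypothetical sketch of a PKP-XML-to-XLIFF converter; the real tool is
    // lib/pkp/tools/xmlToXliff.php and may differ in detail.
    // Input entries look like:
    //   <message key="navigation.journalHelp">Journal Help</message>

    [$script, $xmlFile, $xliffFile] = $argv;

    $xml = simplexml_load_file($xmlFile);

    $writer = new XMLWriter();
    $writer->openUri($xliffFile);
    $writer->setIndent(true);
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('xliff');
    $writer->startElement('file');
    foreach ($xml->message as $message) {
        $writer->startElement('unit');
        // XLIFF unit IDs may not contain periods, so dots become dashes.
        $writer->writeAttribute('id', str_replace('.', '-', (string) $message['key']));
        $writer->startElement('segment');
        // The symbolic key is carried in <source>; the localized text in <target>.
        $writer->writeElement('source', (string) $message['key']);
        $writer->writeElement('target', (string) $message);
        $writer->endElement(); // segment
        $writer->endElement(); // unit
    }
    $writer->endElement(); // file
    $writer->endElement(); // xliff
    $writer->endDocument();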

This is a work in progress, but should allow experimentation with XLIFF-based translation tools to see how well they work with this toolset.

asmecher commented 5 years ago

(@mbria, @marco: https://github.com/pkp/pkp-lib/issues/4779#issuecomment-496722877)

marcbria commented 5 years ago

@asmecher that looks great.

If I read correctly, with your changes OJS is now ready to read XLIFF and, at the same price, we also get a PHP helper script to convert locale XMLs to XLIFF, right?

You are a really fast coder!! ;-) Thanks a lot!!

BTW, Travis is complaining about something here: https://github.com/pkp/ojs/pull/2413 Do we need to worry?

@mtub, with Alec's changes, we now only need somebody to configure Weblate correctly and do some testing to see if we can integrate Weblate with GitLab.

I won't have time for this until at least the end of next month. :-( Is there anybody at PKP who can do the job, or some money to hire someone?

Cheers, m.

asmecher commented 5 years ago

Hi @marcbria,

BTW, Travis is complaining about something here: pkp/ojs#2413 Do we need to worry?

No, don't worry -- I didn't include the converted XLIFF files with the commits, so the tests will break because of untranslated locale keys. (It won't make sense to commit/maintain converted files until we're ready to take the plunge.)

I think the next step would be to get confirmation from someone who has worked with XLIFF files that the automatically-converted ones aren't totally crazy. I've attached one here for reference: submission.xliff.txt

marcbria commented 5 years ago

I forwarded your question to our CAT expert, and I hope he will answer in a couple of days. Thanks a lot for your work, Alec.

asmecher commented 5 years ago

@marcbria, a few questions I'd want them to consider:

  1. Symbolic vs. English-language keys

We use symbolic locale keys in the code (e.g. navigation.journalHelp), then all locales, including English, are specified in locale files. This differs a bit from the Gettext standard in that usually English-language text would be embedded in the code, then the locale files would provide translations from English into other languages.

As a result, the XLIFF will have translations like this (for French):

    <segment>
        <source>author.submit.submissionCitations</source>
        <target>Fournir une liste structurée de références pour les travaux cités dans cette soumission.</target>
    </segment>

...instead of...

    <segment>
        <source>Provide a formatted list of references for works cited in this submission. Please separate individual references with a blank line.</source>
        <target>Fournir une liste structurée de références pour les travaux cités dans cette soumission.</target>
    </segment>

Will this work e.g. with Weblate?

  2. The distribution of locale files into various directories and repositories

The translations are split between a number of Git repositories:

Within the Application and pkp-lib repositories, there are several locale files (example: pkp-lib), divided roughly into topics. (I'm open to change on this, if it's not a good fit for standard practices.)

Tools like Pootle and Weblate appear to support Projects and Components. Will that mapping match well against our use of multiple repositories and sometimes multiple locale files within them?

marcbria commented 5 years ago

Hi @asmecher

I have been out of the office for a couple of days and missed your last comment. I will read it all in depth next Tuesday, but let me share some advance comments from Adrià (the CAT expert).

He needs more time, but at first sight he said he very much agrees with you on this point:

"We use symbolic locale keys in the code (e.g. navigation.journalHelp), then all locales, including English, are specified in locale files. This differs a bit from the Gettext standard in that usually English-language text would be embedded in the code, then the locale files would provide translations from English into other languages."

And he adds:

"Of course, if the XLIFF file does not contain the original segments, the translation programs will not correctly recognize the file structure and the translators will not be able to translate.

I understand that the problem stems from the conversion process to XLIFF. If I remember correctly, when creating the XLIFF you should ask the converter to leave the targets blank. If you want, pass me the original file (the one behind submission.xliff) and I will try to take a look."

I sent him this one: https://github.com/pkp/pkp-lib/blob/master/locale/es_ES/submission.xml

I plan to meet him next week and look at Weblate together to see if we can make PKP a proposal that I think you won't be able to refuse. (Right now, I can't say more.) ;-)

Cheers, m.

asmecher commented 5 years ago

Thanks, @marcbria, sounds very intriguing! The XLIFF conversion tool was put together fairly quickly and there are surely a lot of ways of adjusting it. The submission.xliff file linked above comes from https://github.com/pkp/pkp-lib/blob/master/locale/en_US/submission.xml.

hsluoyz commented 5 years ago

Hi, any update on this?

marcbria commented 5 years ago

Not yet, sorry. Give us one or two more weeks.

veotax commented 5 years ago

Hey guys, any progress on this issue?

marcbria commented 5 years ago

Nope. Thanks for your interest, and sorry again. We have a meeting next week that will (hopefully) shed some light on some issues we still need to fix.

marcbria commented 5 years ago

We arranged a meeting with some CAT experts for tomorrow night. BTW, if somebody is an expert translator (with good knowledge of translation formats and tools), opinions and suggestions are welcome.

veotax commented 5 years ago

Any update?

marcbria commented 5 years ago

Sorry again for the silence. I'm overwhelmed, and sometimes it is difficult to find time to write down what happened.

Long story short:

@asmecher and @mtub, what do you think about discussing this at the next technical meeting? Or do you prefer a different space?

ctgraham commented 5 years ago

We are talking about this at the PGH Sprint, and Slack is down. Did we decide on .po files as the new standard?

Enjoy,


asmecher commented 5 years ago

We figured XLIFF was a better match than PO, IIRC.

marcbria commented 5 years ago

I can't avoid a strong feeling that PO is not the right way, which is why I wanted to dig deeper and find projects working with XLIFF as a monolingual format (I read about Symfony, but not more than that).

Talking with the experts, I discovered that the XLIFF format was created as a bilingual "transport" format. That means its main goal is to let you move from your native format to something that every translation tool can read... but that didn't make much sense to us. I mean, if we are ready to move from our native XML format to something more standard, we shouldn't need extra transformations, and things will work smoothly (no real evidence for this right now... just a strong feeling).

I mean, our final goal is building a translation server that will be able to read and publish (push/pull) directly to GitHub/GitLab, so it doesn't make much sense to me to keep intermediate formats to do the job.

A CAT expert called Marc (yes, we are not original with names) would like to join the team and will help us (please Marc, say something if you are reading this :-)). I'm in Mexico right now, so we plan to work on this after the summer vacations.

The work will be: 1) study what formats people are using and make a reasoned proposal to PKP for a final decision; 2) specify clearly how the translation files need to look so that PKP can tune the code; 3) set up Weblate to support the translation workflow (able to push/pull to/from our GitHub/GitLab).

I will be quite busy until PKPBCN19, but the November sprint will probably be a good moment to show progress and talk about this.

My question here is whether PKP can wait until then, or whether we need something before that.

ctgraham commented 5 years ago

We can certainly wait for the "right" solution. In the meantime, at the Pittsburgh Sprint I think we will update the Translation documentation so that it is accurate for the legacy XML files.

marcbria commented 5 years ago

Thanks Clinton. So we have a plan. ;-)

Here in Mexico I completed some visibly missing strings for the es_ES translation, but even I am too lazy to push them to the different repos, so I can't imagine translators doing that job. I mean, we all agree on this, but I wanted to show why I think the translation server is a must.

BTW, here I realized some terms need to be adapted, so I found a fellow who would like to join the translation team and create and maintain an es_MX localization (maybe on top of the es_ES version, just fixing the main things). I hope to convince him to join us in November in Barcelona and extend the OJS language list.

Best wishes for the Pittsburgh sprint. I'm sorry to miss this one.

asmecher commented 5 years ago

I did a little bit of experimenting with POEdit 2.2 (which has XLIFF support and also built-in crowdin.com support).

If we don't mind running tools over the XLIFF files, we should be able to represent our translations adequately using both English text and symbolic locale keys. This should make them usable both in OJS and by third-party translation tools.

For each locale key in a non-English language, it would look like this inside the XLIFF file:

<unit id="locale-key-goes-here">
    <segment>
        <source>English text goes here</source>
        <target>Translated text goes here</target>
    </segment>
</unit>

(Per XLIFF requirements, we would need to convert locale keys to use dashes - instead of periods . as separators.)

When OJS loads the XLIFF files, it can ignore the <source> text, and just use the unit ID to identify the text. Thus both OJS and the XLIFF editor are happy.
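
Here is a minimal sketch of such a loader (illustrative only, not the actual pkp-lib code). It uses namespace-agnostic XPath so that it works whether or not the XLIFF namespace is declared, and it assumes locale keys never legitimately contain dashes:

    <?php
    // Sketch: build a key => translation map from an XLIFF file where the
    // symbolic locale key is stored in each unit's id attribute.
    function loadXliffTranslations(string $path): array {
        $translations = [];
        $xml = simplexml_load_file($path);
        // local-name() sidesteps the XLIFF namespace declaration, if present.
        foreach ($xml->xpath('//*[local-name()="unit"]') as $unit) {
            // Convert the dashed unit ID back to the dotted locale key.
            $key = str_replace('-', '.', (string) $unit['id']);
            $target = $unit->xpath('.//*[local-name()="target"]');
            if (!empty($target)) {
                // <source> is ignored; the unit ID alone identifies the string.
                $translations[$key] = (string) $target[0];
            }
        }
        return $translations;
    }

    // e.g. $strings['author.submit.submissionCitations'] => 'Fournir une liste...'
    $strings = loadXliffTranslations('locale/fr_CA/submission.xliff');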

In workflow terms, this will mean creating a tool that will munge non-English XLIFF files as follows:

These are all things the translator plugin already does, albeit with our XML files rather than XLIFF.

It means we'll periodically have to run this over the translation files, e.g. working that into our translation schedule, but I can't think of any downsides beyond slight inconvenience.

mtub commented 5 years ago

This sounds good. My main concern remains that there will be translators who want to work directly on the translation files without using any other translation tool (POEdit, translation plugin…). It would still be possible, but how would we make sure that the source entries (English text) stay unchanged? Additional comments:

asmecher commented 5 years ago

...How would we make sure that the source entries (English text) stay unchanged?

We're following standard practice here -- when creating a translation template for the translator to work on, an empty XLIFF file is created by extracting English text directly from the source code using a variety of tools. No changes are ever applied back to the source code -- the <source> contents are essentially always throw-away with XLIFF, and the <target> elements are all that matters.

(Our only deviation from standard practice is that we will, in fact, have an English-language XLIFF file.)

we did not want to have English text in the translated files so that missing translations can be easily spotted and as an incentive to complete the translation. [...] Are we going to step away from this approach now?

No, we're going to keep using symbolic keys in the source code. There's no change here. XLIFF also has a nice flag for translations that need review; we can use that to hint to translators (through translation tools) that something might have changed in the English locale that could mean a change is required in the translator's locale.

marcbria commented 5 years ago

Hi @asmecher,

The proposal you outline in your post is one of the 3 options we have in front of us now. Thanks a lot for taking a detailed look and showing that it is feasible.

I admit it was really difficult for me to understand why we need to keep the original English string in our translation files, but this is how XLIFF works. It's a bilingual format (designed for "transportation"... that is, for going from one format to something standard). In my head this is still a big overhead... something useless that adds noise instead of making things easier.

Then I thought, "why not use XLIFF as a monolingual format?" Fewer strings to keep in sync... smaller files, but based on a more complete standard than PO. After meeting our CAT experts (please, Marc and Adrià, say hello to the team to show that you are not my imaginary friends) and a fellow of the OASIS consortium (specialized in XLIFF), we did some preliminary research and only found a couple of projects using this approach. I'd like to talk with them further to discover why so few are doing this, and to do some testing to learn how Weblate would work with it.

What I'm still looking for is a KISS solution: I want OJS to work with a format Weblate can pull/push directly to GitHub/GitLab without any intermediate conversion script. Clean and simple. I want Weblate to manage the translation workflow and let our translation coordinator (@mtub) pull and push when he thinks the work is good enough to create a new branch that the dev team can merge.

The point is that the more I research, the more doubts I have about XLIFF because of its goal as a format, and I'm starting to understand why most people mainly use PO. But I'd like to test Weblate and check all this with Marc and Adrià to be completely sure about what I'm saying.

asmecher commented 5 years ago

...why most people mainly use PO...

There is only one thing that XLIFF offers that I haven't found in PO files: the ability to associate IDs with strings.

In XLIFF files, for each string, we have...

In PO files, we only have

Most projects won't need to use the unit ID for anything, but for OJS, we need it for the symbolic locale key.

If PO files had the capacity to include an ID with each string, we could use the whole toolchain with PO files instead.
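
To make the gap concrete: in a conventional PO entry, the English text itself is the identifier, and there is no separate slot for a symbolic key. (The strings below are borrowed from the XLIFF example earlier in the thread.)

    # PO: the msgid doubles as both identifier and English source text;
    # there is no separate slot for a key like author.submit.submissionCitations.
    msgid "Provide a formatted list of references for works cited in this submission. Please separate individual references with a blank line."
    msgstr "Fournir une liste structurée de références pour les travaux cités dans cette soumission."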

marcbria commented 5 years ago

BTW, I asked Oscar, the oscarotero/gettext developer, about the right path, and he suggests PO: https://github.com/oscarotero/awesome-design/pull/1#issuecomment-522146338

asmecher commented 5 years ago

BTW, I asked Oscar, the oscarotero/gettext developer, about the right path, and he suggests PO

I trust his guidance in the general case, but we have a few project-specific wrinkles. Using PO files would take us back into a situation where we either have to...

I think both of those are show-stoppers.

We can work around that problem with XLIFF as I described above, by embedding the symbolic.locale.key in the unit ID attribute of the XLIFF file so that it contains all 3 pieces of information -- locale key, English text, and translated text. But I don't think PO files have an equivalent ID field.

All of this stems from the decision back in the first days of OJS 2.0 to map from symbolic.locale.keys to localized text, instead of English language into other languages.

NateWr commented 5 years ago

Convert all our source code so the English is hard-coded, then PO files translate from English to Whatever, or

You're all going to hate me, but I really think this should be our long-term goal. I think it's admirable to try to avoid giving English preference over other languages, but our current approach of using symbolic keys is structured in a way that actively encourages the introduction of bugs during the developer workflow.

The great benefit of PO's approach of putting text directly into the source code is that it puts related concerns together. By separating text from the source code, we regularly introduce three problems when we write new code:

Any translation solution we aim for should also aim to reduce the number of bugs we introduce to the system and make it easier to keep the codebase clean of outdated strings.

I've specifically said "aim for" and "long-term goal" because it is probably not feasible for us to jump directly to inlined language. If I understand Alec's XLIFF proposal to use id attributes as symbolic keys, I wonder if this gives us a route towards inlined language in the future.

Would it be possible to support two ways of using translatable strings like the following?

// In the source code
$current = __('locale-key-current', ['number' => 10]);
$total = __('{$total} items in total', ['total' => 100]);
<!-- In the locale files -->
<unit id="locale-key-current">
    <segment>
        <source>Showing {$number} items</source>
        <target>Montrant {$number} articles</target>
    </segment>
</unit>
<unit>
    <segment>
        <source>{$total} items in total</source>
        <target>{$total} articles au total</target>
    </segment>
</unit>

In the medium term, our locale script would first look up a translated string by id. If none was found, it would then match against <source>.
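
A sketch of that two-pass lookup (the function and array names are illustrative, not the actual pkp-lib API; $byId and $bySource would be maps built from the locale file, unit id => target and source text => target respectively):

    <?php
    // Resolve a translatable string: try the symbolic key first, then fall
    // back to matching against <source> text, per the two styles shown above.
    function translate(string $keyOrText, array $params, array $byId, array $bySource): string {
        $text = $byId[$keyOrText]        // 1. look up by unit id (symbolic key)
            ?? $bySource[$keyOrText]     // 2. fall back to <source>-text matching
            ?? $keyOrText;               // 3. last resort: echo the input back
        // Substitute {$name}-style parameters.
        foreach ($params as $name => $value) {
            $text = str_replace('{$' . $name . '}', (string) $value, $text);
        }
        return $text;
    }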

If this is possible, it would give us a near-term migration that gets our translators onto sensible tools and allows them to start translating now. Then, over time, we would be able to refactor our source code to use inlined language, gradually reducing the amount of code we maintain the old way and reducing the number of translation-related bugs we introduce.

I understand this would not solve the concern that English is the primary language. But I don't see a way around this. The source code is written in English. English is the only common language for all of our developers and the language in which core development is discussed.

The best thing we can do to support other languages is to make it easy to translate the software and easy to maintain those translations over time. Inlined language will help us do that.

ctgraham commented 5 years ago

Convert all our source code so the English is hard-coded, then PO files translate from English to Whatever...

... I really think this should be our long-term goal.

^ Second.

IIRC, the reason this is a "show stopper" is that it will be painful and initially brittle to make the change. But if this _('English text') is the industry standard (and we are the only project I know of that uses symbolic keys), then we should be moving in that direction.

asmecher commented 5 years ago

I updated the PRs while in transit between different corners of South America without Internet access and am only uploading the results now -- lest everyone think I'm ignoring the discussion above :)

It requires https://github.com/oscarotero/Gettext/pull/221 (which is not IMO ready for inclusion).

I'll consider the rest of this discussion and see how I feel about it in a few days -- thanks for the input, everyone!

asmecher commented 5 years ago

So, based on the opinions of @ctgraham and @NateWr, I can be convinced to adopt as our long-term goal what seems to be the industry standard: unilingual text in the code, with mappings from there to other languages. That gives us a series of major transitions...

  1. Change the translation file format from PKP's XML to something standard
  2. Change the translation tools over from our home-brew stuff to something 3rd-party
  3. Change the text in the code from symbolic locale keys over to English-language text.

All three of those are pretty major, so I propose staging it like this:

Stage 1:

Stage 2:

PO and XLIFF are interchangeable insofar as we would use them, so I don't think a conversion later would be a big deal, if we think it's warranted. But if we wanted to use PO files, we'd need to move the English text into the code first.

Don't forget that we'd need to push the changes out to all the plugins etc., so we might want to leave considerable time between stage 1 and stage 2 for the adaptations to take hold!

marcbria commented 5 years ago

@NateWr I can't hate you, but... :-)

There is only one thing that XLIFF offers that I haven't found in PO files: the ability to associate IDs with strings.

Please, take a look at this post: https://phptherightway.com/#discussion-on-l10n-keys

Using "msgid as a unique, structured key", as @asmecher suggested in a former post, is a fairly widespread practice.

As Alec said, I also suspected it "won't work well with any of the translation editing/creation tools, as they wouldn't have access to the English text to present to translators", which is why I was asking for time to test Weblate/POEdit/etc. before making the decision.

PO and XLIFF are interchangeable insofar as we would use them, so I don't think a conversion later would be a big deal.

I have my concerns about this. In a former post, Oscar commented that his library has not been deeply tested with XLIFF files... so we had better expect surprises.

If this _('English text') is the industry standard (and we are the only project I know of that uses symbolic keys), then we should be moving in that direction.

Well, the main projects I know a little (Drupal and WordPress) hardcode English text, but as pointed out before, we are not "the only project using symbolic keys", and there is no standard here.

More than this, it's a fact that, after 20 years, PO is the de facto standard, but the format is quite limited, and some big projects (such as Symfony) are including XLIFF, YAML or JSON in their i18n libraries, so it looks like something is changing.

I mean, I understand Nate's arguments for including hardcoded English text in the code, and I really don't mind if English is always the primary language (because it is a fact and nothing to hide), but I still see benefits in keeping our symbolic locale keys. Not a very thought-out list, but:

Symfony people give more arguments: https://symfony.com/doc/current/components/translation/usage.html#creating-translations

In short... if we keep symbolic keys, we have a little more flexibility in translations: we don't need to make literal translations, and we can adapt them to each local need without being afraid that the English string will change a comma or a capital letter.

Anyway, you are right that we need QA tools to remove locale strings that are no longer used. Hopefully, with standard formats, we can find a tool for this.

So, my personal conclusion here is:

Yes. The PHP community mainly works with PO, Oscar suggested PO, the CAT experts won't recommend XLIFF and also suggested PO instead... I mean, if we had to make the decision right now, I would go with PO, but I'm a chicken with big decisions and I want to be completely sure. ;-)

But: IF we decide PO is the right move... is it too crazy to move directly to PO (with "msgid as a unique, structured key")? Just asking, because I imagine moving to XLIFF will be a huge task, so if we are going to do it... why not go directly to what we want?

asmecher commented 5 years ago

But: IF we decide PO is the right move... is it too crazy to move directly to PO (with "msgid as a unique, structured key")?

If you can find a good, free translation toolset that supports what you're proposing (PO files that are structured using symbolic keys), I'm game. But so far I haven't found one, which means the translators would be left trying to translate some.locale.key into French rather than the actual English source text.

(Thanks for the info on Symfony -- they appear to use Loco for their translations, which presumably would support the locale key-based philosophy, but they aren't FOSS and their free account level wouldn't remotely support our needs -- so I'm hesitant to bank on them.)

marcriera commented 5 years ago

Hello, this is Marc Riera (the other Marc mentioned by @marcbria).

Sorry for jumping into the conversation this late; I would have appeared earlier, but I wanted to experiment with possible solutions myself before proposing them, and I was not able to do so until now.

Basically, given the existing format (strings in XML called by the program using IDs), the easiest solution (as stated some posts above) would be to use XLIFF. This provides three key advantages:

  1. Minimal changes (no need to embed text in the source code, very similar to current approach).
  2. Context-aware translation (identical strings with different IDs can be translated differently).
  3. Standard-compliant format.

The approach would be to use two types of XLIFF files: one for the "base" language (English) containing only the IDs and source text, and another for the rest of the languages, containing IDs, source text, and target text. This is used in a project I am part of, openBVE, which switched to XLIFF recently after years of using key=value text files (check https://github.com/leezer3/OpenBVE, assets/Languages; en-US.xlf is the base file).
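
To illustrate the two file types, here is a hypothetical unit in each (the key is borrowed from earlier in the thread; the English and French strings are illustrative):

    <!-- Base file (en-US): IDs and source text only -->
    <unit id="navigation-journalHelp">
        <segment>
            <source>Journal Help</source>
        </segment>
    </unit>

    <!-- Translated file (fr-CA): IDs, source, and target -->
    <unit id="navigation-journalHelp">
        <segment>
            <source>Journal Help</source>
            <target>Aide de la revue</target>
        </segment>
    </unit>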

I am completely aware of the potential issues already mentioned in this discussion, especially keeping all the translation files in sync between languages. Fortunately, there is a great piece of software that would do the hard work for you and allow anyone to contribute to the translation directly from their browser: Weblate (https://weblate.org). It needs to be hosted somewhere, but libre projects may be hosted free of charge under certain conditions (check https://hosted.weblate.org/hosting/ for more information).

I did some tests myself with openBVE and Weblate with excellent results. With repository access configured using an application password, the program adds new strings for translation when they are added in the repository, and pushes translations to the XLIFF files automatically as users translate. The base language file is used as a template, meaning that adding new strings to the base file is enough: they are automatically added for the other language files. There is even the possibility of editing the base file inside Weblate to add, edit and remove source strings, removing the need to directly modify the XLIFF files.

Taking all this into account, using XLIFF seems logical, even if it is not widely used as a final format and is technically an interchange format. Moving to the PO format in a "second stage" after the switch to XLIFF would then look like a step backwards (having English strings embedded in the source code would not provide any advantage, in my opinion).

Regards,

Marc

asmecher commented 5 years ago

@MarcRiera, thank you, that is very helpful. What you're proposing is my "stage 1" but without stage 2. It's good to know Weblate operates well in this mode; I've already confirmed that POEdit does as well, and since POEdit has integration with Crowdin, I suspect that'll serve too. This suggests to me that we may have a decent ecosystem of translation software to scout through for good GitHub integration.

@ctgraham, @NateWr, @marcbria, do you feel like we're getting towards a workable plan?

NateWr commented 5 years ago

Yes, I think XLIFF with IDs now, for better translation tooling, with a goal of eliminating some of the technical maintenance issues down the line:

If we can sort those two things out in the long run I'd be happy to keep id-based locale strings in the source code, rather than English.

asmecher commented 5 years ago

Ability to identify and remove unused strings.

There should be tools from the text extraction phase of the standard translation process (whatever fishes English-language text from the code for compilation into a .pot or .xliff file) -- I'll investigate these. We'll need to support both PHP and Smarty. I used to have a few homegrown scripts to help with this, but they were unreliable. I think the solution will involve some consistently-applied coding standards (e.g. never concatenate locale strings) as much as anything.
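
As a rough illustration, an unused-key check along those lines could look like this (hypothetical code, not an existing pkp-lib tool; it assumes keys appear literally in __('...') calls in PHP and {translate key="..."} tags in Smarty templates):

    <?php
    // Hypothetical unused-key detector: collect the keys defined in the English
    // locale files, grep the PHP/Smarty sources for uses, report the difference.
    // Assumes keys are literal (never concatenated), per the coding standard above.

    $defined = [];
    foreach (glob('locale/en_US/*.xliff') as $file) {
        $xml = simplexml_load_file($file);
        foreach ($xml->xpath('//*[local-name()="unit"]') as $unit) {
            $defined[str_replace('-', '.', (string) $unit['id'])] = true;
        }
    }

    $used = [];
    $files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('.'));
    foreach ($files as $fileInfo) {
        if (!preg_match('/\.(php|tpl)$/', $fileInfo->getFilename())) continue;
        $code = file_get_contents($fileInfo->getPathname());
        // __('some.locale.key') in PHP; {translate key="some.locale.key"} in Smarty.
        preg_match_all('/__\(\s*[\'"]([\w.]+)[\'"]|\{translate\s+key="([\w.]+)"/', $code, $matches);
        foreach (array_merge($matches[1], $matches[2]) as $key) {
            if ($key !== '') $used[$key] = true;
        }
    }

    foreach (array_diff_key($defined, $used) as $key => $unused) {
        echo "Possibly unused: $key\n";
    }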

Prevent "forgot to load" translation file errors (would be great if all locale strings were available or loaded in automatically somehow).

I've been thinking about this too. We originally had all translations compiled into a single XML for each language, and cached the XML to flat files using our current caching methodology. At the time we considered the cost of loading all strings into memory for each request to be prohibitive; on the one hand, memory is cheaper than it was, but on the other hand each page load now involves many requests and the system has grown more complicated (= more translations).

Glancing at my cache directory, the English text for OJS comes to about 326kb of written-out PHP arrays (which is pretty efficient, size-wise). The more I think about it, the more this pales in comparison with the overall system size -- and as an added benefit, these PHP files are going to be bytecode-compiled and cached by most PHP installations.
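
(For a sense of what that looks like on disk, such a cached locale file is just a returned PHP array; the file name and layout below are illustrative. Loading it back is a single include.)

    <?php
    // cache/locale-en_US-submission.php (hypothetical name): a written-out
    // array that PHP's opcode cache compiles once and reuses across requests.
    // Load with: $strings = include 'cache/locale-en_US-submission.php';
    return [
        'author.submit.submissionCitations' =>
            'Provide a formatted list of references for works cited in this submission. ' .
            'Please separate individual references with a blank line.',
        // ...thousands more entries...
    ];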

So technically I see benefits to going to a single locale file per language (well, per repository, since we'll still have pkp-lib and OJS and plugins to consider). But I'd like word from a few of our translators (@MarcRiera, your opinion is welcome!) on what the impact would be to translators. XLIFF and PO both have organizational tools to sort translations into categories, but I suspect using those would lead us back to the same situation as "forgot to load" errors give us -- e.g. for .po files, a translation in the wrong context is equivalent to a missing translation as far as the calling code is concerned.

marcbria commented 5 years ago

Ability to identify and remove unused strings.

It needs to be tested, but this add-on is supposed to do the job: https://docs.weblate.org/en/latest/admin/addons.html#cleanup-translation-files

Prevent "forgot to load" translation file errors (would be great if all locale strings were available or loaded in automatically somehow).

I also have been thinking about this for a while and I have doubts.

My first thought is "a project as big as OJS with a single file sounds like a bad idea".

Yes... memory is cheaper now, but thinking this way is bad programming, isn't it? I mean, this approach will make OJS more resource-hungry (right now it is really lightweight), so it won't be a problem for single installations or cached platforms, but think of virtualization or containers, where you won't be able to cache.

Apart from this, during development those single files will be touched by everybody at the same time, so I see a potential collision point here. And, most importantly... this change will also mean more work for "stage 1", so more things can fail... so why move in this direction if it won't completely fix the issue we are trying to address?

A structured approach based on folders (like we have now, with strings in common.xml if they appear in multiple folders) is much more efficient in resource usage, will give more context (to developers and translators), and will make the migration easier.

And in the end (please @MarcRiera correct me), developers don't need to worry much if they repeat a few strings in different translation files, because a translation server will help us find the matches and we can keep the strings in sync.

On the other hand, it's true that reducing the number of translation files will make developers' work easier (no need to grep to discover where to place a translation string), and adding 400k to our memory requirements doesn't look like a big deal (it will be less if we compile PO to MO)... and it will probably facilitate the migration to PO (if "stage 2" is still a requirement).

So I don't have a clear winner here, but I think a conservative approach is better: plan a "stage 1" that is as simple as possible and avoid the unification.

Cheers, m.

NateWr commented 5 years ago

Files don't necessarily have to be combined into one in order to load them automatically -- either at once or on-demand. For example, an index could link keys to files and they could be loaded when an unloaded key is requested.
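
A sketch of that idea (class and method names are illustrative, not an existing pkp-lib API):

    <?php
    // Sketch: a pre-built index maps each key to the file that defines it,
    // and files are only parsed the first time one of their keys is requested.
    class LazyLocale {
        private array $index;           // 'locale.key' => 'path/to/file.xliff'
        private array $loaded = [];     // files already parsed
        private array $strings = [];    // 'locale.key' => translated text

        public function __construct(array $index) {
            $this->index = $index;
        }

        public function get(string $key): ?string {
            if (!isset($this->strings[$key]) && isset($this->index[$key])) {
                $file = $this->index[$key];
                if (empty($this->loaded[$file])) {
                    $this->loaded[$file] = true;
                    $this->strings += $this->parseFile($file); // load on demand
                }
            }
            return $this->strings[$key] ?? null;
        }

        private function parseFile(string $path): array {
            $strings = [];
            $xml = simplexml_load_file($path);
            foreach ($xml->xpath('//*[local-name()="unit"]') as $unit) {
                $key = str_replace('-', '.', (string) $unit['id']);
                $target = $unit->xpath('.//*[local-name()="target"]');
                $strings[$key] = (string) ($target[0] ?? '');
            }
            return $strings;
        }
    }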

jonasraoni commented 5 years ago

I just saw this discussion, so I'll leave my two cents =]

The remaining problems will probably be resolved by adopting crowdin et al.

asmecher commented 5 years ago

@NateWr and @jonasraoni, on this idea...

Files don't necessarily have to be combined into one in order to load them automatically -- either at once or on-demand. For example, an index could link keys to files and they could be loaded when an unloaded key is requested.

Thinking this over, there are two approaches off the top of my head:

marcriera commented 5 years ago

And in the end (please @MarcRiera correct me), developers don't need to worry much if they repeat a few strings in different translation files, because a translation server will help us find the matches and we can keep the strings in sync.

On the other hand, it's true that reducing the number of translation files will make developers' work easier (no need to grep to discover where to place a translation string), and adding 400k to our memory requirements doesn't look like a big deal (it will be less if we compile PO to MO)... and it will probably facilitate the migration to PO (if "stage 2" is still a requirement).

Yes, it is always better to have duplicate strings than to try to save resources by calling the same string from different parts of the code. There are situations where a target language may need different translations depending on the context, so if everything were reused and such a situation arose, specific action by the developers would be necessary. In addition, CAT tools and translation platforms (such as Weblate) detect repetitions and similar strings, so it would be minimal effort for the translator.

NateWr commented 5 years ago

there are two approaches off the top of my head

I was expecting a third approach, which would be a script to pre-compile the index. I expected it to be a pre-commit hook, so that the index is generated and automatically committed whenever a change is required.

If we find that's too difficult to do for some reason, it could be run during packaging, with an alternate developer mode that would run without the index during development, similar to how our legacy JS files are compiled.

asmecher commented 5 years ago

pre-commit hook

That's totally do-able, but would exclude anyone working with translations outside of a git environment. Translation tweaks are a very frequent modification.