Closed deimidis closed 7 years ago
Adding @A-kilroy
Would love more details on this since I think there could be some overlap in what the foundation is working on to better integrate localizers. We might be able to piggy back off some of that work.
@a-kilroy thanks for jumping in, let me try to answer your question.
For the activate.mozilla.community website, we're using this repository, which also includes all texts that are displayed on the page. The original English source files are located directly in the _pages folder: https://github.com/mozilla/activate.mozilla.community/tree/gh-pages/_pages
For every localization there are the following steps to do:
Guillermo created a Pull Request with a template folder which can be copied to make it easier to add a new language. In hindsight, I'm not sure if that is helping us with this problem here. But let me first describe the translation problem.
As you see above, we have 2 different documents, EN and ES. Now there are two possibilities:
1) Somebody changes a link or something generic in the English document -> this person could change it in the other languages as well, as long as no specific language skills are needed.
2) There is new text, or a change to existing text -> the person changing it can only do it in EN (assuming the source always gets changed first).
In both cases, if any language skill is involved, a localizer will need to change the text in the language-specific file as well. As it is, the change in the source file is made as a normal Git commit, which by default does not notify anybody watching the repository. So currently there is no way to automatically notify all localizers that something changed and a re-translation should be done.
This is basically the same as if you have a Google Doc with the English text on it and send it out to somebody to translate, they most probably will copy the English one and write the text for the other language. If the English document gets changed, it won't notify the localizer about it.
One possibility would be to keep a list of people to notify, and to make whoever directly changes the text or merges a Pull Request responsible for notifying each person on that list. This seems cumbersome to me, but right now I don't really have any other suggestion myself. Another possibility would be to use Pontoon (the l10n tool Mozilla localizers already use), but I don't know how well this would work with full Markdown files instead of small strings. Maybe @mathjazz could enlighten us here?
In any case, we need to make sure that the locales are getting updated as well, not only the English source.
I hope that is a clear (even though very long) problem statement. Feel free to ask if something is not phrased clearly or if you spot any mistakes.
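To make the notification gap concrete, here is a minimal sketch of how changed English sources could be mapped to the localized files that may need a second look. The directory layout (English in _pages/<name>.md, translations in _pages/<locale>/<name>.md), the function name and the locale codes are assumptions for illustration, not the repository's confirmed structure. The list of changed paths could come from `git diff --name-only` between two commits.

```python
from pathlib import PurePosixPath

# Hypothetical layout: English sources in _pages/<name>.md and each
# translation in _pages/<locale>/<name>.md. Adjust to the real repository.
SOURCE_DIR = PurePosixPath("_pages")


def translations_to_review(changed_paths, locales):
    """Map each changed English source file to the localized files that may be stale."""
    stale = {}
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Only Markdown files directly inside _pages/ count as English sources.
        if path.parent != SOURCE_DIR or path.suffix != ".md":
            continue
        stale[raw] = [str(SOURCE_DIR / locale / path.name) for locale in locales]
    return stale


changed = ["_pages/webvr-camp.md", "_pages/es/webvr-camp.md", "README.md"]
result = translations_to_review(changed, ["es", "fr"])
```

The resulting mapping could feed whatever notification channel the team picks (an issue per locale, an email, a bot comment); that part is left open here.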
There are numerous reasons why embedding text directly in the code rarely works for localization purposes.
So I suggest you internationalize the site using one of the i18n libraries, which will create resource files. It's a one-time task that will take significantly less time in the long term than the current solution.
At the risk of oversimplifying it, an ideal scenario would be to integrate Pontoon, right? I believe this is something MoFo is already working on, and since they have several sites that use GitHub Pages I expect there could be some overlap/coordination opportunities. I was trying to understand what we'd like to do and the specific problem so that I can figure out the right people to talk to/connect. I think I understand the problem but not the ideal solution. And honestly, if it's not helpful I can drop it.
Yes, if the site is internationalized, we can plug it into Pontoon easily, which is used to localize most if not all Mozilla (MoFo & MoCo) websites. We have best practices and docs for this. The contact person for website localization is @peiying2.
Given this discussion, let's put a HOLD on integrating any more locales right now. Happy to see the conversation happening though!
On the other hand, there are several l10n plugins for Jekyll, like https://github.com/Anthony-Gaudino/jekyll-multiple-languages-plugin . That one handles the translations in .yml files, which would give us at least string-based translations. However, I'm not sure our localizers are used to .yml files. I haven't found a plugin that uses properties files, but there might be some.
@mathjazz what can you provide us to have Jekyll adapted to what you are suggesting? We don't have the resources to build something here on top of vanilla Jekyll.
Right now we have a couple of P1 languages we need to deliver where people just need to localize a markdown file, we can improve in the future ;-)
Let's do what we can to not fragment the l10n process/tool chain. This will only make it harder for the community to engage on these types of projects.
I agree, that's why we are asking for your help here :)
In the meantime, we know our current process is not perfect, but we wanted to deploy something fast and scrappy; we can improve as we go ;-)
One possible short-term plan may be extracting strings into your md files and converting those to xliff for use in Pontoon. It seems that there's already a utility out there that can perform that conversion -- https://github.com/tadatuta/md2xliff
The long-term strategy would be to convert everything over to HTML and use the l20n framework.
I see, we want to use markdown to allow non-technical people to add/update content to the site directly from the GitHub UI; that's the whole purpose of using Jekyll (plus the built-in GitHub Pages support).
I don't know if we can have markdown for English and then extract strings for other locales?
This short term plan allows you to continue using markdown, while using a standard localization format that Pontoon supports and that preserves the document structure.
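For readers unfamiliar with XLIFF, the following is a rough sketch of what extracting Markdown paragraphs into an XLIFF 1.2 file could look like. It is not how md2xliff works internally; the one-unit-per-paragraph split, the function name and the default locale codes are simplifying assumptions, and the skeleton handling that md2xliff relies on to rebuild the Markdown is left out.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def markdown_to_xliff(markdown_text, original, source_lang="en-US", target_lang="es-ES"):
    """Build a minimal XLIFF 1.2 document with one trans-unit per Markdown paragraph."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    # Assumption: blank lines separate translatable paragraphs.
    paragraphs = [p.strip() for p in markdown_text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs, start=1):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(index)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = paragraph
        ET.SubElement(unit, f"{{{XLIFF_NS}}}target").text = ""
    # The serializer escapes characters such as & for us.
    return ET.tostring(root, encoding="unicode")


xliff_text = markdown_to_xliff(
    "# Test Pilot\n\nTry new features & give feedback.", "test-pilot.md"
)
```

Building the file through an XML library rather than string concatenation means special characters in the source text end up escaped in the output.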
Cool, any guides on how we should provide the xliff files so people can use pontoon and how to integrate them back, thanks! :-)
I would experiment with the script I linked to in a previous comment to convert between the two formats. Once you're comfortable that there's no data loss, if you set up a strings repo with a directory per locale containing the xliff files, Pontoon only needs the URL to the en-US repo directory and can pull them in.
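A small sketch of the "directory per locale" idea: the check below reports, for each locale folder, which files exist in en-US but are missing locally. The root folder name, the .xliff extension and the function name are assumptions for illustration; the demo builds a throwaway tree instead of touching a real repository.

```python
from pathlib import Path
import tempfile


def missing_locale_files(locales_root, source_locale="en-US"):
    """Return {locale: [file names present in the source locale but absent here]}."""
    root = Path(locales_root)
    source_files = {p.name for p in (root / source_locale).glob("*.xliff")}
    missing = {}
    for locale_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if locale_dir.name == source_locale:
            continue
        present = {p.name for p in locale_dir.glob("*.xliff")}
        absent = sorted(source_files - present)
        if absent:
            missing[locale_dir.name] = absent
    return missing


# Build a throwaway tree to show the expected shape.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["en-US/test-pilot.xliff", "en-US/webvr-camp.xliff", "es-ES/test-pilot.xliff"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("<xliff/>", encoding="utf-8")
    report = missing_locale_files(tmp)
```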
From the foundation: https://github.com/MozillaFoundation/Advocacy/wiki/Localization:-How-it-happens-during-Copyright This is not exactly the same set up but maybe is helpful. I also believe they have something for their github sites though I can't find it on their wiki so might be worth reaching out to them.
@nukeador Converting MD to XLIFF and back to MD as @gueroJeff suggested sounds like your best bet.
You will have to convert MD to XLIFF every time there's a change in the source language, and also convert back from XLIFF to MD regularly so translations are deployed to production.
You can use this existing repository or a separate one for storing XLIFF files. See Section A of Pontoon docs for more details: https://developer.mozilla.org/en-US/docs/Mozilla/Implementing_Pontoon_in_a_Mozilla_website
@mathjazz I've done a quick test with the md2xliff tool and I have found two issues:
Ideas? :-)
What I managed to do was:
Discussed this in France with nikos - what are your thoughts @comzeradd ? We should aim to have the same system for Clubs site.
Yes, we also had a brief conversation with @nukeador about this. Since we have the requirement of keeping markdown, this limits our options, for instance using something like webL10n or the solution @a-kilroy posted above from the advocacy page.
I'm not very familiar with Pontoon, but why is md2xliff not good enough? Is it much of a problem that it doesn't produce diffs and re-creates the whole file?
We could have a script to provide diffs and recreate but I was wondering if this is something that Pontoon is able to handle.
To sum up, I see a few requirements here:
All of Jekyll's localization plugins and methods involve opening Pull Requests for localized content, which is not desired in our case.
I did some tests with md2xliff and, besides the issues with the metadata headers, it works nicely. So my suggested course of action would be:
locales folder in this repository to put the xliff files, in the structure the documentation suggests, and give write access to Pontoon.
@comzeradd I agree. What would you need?
@mathjazz @gueroJeff Is this something you can support us?
Thanks!
Sounds like a plan!
What matters for Pontoon is that files in a supported file format are available at the right place in the repository it can write to. And that's covered by the plan proposed by @comzeradd already!
I'm no expert in XLIFF files, but since it's a bilingual file format, I suspect every time a new en-US XLIFF file is generated, we'd also need to merge those changes into localized XLIFF files. There must be scripts that do this. I'll add @gueroJeff and @flodolo to comment on that (both of them are currently on conferences). Please note that Pontoon can work without this step, but your application might not.
Wait. Your app doesn't use XLIFF files directly, it uses MD files. So as long as the xliff2md script can create valid localized MD files, this step is not needed.
That's the last step I believe. Requirements under Section A need to be met first: https://developer.mozilla.org/en-US/docs/Mozilla/Implementing_Pontoon_in_a_Mozilla_website
That's basically steps 2 and 3 from your list.
Yes, good point. I started creating the locales files. One thing I'm not sure about is whether I should include the original (en-US) files too, since xlf files have a source and target locale anyway.
Yup, we need the en-US folder with original files.
To give you an idea, here's the (only) xliff-based project we currently localize: https://github.com/mozilla-l10n/firefoxios-l10n/
Thanks
I added the locales files, gave write access to the mozilla-pontoon bot and updated the bug :)
Thanks @comzeradd! Could you use the .xliff file extension?
🎉
Thanks!
XML parser is throwing an error:
Traceback (most recent call last):
File "/app/.heroku/python/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/newrelic-2.50.0.39/newrelic/hooks/application_celery.py", line 66, in wrapper
return wrapped(*args, **kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/app/pontoon/sync/core.py", line 59, in wrapped_func
return func(self, *args, **kwargs)
File "/app/pontoon/sync/tasks.py", line 226, in sync_translations
vcs_project.resources
File "/app/.heroku/python/lib/python2.7/site-packages/django/utils/functional.py", line 33, in __get__
res = instance.__dict__[self.name] = self.func(instance)
File "/app/pontoon/sync/vcs/models.py", line 291, in resources
resources[path] = VCSResource(self, path, locales=locales)
File "/app/pontoon/sync/vcs/models.py", line 413, in __init__
resource_file = formats.parse(resource_path, source_resource_path, locale)
File "/app/pontoon/sync/formats/__init__.py", line 44, in parse
return SUPPORTED_FORMAT_PARSERS[extension](path, source_path=source_path, locale=locale)
File "/app/pontoon/sync/formats/xliff.py", line 127, in parse
xliff_file = xliff.xlifffile(f)
File "/app/.heroku/python/lib/python2.7/site-packages/translate/storage/xliff.py", line 549, in __init__
lisa.LISAfile.__init__(self, *args, **kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/translate/storage/lisa.py", line 282, in __init__
self.parse(inputfile)
File "/app/.heroku/python/lib/python2.7/site-packages/translate/storage/lisa.py", line 358, in parse
self.document = etree.fromstring(xml, parser).getroottree()
File "lxml.etree.pyx", line 3103, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70569)
File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106403)
File "parser.pxi", line 1716, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:105194)
File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:99876)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94853)
XMLSyntaxError: xmlParseEntityRef: no name, line 145, column 165
Seems like &'s need to be escaped, e.g.:
https://github.com/mozilla/activate.mozilla.community/blob/gh-pages/locales/es-ES/webvr-camp.xliff#L145
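A quick way to catch this kind of problem before Pontoon syncs is to parse each file locally. The sketch below uses Python's standard-library parser rather than the lxml/translate-toolkit stack shown in the traceback, so the error message wording differs, but an unescaped & is rejected in the same way. The function name and sample strings are illustrative.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (line, column, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple.
        line, column = exc.position
        return line, column, str(exc)
    return None


broken = "<source>\nDesign & build</source>"
fixed = "<source>\nDesign &amp; build</source>"
error = first_xml_error(broken)
```

Here `error` points at the second line of the broken sample, while the fixed sample parses cleanly and yields None.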
Has anyone tried to go back to .md from these files, localizing a few random strings?
I only checked a couple of files: one looked fine, but others look full of unnecessary fragments. Example:
I was honestly expecting something else: no markup, just the text and some form of template to inject translations into. This seems really brittle, and given the size of the content, it would be great to do proper testing before asking people to work on it and potentially lose work.
Seems like &'s need to be escaped, e.g.:
@mathjazz I substituted & with &amp;. Could you check that this works?
Thanks
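A bulk substitution like this can be sketched with a regular expression that only touches ampersands which do not already start an entity, so running it twice does not turn &amp; into &amp;amp;. This is an illustrative helper, not the script that was actually used on the repository; the function and pattern names are made up.

```python
import re

# Match an & that is NOT followed by something shaped like an entity
# (a named entity, a decimal or a hex character reference, ending in ;).
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace unescaped ampersands with &amp; and leave existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


once = escape_bare_ampersands("Q&A &amp; more &#38; done &lt;")
twice = escape_bare_ampersands(once)
```

One caveat of this heuristic: text that merely looks like an entity, such as "a&b;", is left untouched, so a parse check afterwards is still worthwhile.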
Has anyone tried to go back to .md from these files, localizing a few random strings?
Yeap. But you need the skeleton files that md2xliff created to reverse the process properly. I can add them to the repo if this doesn't create any problem for the Pontoon bot (because they live inside the same folders as the xliff files).
what's this strange markup?
This is the way we add specific css classes and markup to the content. There is no way to avoid this if we want the reverse process of reconstructing the markdown files to work without someone having to spend a lot of time to manually adding markup code again. On pontoon we just have to copy this to the localized target.
@mathjazz I substituted & with &amp;. Could you check that this works?
It seems like some &s in locale files are not escaped yet:
https://github.com/mozilla/activate.mozilla.community/blob/gh-pages/locales/pt-PT/test-pilot.xliff#L129
BTW, for URLs you should probably use %26 instead of &:
https://github.com/mozilla/activate.mozilla.community/blob/gh-pages/locales/en-US/test-pilot.xliff#L133
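For reference, %26 is simply the percent-encoded form of &, which Python's standard library can produce. The helper name and the example URL below are made up for illustration.

```python
from urllib.parse import quote


def encode_query_value(value):
    """Percent-encode a query-string value so it contains no literal '&'."""
    # safe="" makes quote() encode every reserved character, including & and /.
    return quote(value, safe="")


url = "https://example.org/search?q=" + encode_query_value("research & development")
```

This fits an & that is part of a value. An & that separates two query parameters is a different case: percent-encoding it would merge the parameters, so inside an XML file that separator would be written as &amp; instead.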
Thanks. I made the substitutions in URLs in all locales.
Thanks @comzeradd!
I've successfully set the test project up on Pontoon stage server (the link will be broken in a few weeks from now): https://mozilla-pontoon-staging.herokuapp.com/fr/activate-test/all-resources/?string=159738
I was also able to make a test commit to the repository: https://github.com/mozilla/activate.mozilla.community/commit/f4aa015add804176a6b89c7fa7dffed111284842. It would be great if you could use the same whitespace as Pontoon, so the diff would be easier to read, but that's the lowest possible priority.
The next step could be for someone to review the original strings and see if we can simplify them as flod suggested. There are lots of markup and strings that don't need to be translated.
Everything looks ok. We indeed have some markup in there. One option would be to copy them to the localized side once it hits production, to make it easier for people to ignore them. If we remove them, then the reconstructing process would need much more manual work from someone from this team and would probably lead to slow updates on the localized content.
What are the next steps here?
I think we are good to move this to production Pontoon.
Thanks, @comzeradd!
Leaving it to the project management team. /cc @peiying2
Thanks everyone for brainstorming and finalizing a process so we can proceed.
I went through some of the strings and saw a need for an explicit list of instructions on which strings are for localization and which should be ignored. I need to compile this list and include it in my email communication to the localizers.
I would go even further: all strings that are supposed to remain identical should be pre-translated to avoid a mess, and reduce the amount of copy and paste for localizers.
This might give you some ideas https://github.com/mozilla-mobile/firefox-ios-build-tools/blob/master/scripts/update-xliff.py
I just pushed a commit to pre-fill all the strings that contain only mark-up. That will hopefully reduce the complexity for localizers.
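As an illustration of what such a pre-fill step could look like (not the actual commit), the sketch below copies the source into an empty target whenever a string contains nothing but markup. What counts as "markup" here, HTML tags and kramdown-style {: ...} attribute lists, is an assumption, as are the function names and the sample strings.

```python
import re
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}
# Assumption: "markup" means HTML tags and kramdown-style {: ...} attribute lists.
MARKUP = re.compile(r"<[^>]+>|\{:[^}]*\}")


def is_markup_only(text):
    """True when nothing alphanumeric remains after removing markup."""
    remainder = MARKUP.sub("", text or "")
    return not any(ch.isalnum() for ch in remainder)


def prefill_markup_units(xliff_text):
    """Copy source to target for markup-only units; return (new XML, units filled)."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.fromstring(xliff_text)
    filled = 0
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.find("x:source", NS)
        if source is None or not source.text or not is_markup_only(source.text):
            continue
        target = unit.find("x:target", NS)
        if target is None:
            target = ET.SubElement(unit, f"{{{XLIFF_NS}}}target")
        # Never overwrite a translation that already exists.
        if not (target.text or "").strip():
            target.text = source.text
            filled += 1
    return ET.tostring(root, encoding="unicode"), filled


sample = (
    '<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">'
    '<file original="test-pilot.md" source-language="en-US" '
    'target-language="fr" datatype="plaintext"><body>'
    '<trans-unit id="1"><source>{: .card-title}</source><target></target></trans-unit>'
    '<trans-unit id="2"><source>Try new features</source><target></target></trans-unit>'
    '</body></file></xliff>'
)
updated, count = prefill_markup_units(sample)
```

Only the first unit is pre-filled; the second contains real text and stays empty for the localizer.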
@peiying2 @mathjazz Can the strings go live now? We can also document the process for localisers here in github and elsewhere as needed.
LGTM.
@peiying2 ?
As @MichaelKohler suggested in another issue, we need to set up a communication to all the available locales whenever content is updated, so they can make the changes.