theCrag / website

theCrag.com: Add your voice and help guide the development of the world's largest collaborative rock climbing & bouldering platform
https://www.thecrag.com/

selection of internationalization framework for the server application #2208

Closed scd closed 8 years ago

scd commented 8 years ago

I have been doing some research into selecting a framework for server internationalisation in a Perl system. It comes down to a choice between two methodologies:

  1. Maketext: object oriented, with no wider support than Perl. The translator needs to know a small number of Perl idioms. There are also underlying flaws in the methodology.
  2. gettext: wide support outside of Perl, lots of tools etc. This is what Unix systems use, and the translation database can be found in the locale directory.

Maketext was built over 15 years ago to cater for deficiencies in the gettext project. I found this to be a reasonable summary:

http://rassie.org/archives/247

More Perl systems use Maketext, but there are strong recommendations not to use it. However, there are equally strong warnings about the performance and thread safety of gettext.

I have installed a pure Perl gettext module and have it working technically in a test app, subject to some performance testing (notes to self on what I have to be comfortable with).

@brendanheywood have you had any experience with internationalization?

@nicHoch I would like to align the app and web system translation databases so we can share a common translation database between the two. What standards do you use for the app? Are you using gettext? Are there any mapping tools?

brendanheywood commented 8 years ago

Yeah, Moodle has rolled their own, but the concepts are universal. These are the things I want in a system:

http://pootle.translatehouse.org/

scd commented 8 years ago

This is what I am hoping to use:

http://www.gnu.org/software/gettext/gettext.html

Apparently there are lots of tools. I just did a quick search and here is an example:

https://poedit.net/

nicHoch commented 8 years ago

We use gettext and Poedit in the WordPress plugin and it is working well.

Right now I use a JSON format in the app, but that is centralised and can be changed. There is a package for Node.js, https://www.npmjs.com/package/node-gettext, so it might work straight away or with some adjustments.

Sharing the same translation base would be very good. We should provide a common translation base for both projects (web/app).

Common: onsight -> ..., route -> ..., tick -> ..., Where to stay -> ..., Limestone -> ...

Web: specific page titles, messages, error descriptions ...

App: specific page titles, messages, error descriptions ...

scd commented 8 years ago

Note to self. This is what I am testing:

http://search.cpan.org/dist/libintl-perl/lib/Locale/TextDomain.pm

scd commented 8 years ago

Hmmm, having problems with the Perl package for this. There are two issues:

  1. Setting locales is not trivial on Debian.
  2. Even if we decide to battle through setting locales, the setlocale function is documented not to work in multi-threaded environments.

This is totally demented. The Perl package requires you to set the locale before doing the translation lookup, and setting locales can only be done for locales configured in Debian, see

https://people.debian.org/~schultmc/locales.html

Assuming we want to manage this process for each new translation language, there is a question mark over the performance of setting the locale and also a re-entrancy problem in multi-threaded environments.

Even if this worked, I think there are still some functional gaps for what I would like to see, so I am thinking that we roll our own.

Any problems with this decision?

scd commented 8 years ago

The overall work flows I want to see for internationalisation are:

  1. The developer adds tags to outputs in the code libraries and templates.
  2. The system self-reports on untranslated texts tagged in the code.
  3. There is a community tool where untranslated text can be translated and automatically loaded back into the system in real time.
  4. It works for both the app and the web system.

Bonus

  1. I would like to see the number of views per language so we know how to prioritise.
  2. I would like to see the number of translations for each text.

What I am thinking of implementing:

If we do the above then this system could also be used for route and area descriptions. I know this needs more discussion, but I see possibilities.

I have ignored site articles because I propose to move them into the node index and use our description framework. If we do this then there will be nothing to solve for translations for articles (both the url and content will work the same as area url and area descriptions).

I am ready to become really excited about this implementation and push hard. Please provide feedback on the general direction and we can enter into specific design decisions as I uncover particular issues.

I am getting started immediately - well going for a run, having lunch and helping Finley with his maths first.

I'm psyched :)

scd commented 8 years ago

Table structure (x3 tables)

SourceText

  • Context: mostly blank but can be used for specific structured data (eg menus)
  • MessageText: the text to be translated as it appears in the source code/template
  • Archived: flag for turning off texts that no longer need translating

TextTranslation

  • SourceTextID:
  • LanguageID: (we have a language table, so I think we should link to that)
  • TranslationText:
  • Archived: if the source text is archived then we archive the TextTranslation
  • Active: we should do versions, using the active flag to indicate which version we want to use. This is similar to the area descriptions, so we never actually delete a translation, just update it.

TranslationApplication

  • SourceTextID:
  • Label: for example: app, web, web.template, web.template.file. A translation may apply to multiple applications.
  • ContentHash: a content hash calculated over all the translation content. Any external application can work out if there are any changes based on the hash changing.

The developer will throw a wrapper function around all source texts in the system. The text will be used as the lookup key, and if it is missing it will be created. If any translations require a context then this will be included explicitly in the code.
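The wrapper-plus-auto-registration idea can be sketched as follows. This is a Python/SQLite illustration of the proposed schema; the translate function name and the sample strings are invented, not an agreed design:

```python
import sqlite3

# In-memory stand-in for the proposed SourceText / TextTranslation tables.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE SourceText (
    ID          INTEGER PRIMARY KEY,
    Context     TEXT DEFAULT '',
    MessageText TEXT NOT NULL,
    Archived    INTEGER DEFAULT 0,
    UNIQUE (Context, MessageText))""")
db.execute("""CREATE TABLE TextTranslation (
    SourceTextID    INTEGER,
    LanguageID      TEXT,
    TranslationText TEXT,
    Archived        INTEGER DEFAULT 0,
    Active          INTEGER DEFAULT 1)""")

def translate(text, lang, context=""):
    """Look up a translation, auto-registering unseen source texts."""
    db.execute(
        "INSERT OR IGNORE INTO SourceText (Context, MessageText) VALUES (?, ?)",
        (context, text))
    row = db.execute(
        """SELECT t.TranslationText FROM TextTranslation t
           JOIN SourceText s ON s.ID = t.SourceTextID
           WHERE s.MessageText = ? AND s.Context = ?
             AND t.LanguageID = ? AND t.Active = 1""",
        (text, context, lang)).fetchone()
    return row[0] if row else text   # graceful fallback to the source text

print(translate("Hello world", "de"))   # no translation yet -> source text
db.execute("""INSERT INTO TextTranslation (SourceTextID, LanguageID, TranslationText)
              SELECT ID, 'de', 'Hallo Welt' FROM SourceText
              WHERE MessageText = 'Hello world'""")
print(translate("Hello world", "de"))   # now translated
```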

As an aside, we can record stats of the usage of various text translations so we can archive translations that are no longer used. For example the web system may originally use the words 'Hello world' as a translation, but at some point the programmer changes it to "G'day". This means we no longer need to translate Hello world because it is no longer used in the source code.

The above example may be a bit trite, but it could be a sentence in an email which was changed by the programmer because there was something incorrect in the sentence.

The developer never has to touch TextTranslations. This is managed by translators using the UI. A translator may search for a list of untranslated texts for a particular language and then complete the translations. They can also update an existing translation they find is translated with an error. This will make the previous version inactive and create a new version.

Various other applications will need a database of the translations that apply to them. For example the mobile app should be able to get all translations that relate to it. The TranslationApplication table specifies which translations apply; for example the mobile app will need all translations that have been labelled 'app'.

A particular translation may apply to multiple applications (eg 'web' and 'app').

Furthermore it is worthwhile to specify a deeper level that a translation applies to, for example the web templates and the specific template. So a particular translation could have the following applications:

  • web
  • web.templates
  • web.templates.pagination

This means we could audit all the pagination translations or all the web template translations.

We can take this a bit further for javascript pages:

  • web
  • web.javascript
  • web.javascript.gympage

Now the gympage javascript has a couple of translations in it, so it can fetch the translation table for web.javascript.gympage as a single API call rather than a call for each translation.

If the developer is lazy and does nothing but write the English text in the source code wrapped with a translation function, then eventually this will make it to the translation table and be translated by the translators.

For a particular release we may want to pre-translate before the release. The developer then runs a script to register the texts in the server which are translated by the translators before the release.

The app development process will also have a similar registration process, so the app can be released with an up to date translation database.

But what happens if the German translator is away on holidays and the app's translations are not complete before the app release? We can either wait to release the app, or have the app look up the translation database on the server and get a new database if the hash changes.
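The hash-based update check can be sketched like this (a Python illustration; the translation data is invented):

```python
import hashlib

def content_hash(translations):
    """Hash every translation that applies to the app, in a stable order,
    so any change to that subset changes the hash."""
    h = hashlib.sha256()
    for key, text in sorted(translations.items()):
        h.update(f"{key}\x00{text}\x00".encode("utf-8"))
    return h.hexdigest()

app_v1 = {"action.tick": "Tick", "menu.routes": "Routes"}
app_v2 = {"action.tick": "Tick", "menu.routes": "Climbing routes"}

# The app stores the hash it shipped with; a differing hash from the
# server means a new translation database should be fetched.
print(content_hash(app_v1) == content_hash(app_v1))  # True - no update needed
print(content_hash(app_v1) == content_hash(app_v2))  # False - fetch update
```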

nicHoch commented 8 years ago

If a new key is introduced to the DB then it has no initial translation. For short labels like buttons the English translation is very close to the key: add.route: add new route.

A release should be blocked until an English translation is available for each key.

The default translation is always the current English version, so if there is no German translation for a key the English one is shown.

If the English translation of a key has changed it should invalidate all related translation in other languages.

We need an API endpoint as a translation summary that lists all available languages and some kind of version number. Then the app knows if an update is available. I would like to stay with static translation files in the app, so a new version number is not triggered by each single update but on milestones. The app calls the translation endpoint for a particular language and stores the result locally.
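A hypothetical shape for such a summary endpoint could look like this (a Python sketch; the payload layout and sample data are assumptions, not an agreed design):

```python
import hashlib
import json

def summary(db):
    """One entry per available language, with a version token derived
    from a hash of that language's content."""
    return {lang: hashlib.sha256(
                json.dumps(texts, sort_keys=True).encode()).hexdigest()[:12]
            for lang, texts in db.items()}

# Invented translation data standing in for the server-side database.
db = {"de": {"action.tick": "Abhaken"}, "fr": {"action.tick": "Cocher"}}
print(json.dumps(summary(db), sort_keys=True))
```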

Nicky


scd commented 8 years ago

If a new key is introduced to the DB then it has no initial translation. For short labels like buttons the English translation is very close to the key: add.route: add new route.

But we already know what the English default translation is, as it is in the system.

If the English translation of a key has changed it should invalidate all related translation in other languages.

This will happen by default

We need an API endpoint as a translation summary that lists all available languages and some kind of version number. Then the app knows if an update is available. I would like to stay with static translation files in the app, so a new version number is not triggered by each single update but on milestones. The app calls the translation endpoint for a particular language and stores the result locally.

Yup. The ContentHash will be used as a version number. I will hash all the content related to the app, so if any of it changes the content hash will change.

We need to think about the process of officially releasing a language. Is it a free-for-all, or do we wait for a committed translator? In other words, if somebody does 10% of the task in Italian, does this make the language available, or do we have an approval process? I think laissez-faire is better because it is motivating for people to become translators if they see it half translated. Anyway, this is a minor point.

scd commented 8 years ago

Also, there is no reason why the initial source text has to be English. It may well be, but it is not an assumption.

brendanheywood commented 8 years ago

Will have more to say once I'm back at my desk but quick summary:

MessageText: the text to be translated as appears in the source

I'd vastly prefer it if all lang strings used a key which isn't an English string; this is better long term and makes it easier to edit the English lang versions too.

Label: For example: app, web, web.template, web.template.file. A translation may apply to multiple applications.

I see this as not needed. The client should just ask for a set of keys and that's it; the API shouldn't need to approve or check the context.

Developer will throw a wrapper function around all source texts in the system. This will be used as the lookup key, and if it is missing it will create it.

See above, I would vastly prefer us to re-key each string as we migrate, which will reduce a lot of issues later. Eg you could have a word like 'Tick' which is used in two contexts, eg as a noun and a verb. When translated it needs to be two keys (eg 'action.tick' and 'ticktype.tick'). On the opposite side, as we migrate words we will find lots of similar redundant strings we should merge into one key, making the whole platform more consistent, eg names on buttons and menu items that refer to the same thing but are currently different. The key ideally should describe the semantics of its use, and we should come up with some simple consistent conventions for these.

Your example of 'hello world' -> 'g'day' is exactly why we should not use the current English strings as the keys. It's short term gain vs long term pain. Your way means that if the English string changes then every other language needs to change too. My way means that only the English string would change, in isolation. If the underlying semantics change then you simply create a new key.

so it can do an API call to get the translations

As above, I think the 'applications' of a key is a bit of an anti-feature and a hurdle with very little value. If you require the application(s) to be approved before use you add a hurdle; if you don't require it then it won't get used. We don't need this. Any page should be able to just specify that it needs a couple of extra keys for use in js, and then the strings get appended into the page for later use in js land. We would also have an API, but it shouldn't need an API call in almost all cases (for website js).

(I'm reading as I comment so some stuff is doubled up)

If lang keys are missing then default to English is fine. We should never release without an English string, but happy to release without waiting for other languages to be complete.
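The key-plus-English-fallback rule can be sketched as follows (a Python illustration; the string tables and key names are invented):

```python
# Per-language string tables keyed by language-neutral keys.
LANG = {
    "en": {"action.tick": "Tick", "action.add_route": "Add new route"},
    "de": {"action.tick": "Abhaken"},  # partial translation
}

def get_string(key, lang):
    """Look the key up in the requested language, fall back to English,
    and never crash on a missing key."""
    for table in (LANG.get(lang, {}), LANG["en"]):
        if key in table:
            return table[key]
    return key  # last resort: show the key itself

print(get_string("action.tick", "de"))       # translated
print(get_string("action.add_route", "de"))  # falls back to English
```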

nicHoch commented 8 years ago

We should support at least an app flag context (or the other way around: exclude in app). I guess the app keys would be a small subset. The app will not ask for a specific keyset on every page but for a complete set once (and again in case of an update).


scd commented 8 years ago

Whoops, I did not mean to close this issue before; I hope you did not read anything into that, as we are early in the discussion phase.

I'd vastly prefer if all Lang strings used a key which isn't an English string, this is better long term and makes it easier to edit the English Lang versions too

I reservedly disagree. We have got English already in the system and database, so we should just continue to use that rather than force it through a translator. For example, in the welcome email I would just want to keep the opening paragraph as:

"Welcome to www.thecrag.com, a community that is building the world's largest collaborative rock climbing database."

Rather than trying to think of some non-English key like emails.signup.opening_paragraph.

Email templates have logic in them just like web page templates so we cannot have different language versions of whole emails.

I see this as not needed. The client should just ask for a set of keys and that's it, the api shouldn't need to approve or check the context

I don't understand???

The mobile app will need to get a copy of all translations relevant to it, as the mobile app will surely be installing its own local translation database. This is not context. It has got nothing to do with the client asking for keys.

Eg you could have a word like 'Tick' which is used in two contexts, eg as a noun and verb

This is what the context variable is for. There are a couple of different scenarios we should develop policy for.

Mostly, single words like 'Tick' come from the database configuration. There is a well described context for these words, eg the table name. Verb and noun usages are never mixed in these contexts. Instead of something like action.tick, the context variable would be used, as in action,Tick.

Regardless of which way we go, we need to take care in how we develop either the context keys or your version of the non-English keys. It could quickly become like my filing system directory structure.

I think we should use a combination of the table names and row labels so we don't have to invent anything new.

You way means that if the English string changes then every other language needs to change too.

Yes, that is a drawback. While I could be convinced for menu labels and system config, I am not convinced for longer paragraphs like the signup email. I am wondering if a hybrid solution might be better, so we always pass in a non-English tag and the English version as the second argument.

What about something like the faceted search summary, which outputs in english something like:

Showing all 8 ascents.

I would use a key something like 'Showing all {count} ascents'. Actually it may be something like 'Showing all {count} {object}' to cover all facet types.
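The '{count}' placeholder style maps directly onto format-style templates, eg (a Python sketch; the key/template pairing and the German template are illustrative):

```python
# One template per language for the faceted-search summary.
TEMPLATES = {
    "en": "Showing all {count} {object}",
    "de": "Zeige alle {count} {object}",
}

def facet_summary(lang, count, obj):
    """Fill the language's template with the facet count and object type."""
    return TEMPLATES[lang].format(count=count, object=obj)

print(facet_summary("en", 8, "ascents"))
```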

Which also reminds me I forgot plurals in the design.

Let's consider three cases separately:

scd commented 8 years ago

Maybe we can take a step back and make sure we are talking about the same issues.

I have assumed that the translators would use the translation key as their input for translation. This is absolutely fine for the context plus English text solution. However it is not possible for keys like hello.user.role, as it would not be reasonable for them to translate based on this alone. I am guessing that you would expect the translator of non-English versions associated with this key to look at the English translation first.

As a developer, if I wanted to display a notice regarding the process or data I don't want to have to manage a separate registration system just to get this text into the system. It is fine for all the system config data where we already know all this, but for new stuff it could be a real pain.

I have no expectation that we will be releasing all features fully translated. If we paid translators then it would be great to be able to do this, but because we will have volunteer translators, we will be releasing without translations. This means there has got to be a graceful fallback to a default text.

There has been a general design philosophy in a lot of internationalisation systems I have read about: keep the impact on the code for the developer to a minimum, doing things like __ "hello world" (just adding the double underscore in front of texts). For various reasons this may be an older paradigm which is no longer appropriate. I am happy to break from this philosophy.

As you say, having a pure English key is problematic with changing texts. Texts don't change that often, but changes could leave things unnecessarily broken.

I am now thinking that a hybrid system is the way to go, to get the benefits of both worlds. Something like

gettext('facet.summary',count=>$count,original=>'Showing all {count} ascents')

If 'facet.summary' was pre-registered then the original parameter would be optional.

The message key table would look slightly different. Instead of having a context it would have a lookup key which would permanently tie any translations. It would also have the original text in a separate column.

I am happy this would work for system config (eg 'tick') and for standard texts in the system.
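The hybrid lookup could be sketched like this (a Python illustration of the proposed semantics; the registry dict and sample translations are invented stand-ins for the message key table):

```python
REGISTRY = {}   # lookup key -> original text (the message key table)
TRANSLATIONS = {"de": {"facet.summary": "Zeige alle {count} Begehungen"}}

def gettext(key, lang="en", original=None, **params):
    """Resolve a stable message key; register the original text on first
    sight and use it as the fallback when no translation exists."""
    if original is not None:
        REGISTRY.setdefault(key, original)
    template = TRANSLATIONS.get(lang, {}).get(key) or REGISTRY.get(key, key)
    return template.format(**params) if params else template

print(gettext("facet.summary", lang="de", count=8,
              original="Showing all {count} ascents"))
print(gettext("facet.summary", lang="fr", count=8))  # falls back to original
```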

Not so sure it would work for system emails, mainly because I don't know how to come up with a meaningful message key. For example I am not sure if this is meaningful:

gettext('email.signup.welcome_to', original => "Welcome to www.thecrag.com, a community that is building the world's largest collaborative rock climbing database")

The problem here is that it is texts like this that are most likely to change. So maybe the above is totally reasonable.

We have not discussed how area descriptions are going to be translated. Nicky has proposed something that has been implemented, but we never discussed this as a final solution.

Could we use the new translation system for translating area descriptions? I think it is possible, using the node id as part of the message key.

gettext('area.description.12345',original => $node->{description})

This could be flagged specially so that access to the translation is via bulk edit rather than the translation system. This means that it does not matter that there is an id in the message key.

brendanheywood commented 8 years ago

I reservedly disagree. We have got English already in the system and database, so we should just continue to use that rather than force it through a translator. For example, in the welcome email I would just want to keep the opening paragraph as: "Welcome to www.thecrag.com, a community that is building the world's largest collaborative rock climbing database." Rather than trying to think of some non-English key like emails.signup.opening_paragraph.

emails.signup.opening_paragraph is exactly what you need, and I bet it only took you about a second to come up with it, and it clearly states the exact semantics of what it is. So what's the downside here? This is how all lang frameworks I've used in Java / js / PHP work. The reality is that you have to touch the code anyway to add the wrapper function, so there is no real time saving in not doing it. And the downsides of not using language-less keys are significant.

Email templates have logic in them just like web page templates so we cannot have different language versions of whole emails.

Exactly, this is an argument for breaking it up into chunks of text and keying each properly. On top of that, much of the language used in an email may be shared across emails, so those chunks / keys will be refactored, or things like a button label in the email will share a key with a button used in the web; again we get consistency across the platform.

At the end of the day translating emails is exactly the same as translating the web or the app, and if it needs to be considered differently then this is a smell that the proposed system isn't sufficient.

Once this whole process is complete you should look at a template and not see anything in English. For reference, here is an example email template from Moodle, showing the clear separation of template logic (in mustache) from language strings: {{# str }} parent, forum {{/ str }}

https://github.com/moodle/moodle/blob/master/mod/forum/templates/forum_post_email_htmlemail_body.mustache#L117

Rather than thinking of this whole exercise as translating from English to whatever, I think it much more useful to think of it as a refactoring process: removing all duplicated text and moving it into a single source of truth, removed from and independent of the templates. Keeping it DRY. By keeping the raw English in the templates you are shooting yourself in the foot for later and losing all the benefits.

Just to drive home this point, let's consider this simple string in the emails:

minor/supporterFooter-6-
minor/supporterFooter-7-https://www.thecrag.com/donate
minor/supporterFooter-8-
minor/supporterFooter:9:You may also find your account profile page here:
minor/supporterFooter-10-
minor/supporterFooter-11-https://www.thecrag.com/climber/<% lc $data->{login} %>
--
Supporter-Monthly-Cancel-29-
Supporter-Monthly-Cancel-30-https://www.thecrag.com/donate
Supporter-Monthly-Cancel-31-
Supporter-Monthly-Cancel:32:You may also find your account profile page here:
Supporter-Monthly-Cancel-33-
Supporter-Monthly-Cancel-34-https://www.thecrag.com/climber/<% lc $data->{login} %>
Supporter-Monthly-Cancel-35-
--
Supporter-OnceOff-Charge-40-
Supporter-OnceOff-Charge-41-https://www.thecrag.com/donate
Supporter-OnceOff-Charge-42-
Supporter-OnceOff-Charge:43:You may also find your account profile page here:
Supporter-OnceOff-Charge-44-
Supporter-OnceOff-Charge-45-https://www.thecrag.com/climber/<% lc $data->{login} %>
Supporter-OnceOff-Charge-46-

Let's say you want to do it your way; then as part of this process you end up with something like:

minor/supporterFooter:9:get_string('You may also find your account profile page here:')
Supporter-Monthly-Cancel:32:get_string('You may also find your account profile page here:')
Supporter-OnceOff-Charge:43:get_string('You may also find your account profile page here:')

Now let's consider a tiny tweak where we instead want it to say 'Your public profile page is here:'. Do we A) change it in all 3 places? Now the translators have work for no reason. Or B) leave it as is and just add an English string which overrides it, which now means that what you read in the template confusingly doesn't match what you end up with in the email? Neither of these is a good situation and we want to avoid both.

Another scenario: as a coder, because you are looking at an English sentence, let's say you don't search everywhere for it, assume it's unique and just change it in one place. Now we have diverged and split the string: the translators need to add a second string, and we've reduced consistency across the platform. We completely avoid this by using proper keys.

I am wondering if a hybrid solution might be better. So we always pass in a non-English tag and the English version as the second argument.

No, this would be a combination of the worst attributes of both approaches.

Which also reminds me I forgot plurals in the design.

Plurals should be handled already just by using separate keys.

assign/lang/en/assign.php-427-$string['subplugintype_assignfeedback'] = 'Feedback plugin';
assign/lang/en/assign.php:428:$string['subplugintype_assignfeedback_plural'] = 'Feedback plugins';

This is something where maketext conceptually diverges from gettext, but in the process it added a lot of complexity for marginal value.

True there is some loss of flexibility, as some languages have much more complex grammar which would mean you really need a bunch more keys, but practically it's easy to work around this and keep it simple.

brendanheywood commented 8 years ago

I have assumed that the translators would use the translation key as their input for translation. This is absolutely fine for a context-plus-English-text solution. However it is not possible for keys like hello.user.role, as it would not be reasonable for them to translate based on this alone. I am guessing that you would expect the translator of non-English versions associated with this key to look at the English translation first.

You should look at the translation tools available. You could have a matrix with both the key and a list of other languages all together, so you can use all of them to guide it.

There has been a general design philosophy in a lot of internationalisation systems I have read about, to keep the impact on the code for the developer at a minimum.

I wholeheartedly agree with this. Using a function called _() or just lang() will help a lot here.

For example I am not sure if this is meaningful: gettext('email.signup.welcome_to', original => "Welcome to www.thecrag.com, a community that is building the world's largest collaborative rock climbing database")

Agree this isn't useful, it should just be like:

gettext('email.signup.welcome')

We have not discussed how area descriptions are going to be translated. Nicky has proposed something that has been implemented, but we never discussed this as a final solution.

There is a massive conceptual difference between the interface translation and the data translation. From the mason templates' point of view, the data in the templates, eg markdown, route names, tick names, tag names, should already be in the right language, so they would never call gettext() on any data. There are a couple of things that initially might seem like interface but are actually data; your example of tick types is one, as this is not 'code'.

We should very clearly draw a line in the sand between these two and never blur them.

scd commented 8 years ago

Looking at it as a refactoring exercise is probably a good way of thinking about it.

Reading between the lines of your comments, you are not a fan of the hybrid idea of providing an optional original text to initially register the string.

What is the workflow for me when I am writing a new system email? Say I come up with

gettext('email.template.opening_paragraph')

as the key.

What is the process of getting that string into the development and production database?

brendanheywood commented 8 years ago

Well this depends a lot on how we hand-roll it. In most projects, eg taking moodle again as an example, all en lang packs are source code and under version control in git. So converting an old English string is just moving text from one file to another.

Moodle has a very separate online tool which the translators use to create the language packs, which are available via an API, and moodle can grab them on the fly as needed. Moodle is sorta like your hybrid idea, in that in code it has both the key and the English version, but the crucial difference is that the English is refactored into the lang pack file and not left in the original template. Other projects use similar tools but store all the lang packs in git side by side. The latter is much more common.

But I think you want to stick it into the DB instead, so not really sure. A few options come to mind:

a) we still put everything into git; the online tool simply works on dev, reads into the DB and then writes these files back from the DB. So the effect is instant, but only on dev. This also means people can tweak and refine and iterate and finally get it right, and then only release the translations at each release. This is how I'd say the vast majority of projects work and it gives them a lot of assurance over what is being released. Ultimately I think we will end up with a small team of fairly intimately involved translators, so this may be very feasible. Live editing lang strings is a massive vector for abuse and accidents, but easily solved too. eg in moodle lang strings are just a normal array in a php file, so you get all the code file compiling and caching for free.

See: https://docs.moodle.org/dev/AMOS_manual#Translation_workflow

b) we create a new separate online tool with its own DB; both dev and prod load from it, but we'd probably have some caching so we'd just need to clear the prod cache after a translator has done their thing. This would give us a pseudo release process, but be independent from the code releases.

A left field thought, things like tick types, are sort of in this gray area between code and data. We can easily push it either way, ie we could use get_string('ticktype.'.$ticktype) so we don't need to touch the tick type tables to add a lang column, and all the lang pack editing is done in one place instead of doing some in files, and some in the db (although the lang edit tool could hide this). It may be easier to force more things into the 'code' camp than the 'data' camp, leaving only real data for a different process (eg node names, descriptions etc)

scd commented 8 years ago

Ok, now I think we are getting a lot closer. Workflow has always been my biggest issue here. I think we will be able to close in on a particular model. Can we use parts of both processes you suggest?

Registering keys and initial language translation is done in GitHub. I am comfortable with doing this in a separate lang file like Moodle. This covers my developer workflow issues.

I still think there are really strong workflow advantages to having online editing of translations with its own minimal release procedure. I want to give somebody like Ulf the ability to do a whole lot of translation editing and release it without involving me.

We will need a process to load the developer lang files into the online tool. This can be a command-line tool which reads the lang files and uses the online tool interface to upload the new language keys and English translations. I think it would be important for the English translations to be read-only in the online tool, so there is only one source of truth and that is on GitHub.

Overall process flow:

If you are happy with the overall process flow then we can talk about implementation specifics.

A left field thought, things like tick types, are sort of in this gray area between code and data. We can easily push it either way, ie we could use get_string('ticktype.'.$ticktype) so we don't need to touch the tick type tables to add a lang column, and all the lang pack editing is done in one place instead of doing some in files, and some in the db (although the lang edit tool could hide this). It may be easier to force more things into the 'code' camp than the 'data' camp, leaving only real data for a different process (eg node names, descriptions etc)

Yes I was about to raise this. Let's call this database config. My original thoughts were to have these in the database. Having looked at the internationalisation methodologies, there are strong pragmatic reasons to just use whatever we decide to go with for code internationalisation. This has only been a recent change in thought processes for me.

If we were to implement this in the DB only then it would also complicate the API, which would have to return the particular language. I think it is far easier to just assume the configuration API fields are language agnostic and managed in the mobile app via the language database rather than the API.

The good thing about the config fields is they come from known tables and have known rows. They all have dual fields of 'Name' and 'Label' (the latter being the language-agnostic version). Unfortunately there is no policy about how the label is constructed (eg "Top rope" versus "top-rope"), however we could map all the labels to a language key. We could run a process which automatically creates the language key and initial English version.
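That label-to-key mapping step could be sketched as follows (Python purely for illustration; the lang_key helper and the 'ticktype' prefix are hypothetical, not existing code):

```python
import re

def lang_key(prefix, label):
    """Normalise a config label ("Top rope" vs "top-rope") into a
    stable, language-agnostic key such as "ticktype.top-rope"."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"{prefix}.{slug}"

# Both label spellings collapse to one key, so the initial English
# text can be seeded automatically from the existing 'Name' field.
print(lang_key("ticktype", "Top rope"))   # ticktype.top-rope
print(lang_key("ticktype", "top-rope"))   # ticktype.top-rope
```

The point of normalising is that the inconsistent label spellings all land on a single key, so the automatic seeding process stays idempotent.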

In the end the difference between the database config and true code keys will be how they make it to the online editing tool. The true code language keys come by running a script loading lang files stored in github, while the database config come in by running a script that looks up the database.

This leaves us to manage real user data in the database proper - url stubs, route names and area descriptions.

brendanheywood commented 8 years ago

Ok that all sounds good - if we are happy with language stuff being baked into code, and releasing it is effectively like a hot patch the same as we do for quick article tweaks or minor template tweaks then I'd recommend we:

  • build the editing tool into thecrag itself, not as a separate tool. We get authentication, roles and all the other good stuff for free.
  • We make a 'translator role' and give it to selected people. If they are in this role then if the template tries to render a string which hasn't been translated, instead of rendering the fallback in english it would render a big red square, and you can edit it inline. Would make the translators job very easy, just go to a page, look for all the awful red bits and fill them in. It gives them a lot of context but they would also go to a more raw interface to see all the keys and strings to see the edge cases they've missed.
  • all the tool does is read and write files on disc, there is no DB involved at all. We just have to make sure this is pretty solid, does file locking etc
  • after a translator has added a bunch of stuff, we just commit it, and it gets released as normal, or we can do a hot patch. Either way we are doing the same as normal as this is pretty solid process.
  • we can flexibly change work flow, and let people do this in production. We just need to merge the files back in the other direction so they aren't lost next upgrade.

scd commented 8 years ago

build the editing tool into thecrag itself, not as a separate tool. We get authentication, roles and all the other good stuff for free.

+1

We make a 'translator role' and give it to selected people. If they are in this role then if the template tries to render a string which hasn't been translated, instead of rendering the fallback in english it would render a big red square, and you can edit it inline. Would make the translators job very easy, just go to a page, look for all the awful red bits and fill them in. It gives them a lot of context but they would also go to a more raw interface to see all the keys and strings to see the edge cases they've missed.

nice

What about online forms, javascript notices, etc? Maybe we can consider the context of where the information will be presented. If it is in a select list then we default to English. In other words, control this via an optional parameter.

all the tool does is read and write files on disc, there is no DB involved at all. We just have to make sure this is pretty solid, does file locking etc

Just to confirm: you are strongly tied to getting all language translations back into GitHub. This is why you do not want a DB.

We have a number of options here, but I think I favor JSON files. We are doing so much with JSON formats already. Maybe a structure something like:

translations
translations/en
translations/en/email.signup  # manually managed by developer
translations/en/templates.facets
translations/en/database.config  # output of loading keys from database config
translations/en/api # stuff from the api developers
translations/de
translations/de/email.signup # created by online tool - dups keys from the english version
translations/de/templates.facets
...

The file could be a simple json hash { "email.signup.welcome_paragraph": "Welcome to...", ... }

Reading the files will be a simple matter of reading all files into a perl hash which will have keys like "en:email.signup.welcome_paragraph"

I think we should just be able to read these into memory, but if we cannot, we could probably tie the hash to a Berkeley DB fronted with memcache. This will be pretty fast. This is what I am doing with the node cache, so it is well proven on our system.
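A minimal sketch of that read step, using the file layout proposed above (Python purely for illustration; the function name and demo files are assumptions, not existing code):

```python
import json
import tempfile
from pathlib import Path

def load_translations(root):
    """Read translations/<lang>/<domain> JSON files into one flat
    hash with keys like "en:email.signup.welcome_paragraph"."""
    strings = {}
    for path in sorted(Path(root).glob("*/*")):
        lang = path.parent.name                     # "en", "de", ...
        for key, text in json.loads(path.read_text()).items():
            strings[f"{lang}:{key}"] = text
    return strings

# Demo with a throwaway copy of the proposed layout.
root = Path(tempfile.mkdtemp())
(root / "en").mkdir()
(root / "en" / "email.signup").write_text(
    json.dumps({"email.signup.welcome_paragraph": "Welcome to..."}))
strings = load_translations(root)
print(strings["en:email.signup.welcome_paragraph"])   # Welcome to...
```

Because the whole tree collapses into one in-memory hash, swapping the backing store later (Berkeley DB, memcache) only changes how the hash is tied, not how lookups are written.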

after a translator has added a bunch of stuff, we just commit it, and it gets released as normal, or we can do a hot patch. Either way we are doing the same as normal as this is pretty solid process.

So we will have to put git on prod?

we can flexibly change work flow, and let people do this in production. We just need to merge the files back in the other direction so they aren't lost next upgrade.

+1

So we are really really close. I think Nicky will still want to only load translations that have something to do with the app. For example he will not want all the system email translations.

Maybe the memory of uploading everything is trivial anyway, or he can select the files he wants to upload to avoid the emails.

What about versioning, so the app knows to upload a new translations file? If we need versioning then we can probably work out ways of managing this using the file structure.

brendanheywood commented 8 years ago

What about online forms, javascript notices, etc?

We won't ever get it all, but for most template stuff this would be quick and convenient.

Just to confirm: you are strongly tied to getting all language translations back into GitHub. This is why you do not want a DB.

I'm not particularly tied to GitHub at all; really, whether it is in git or not is completely independent. From the system's point of view it is just a file, the same as the articles, the templates and the email templates etc. The edit system will have no direct visibility of, or dependence on, git or GitHub.

I'm not against it being in the DB, but I just don't see any point and it adds complexity for no benefit. If we throw it into the DB then we have to build our own versioning on top of it inside the DB (which I know is a semi-solved problem, but still).

I think we should just be able to read these into memory, but if we cannot we could probably tie the hash to a berkley db frontended with memcache.

Is there much of a downside to just using perl files natively as the format like moodle just uses raw php? Translators will not be editing these directly so can't mess it up. And it's super simple and dirt fast, we don't have to worry about json + bdb + memcache and all these layers that just don't directly add any value. Just need to be a tad careful with file locking etc

This might seem like a downside for the app, but the app (or JS land) will only ever grab the lang packs via the api which would export into the format that was needed, ie into the standard java lang attributes format.

Maybe the memory of uploading everything is trivial anyway

The number of keys / strings will always be finite and not that big. Only the 'data' stuff will expand forever. I suspect the worst case of the app pulling in the entire lang pack for everything will be smaller byte-wise than a single topo image. Having just a couple of files, eg 'core', 'email', 'app', should be more than sufficient granularity.

So we will have to put git on prod?

git is on prod. Or do you mean that prod is a checked-out code repo vs being rsynced? In which case yes, but we can also handle this in other ways, and we can cross this bridge much later. Anyway I much prefer translators working in dev where they can break stuff and refine it incrementally.

scd commented 8 years ago

Is there much of a downside to just using perl files natively as the format like moodle just uses raw php? Translators will not be editing these directly so can't mess it up. And it's super simple and dirt fast, we don't have to worry about json + bdb + memcache and all these layers that just don't directly add any value. Just need to be a tad careful with file locking etc

I have done this a lot in the main system, creating perl files from DB data. I created too many and it caused a bit of a memory problem, because of the way perl modules were loaded. We are not going to be creating that many files, and I guess I am just showing my scars. So I'm happy to do it as native perl.

nicHoch commented 8 years ago

get_string('ticktype.'.$ticktype): +1 from me. We do it the same way in the WordPress plugin. We have to separate data and translation as much as possible.


nicHoch commented 8 years ago

Very good. Especially the inline editing.


nicHoch commented 8 years ago

I would propose a separate file format like CSV as the base persistent format. In native Perl it is hard to encode (or at least it is a waste of memory) metadata like comments, author, used in app, etc. Then we can compile it automatically after editing.


scd commented 8 years ago

At the moment I am playing around with perl files which look like this

'some.key' => 'Some text associated with the key'

In other words, just a straight perl hash.

An equivalent JSON file would look like this:

{
  "some.key": "Some text associated with the key"
}

It is pretty marginal, and I don't think the native constructs for loading perl files are any simpler than loading JSON files. I am happy to go with either of these. Probably not CSV though.

Do we want to record who updated the translation using the online editor?

If so then this could make the process of saving updated hashes using the online editor a little more complex, because I don't want to parse comments. We could do something like

"test.key": {
  "text": "this is the updated string",
  "epoch": 1234567,
  "who": "scd",
  "history": [["scd", 1234, "original string"]],
  // other metadata
}

nicHoch commented 8 years ago

History will come on its own via git.

Another property (optional):

"plural": "$1 ascents"

brendanheywood commented 8 years ago

We won't get history from git as the person who entered the data online may not be the same as the person who later commits the file. That said I don't think we need it.

Plurals shouldn't be an extra attribute, it should be a whole separate key / string. Same logic applies to 'No ascents' or similar

nicHoch commented 8 years ago

The decision about which phrase to use when 0, 1 or many items are present should come from a single method like:

getText_Count(key, count)

Therefore the method should be able to select the right translation depending only on the key and the actual number.

Otherwise you have to do the very same logic on every template over and over again:

if count_ascents > 1 then getText(key_many, count)
else if count_ascents == 1 then getText(key_one, count)
else getText(key_none, count)

In a gettext context, often an "n" variant function such as ngettext is used.
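For reference, that is exactly the shape of gettext's plural API. A hedged illustration using Python's stdlib gettext module (chosen only because it's easy to demo; when no catalogue is found it falls back to the English two-form rule):

```python
import gettext

# With no catalogue available we get a NullTranslations fallback,
# whose ngettext applies the English rule: singular iff n == 1.
# A real .mo catalogue supplies each language's own plural formula.
t = gettext.translation("messages", localedir="/nonexistent", fallback=True)
for n in (0, 1, 4):
    print(t.ngettext("%d ascent", "%d ascents", n) % n)
# prints: 0 ascents / 1 ascent / 4 ascents
```

The caller passes both forms plus the count once, and the catalogue, not the template, decides which phrase to use, which is the single-method behaviour described above.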

scd commented 8 years ago

I think both of you are right. From a keys perspective we store keys:

ascent.summary: You have 1 ascent
ascent.summary.plural: You have {count} ascents

But in the code we have

getText('ascent.summary',count=>$count)

@nicHoch in your example you had three versions, including one for zero. Do we want an empty-list version as well:

ascent.summary.empty: No ascents

The third one seems useful but non-standard?

brendanheywood commented 8 years ago

Go and read the documentation for maketext and you'll see that this is many orders of magnitude more complex than you think. Different languages have different grammar, and in English we happen to have special cases for zero, 1 and many, but other languages have rules depending on, say, the gender of the person speaking or the gender of the person being spoken to, or much more complex rules still.

These are grammar cases, and English has lost almost all of its cases over time, so we are mostly ignorant of them: https://en.wikipedia.org/wiki/Grammatical_case

What we want to avoid is building into the API of our language system any assumptions about the grammar of any particular language. gettext takes a pretty lazy, ad hoc approach and assumes you've already handled edge cases via utility functions that have sorted out any kinks before you pass the data into the language system. eg moodle falls into this camp and provides utilities for format_time() etc. ie its general philosophy is to not worry about it and do simple workarounds as each new edge case crops up. It's not elegant but it's simple and pragmatic.

maketext takes the approach that all of this should be part of the language system itself; its existence is due to gettext not handling so many cases. So it provides the utility functions for you:

http://search.cpan.org/~toddr/Locale-Maketext/lib/Locale/Maketext.pod#Utility_Methods

maketext's approach is technically the more correct one (but still imperfect), but it means that the translators are not working with a string, but instead are defining a function. This adds a lot of complexity, which is ok if coders are the ones doing the translating but adds a high barrier to entry for non-coders. It is also a pain in the arse because the function is defined in perl, so it's not portable.

A much better solution would be to define the function in a simple but sufficiently expressive and portable templating language (maybe handlebars?), which you can then compile down into perl / js / java as needed. Then most strings just look like strings, but the expression power is there if you need it to do crazy grammar logic when you need it. But then you also get a bit of a performance hit as well.

I'm not particularly wed to any approach, as usual I'm much more interested in the end result, and the ease of use for the translators, than the system internals. Perhaps we should do a prototype using handlebars, in particular getting it to compile down to native perl / java, do some performance testing, and see how viable it is.

The real test for this will be when we tackle translating the complex sentences we've built up in the stream event summaries. Perhaps that should be our proof of concept test for each potential solution.

Also another thing which we've skirted around the edges here is that whatever system we use should be able to call other lang strings, or templates.

ie if I'm in a facet page for routes I'd just want to be able to say get('facet.routes.count', $size) and for its template to internally call another template which handles the logic around negative, zero, 1, or many. Something like:

"hello.world" => "Hello world!" <-- example showing that simple strings are still simple
"facet.routes.count" => "There are {{> util.number,
    size => size, singular => 'route', plural => 'routes' }}"

"util.number" => "
    {{size == 0 ? 'No {{plural}}}}
    {{size == 1 ? 'One {{singular}}}}
    {{size > 1 ? size + {{plural}}}}"

NOTE: That above is pseudo code, not handlebars, I'm not even sure if handlebars is expressive enough, we may need to find a more expressive template language if want to explore this path.

scd commented 8 years ago

Have you read Rassie's critique of maketext and gettext?

http://rassie.org/archives/247

I want to avoid complexity for the translators. What is wrong with having the following convention for translators:

"facet.routes.count.one" => "There is one route"
"facet.routes.count.zero" => "There are no routes"
"facet.routes.count.many" => "There are {count} routes"
"facet.routes.count.negative" => "WTF"

and in code you just call get('facet.routes.count',$size). The get function will work out whether to lookup facet.routes.count, facet.routes.count.zero, or facet.routes.count.many keys.

The output will be exactly the same and it is far simpler for translators. The translator will still have to know what '{count}' means, but I don't think we can get away from having variables in the text.
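A sketch of that convention (Python purely for illustration; the string table and get() helper are hypothetical stand-ins for the real perl implementation):

```python
STRINGS = {
    "facet.routes.count.zero": "There are no routes",
    "facet.routes.count.one":  "There is one route",
    "facet.routes.count.many": "There are {count} routes",
}

def get(key, count):
    """Pick the .zero/.one/.many variant from the count, so templates
    only ever call get('facet.routes.count', size)."""
    suffix = "zero" if count == 0 else "one" if count == 1 else "many"
    return STRINGS[f"{key}.{suffix}"].format(count=count)

print(get("facet.routes.count", 0))    # There are no routes
print(get("facet.routes.count", 1))    # There is one route
print(get("facet.routes.count", 12))   # There are 12 routes
```

The suffix logic lives once in get(), so translators only ever see plain strings with an occasional {count} placeholder.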

brendanheywood commented 8 years ago

Yeah I did read that ages ago.

I totally agree that get('facet.routes.count',$size) is the ideal API for a template to call. The question here is: are we solving this specific case for numbers, where the grammar in most languages calls for different words based on the number involved, OR are we coming up with a solution that can be used to flexibly solve whole classes of situations like this, where the resulting text is based on any number of attributes, such as number size, time, gender, location, rank etc

Solving this particular issue with the size of numbers is pretty trivial, almost anything could work sufficiently. Instead what I'm asking is whether we want to invest the energy to get a proper solution in place which can handle the entire class of this type of translation issue. If we do, then we need to use an expression language. If we don't then we can stick with simpler strings and put a few smarts into the get() function to workaround the most common quirks like numbers, time deltas, gender.

If we can find a simple but expressive enough language, that is also performant enough, then I'd prefer to go down that route as it will mean all the quirks we find later with translations can be solved in one spot in the language expressions themselves, and not in a bunch of utility functions which will need to be ported to every language we use (which will be at minimum perl, js and java).

Consider a very simple (for a human) sentence which could turn up in a stream:

"Brendan was at Arapiles yesterday and he ticked 4 routes"

The vocab used in this sentence is dependent on subject gender, action time, and object quantity. Do we want to have 3 utility functions to handle each of these cases and their various permutations? Or do we just want a flexible enough language system that it can handle all of this without resorting to any workarounds?

At the moment we have tons of English specific word logic to handle this tied up in both mason templates as well as a few perl utility functions. If we want to translate this, or even just port it to JS / Java without translating it, then we need to untangle it cleanly into one portable expression language.

scd commented 8 years ago

Good idea, let's make sure we can solve the hard stuff. Where did your example come from?

I think the stream event summaries are a good place to start. I have taken 5 examples from a tick event in the stream today (names replaced by myself)

These represent fairly good coverage of the template code.

In the template the summary text is built by combining a list of action components (eg 'lead a route').

So the overall summary looks like this:

event.tick.actions.summary = '{person} {actions} at {place}'

This can be called in the template using get('event.tick.actions.summary',person=>$person,actions=>$actionstext,place=>$place)
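Those {person}-style placeholders map naturally onto named-parameter substitution. A minimal illustration (Python for brevity; the string table and get() helper are hypothetical):

```python
STRINGS = {
    "event.tick.actions.summary": "{person} {actions} at {place}",
}

def get(key, **params):
    # Look up the key and fill the named {placeholders} from the
    # caller's parameters. A missing parameter raises, which makes
    # broken translations fail loudly rather than silently.
    return STRINGS[key].format(**params)

print(get("event.tick.actions.summary",
          person="Brendan", actions="ticked 4 routes", place="Arapiles"))
# Brendan ticked 4 routes at Arapiles
```

Because the placeholders are named rather than positional, a translator is free to reorder them for their language's word order.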

Creating a sentence out of a list is something that we are going to need over and over again. Interestingly there is a built in ruby on rails function for this, called to_sentence. Maybe we could do something similar:

list.to.sentence.single => '{item}'
list.to.sentence.last => '{list} and {last}'
list.to.sentence.join => ','

In the code we can create the actions text using get('list.to.sentence',@actions).
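A minimal sketch of such a to_sentence helper, driven by the list.to.sentence.* entries above (the get() stub and inline string table are stand-ins for the real translation lookup, and the join string including a trailing space is an assumption):

```python
# Hypothetical string table mirroring the list.to.sentence.* keys above;
# the real values would come from each language's translation database.
STRINGS = {
    'list.to.sentence.single': '{item}',
    'list.to.sentence.last': '{list} and {last}',
    'list.to.sentence.join': ', ',
}

def get(key, **params):
    # Stand-in for the real translation lookup with substitutions.
    return STRINGS[key].format(**params)

def to_sentence(items):
    """Join a list into a human-readable sentence using translatable
    templates, so each language controls its own list punctuation."""
    if not items:
        return ''
    if len(items) == 1:
        return get('list.to.sentence.single', item=items[0])
    head = STRINGS['list.to.sentence.join'].join(items[:-1])
    return get('list.to.sentence.last', list=head, last=items[-1])

print(to_sentence(['lead a route', 'top roped 2 routes']))
# lead a route and top roped 2 routes
```

Because the joiner and the final connector are themselves translation entries, a language that uses a different conjunction or different list punctuation needs no code changes.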

This then leaves the template to work out the text for each action. Currently the template does this with an if statement for different cases which would tie into the following translation functions:

action.tick.generic => 'went climbing'
action.tick.historical.one => 'logged an ascent from the past'
action.tick.historical.many => 'logged {count} ascents from the past'
action.ticked.one => '{ticked} a route'
action.ticked.many => '{ticked} {count} routes'

So this works without creating too much complexity in the translations and is close to a drop in replacement in the template.
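In this model the template's if statement collapses into choosing a key suffix. A sketch, again with a hypothetical in-memory string table standing in for the real lookup:

```python
# Hypothetical string table mirroring the action.ticked.* keys above.
STRINGS = {
    'action.ticked.one': '{ticked} a route',
    'action.ticked.many': '{ticked} {count} routes',
}

def ticked_action(count, ticked='ticked'):
    # The template's case logic reduces to picking .one vs .many;
    # the wording itself stays entirely in the translation entries.
    key = 'action.ticked.one' if count == 1 else 'action.ticked.many'
    return STRINGS[key].format(ticked=ticked, count=count)

print(ticked_action(1))  # ticked a route
print(ticked_action(4))  # ticked 4 routes
```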

I understand that putting more logic in the translation system could lead to something like get('event.summary',$event), but we really don't want to be pushing this level of logic to the translators.

Maybe there is a compromise where we define some helper functions which could be accessed via the API or in perl land. For example we could implement a getEventSummary($event) helper function.

brendanheywood commented 8 years ago

I understand that putting more logic in the translation system could lead to something like get('event.summary',$event), but we really don't want to be pushing this level of logic to the translators.

Maybe there is a compromise where we define some helper functions which could be accessed via the API or in perl land. For example we could implement a getEventSummary($event) helper function.

Yes and no: I agree it's better if we don't push super complex stuff onto a translator, but it may still be better technically, functionally and for portability if it's done inside the language templates where it conceptually belongs. The small number of more complex language templates are something that we ourselves help set up for each language, and then the translator fills in the rest. So the compromise isn't a technical one but a business-process one.

scd commented 8 years ago

I am so not convinced. Let's use this work with Ulf to establish some use cases and best practices. Part of the issue is that I have no idea of sentence construction in other languages, and some of the experts are saying not to do it this way.

I think we need hard examples in consultation with translators. My gut is saying that translators should not be working with if statements and for loops. If this is required, then it should be in the realm of the developers, in which case there may be a case for a richer set of constructs that we use natively in perl.

On a side issue, I think that displaying currency, dates and nice numbers will not be done in language translations but rather natively.

I just don't know and am not keen to spend extra time building functionality that I am not sure we need.

95% of the translations will be the simple case, which will be the same regardless. So let's press on with the simple model for now.

scd commented 8 years ago

I am happy with using getText as our main function in the code. I don't want to go with anything more obscure such as . If we were using full English texts as keys I would have preferred the function.

This means we will have something like

  $title = getText('templates.main.title')

If we want a count then I am suggesting something like

  $something = getText('templates.something.summary',{count=>1})

Where the hash ref means substitute keys for values in the resulting string. I put it in a hash so we can still pass options using our standard technique. Eg

  $something = getText('templates.something.summary',{count=>1},missingError=>1)

In the above example missingError tells getText to return an error class linking to the translation UI. This will not be appropriate in all cases, but it is typically the same from call to call, so it will be cached for the next call.
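A sketch of that getText signature, with the hash ref of substitutions separated from the trailing options (the in-memory string table and the error-class markup are both assumptions, stand-ins for the real translation database and UI hook):

```python
# Hypothetical in-memory string table; the real one lives in the
# translation database.
STRINGS = {'templates.something.summary': 'You have {count} new ticks'}

def getText(key, subs=None, missingError=False):
    """Look up a key and substitute {placeholders} from the hash ref.

    missingError is an option, not a substitution: when set, a missing
    key returns an error class linking to the translation UI
    (sketched here as a span with an assumed class name)."""
    template = STRINGS.get(key)
    if template is None:
        if missingError:
            return '<span class="i18n-missing">%s</span>' % key
        return key
    return template.format(**(subs or {}))

print(getText('templates.something.summary', {'count': 3}))
# You have 3 new ticks
```

Keeping the substitution hash and the option flags in separate arguments means the substitution keys can never collide with option names like missingError.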

I am also thinking that the 'count' replacement hash could run the number through a localised helper function for numbers, and similarly for something like dates.

Anyway I am just starting a brain dump for discussion purposes.

scd commented 8 years ago

Should we be doing the terms and conditions? There are a couple of documents that are versioned, which means if they change then the version changes. If we do language translation then the versioning will not be kept in sync across the different languages.

We need to think of a way of managing this. The T&C need translations, and I want to use our translation tool to do it.

What I am thinking of is using an md5 hash of the English text as the key. One hash per paragraph. For example

terms.3403fes393943eac

This could be generated automatically, with old keys removed from the English translations.
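A sketch of that key generation (the paragraph text is made up, and truncating the digest to 16 hex characters is an assumption purely for readability):

```python
import hashlib

def terms_key(paragraph):
    # One key per T&C paragraph, derived from its English text; any
    # edit to the paragraph yields a new key, so stale translations
    # drop out automatically when the keys are regenerated.
    digest = hashlib.md5(paragraph.encode('utf-8')).hexdigest()
    return 'terms.' + digest[:16]

key = terms_key('You must be a competent climber or climb under supervision.')
print(key)  # terms.<16 hex chars>
```

Note the trade-off this thread goes on to discuss: a content hash invalidates the key on every edit, which is exactly what makes diffing against a previous version harder than with stable section names.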

brendanheywood commented 8 years ago

Yes, but I don't see much difference between it and other strings; it is just bigger. I like the idea of breaking it up to make it easy to manage, but I would key it on each section name and not a sentence hash, for the same reasons we don't want to use English as the key. There are always going to be strings we consider critical and which must be translated in order to consider a translation viable. T&C isn't something we'd want to wait for the community to chip in and fill out; we will need to ask specific translators to ensure it and other critical stuff is done as part of a release.

It may be useful to show translators the diff with a previous prod version to help them quickly update with minimum work.

On a semi-related topic, when the terms change it would be really nice for users to see the diff from what they previously agreed to, so this makes me think this use case is more like crag data than versioned code. More websites are doing this as their terms grow massively big. The system needs to be able to access all versions from all time, not just the latest.

brendanheywood commented 8 years ago

Maybe a pragmatic hybrid could be a version number in the Lang key, eg 'policy.siteusage.about.v3'

We'd still use the Lang tool to translate, but it would have the smarts to recognise the version suffix and show diffs from the previous version. We'd do a one-time export of all previous versions as we migrate them into the Lang system and cut them up into smaller chunks.

brendanheywood commented 8 years ago

And of course getString would only grab the latest version. The presence of a version number would be a perfect heuristic to present to translators all the critical strings to translate first.

And following on from that, it would be good to log calls to getString so we can prioritise translations of the most used strings.
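The version-suffix heuristic is cheap to implement. A sketch of how getString could pick out the latest version of a versioned key (the key names are taken from the example above; the function name latest_key is made up):

```python
import re

# A key is "versioned" if it ends in .v<number>, eg policy.siteusage.about.v3
VERSION_RE = re.compile(r'\.v(\d+)$')

def latest_key(base, keys):
    """Return the highest-versioned key for the given base, or None."""
    best = None
    for key in keys:
        m = VERSION_RE.search(key)
        if m and key[:m.start()] == base:
            version = int(m.group(1))
            if best is None or version > best[0]:
                best = (version, key)
    return best[1] if best else None

keys = ['policy.siteusage.about.v1', 'policy.siteusage.about.v3',
        'policy.siteusage.about.v2', 'policy.privacy.v1']
print(latest_key('policy.siteusage.about', keys))
# policy.siteusage.about.v3
```

The same regex doubles as the heuristic for flagging critical strings to translators: any key matching it is versioned policy text.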

scd commented 8 years ago

Yes we have to do all versions in all languages. But we probably don't have to back populate old versions when we get a new language translator.

I will have a look at the database fields. Ideally I don't want to have to do anything to get T&C into the translation files. It should all be automatic. I think there are clause numbers in the database.

scd commented 8 years ago

I have implemented an 'xx' language for debugging. If you select the xx language (eg via a url parameter) then every getText call returns 'xx'. This is a really good way to work out which texts still need to be included in the translation database.
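The idea is simple enough to sketch in a few lines (the string table and default-to-key fallback are assumptions for illustration):

```python
# Hypothetical string table entry for illustration.
STRINGS = {'templates.main.title': 'theCrag'}

def getText(key, lang='en'):
    # The 'xx' pseudo-language short-circuits every lookup, so any text
    # that still appears normally on the page must be hard-coded in a
    # template rather than routed through the translation database.
    if lang == 'xx':
        return 'xx'
    return STRINGS.get(key, key)

print(getText('templates.main.title', lang='xx'))  # xx
print(getText('templates.main.title'))             # theCrag
```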

brendanheywood commented 8 years ago

But we probably don't have to back populate old versions when we get a new language translator.

That's not quite what I meant; this was only to help with showing a diff to either the end user accepting the terms, or to the translator to help them amend the latest terms. It won't help right now with the initial translation, but it will help when we next update our terms and need to quickly get that pushed out to all other languages.

scd commented 8 years ago

It won't help right now with the initial translation but will help when we next update our terms and need to quickly get that pushed out to all other languages

Yes we will have to work out that work flow efficiently. We won't be able to release changes in T&C without all translations complete. Internally the T&C is managed as a series of clauses in a versioned document.

scd commented 8 years ago

I am closing this one because I think we have selected the overall framework.