learningequality / ka-lite

KA Lite: lightweight web server for serving core Khan Academy content (videos and exercises) without needing internet connectivity
https://learningequality.org/ka-lite/

Fetch translated exercises #2729

Closed aronasorman closed 9 years ago

aronasorman commented 10 years ago

Right now we only support officially internationalized KA sites like fr, pt-br and es. We wanna expand that to any user-translated exercises like de. We should also fetch translations for the Perseus exercises.

jesumer commented 9 years ago

Hi @aronasorman,

How can I replicate this issue? And how does fetching translations work?

aronasorman commented 9 years ago

@jesumer the languagepackdownload management command is the file you're looking for!

cpauya commented 9 years ago

Hi, I'm attaching the email from @jamalex relating to this issue, along with the accompanying scripts for reference. As @aronasorman said, after due modifications we may be able to open-source these!

FYI: I can't attach *.py or *.zip files here, so I added them to Google Drive. Here's the link to the folder with the scripts: https://drive.google.com/open?id=0Bwf_-YL9LvlVN2lVX1hEeWRRRlE&authuser=0

Start of Email History:

From: Aron Asor
Date: Mon, Dec 20, 2014 at 2:10 AM
This seems pretty tricky. I'll go ahead and read their stuff, and I'll give 
an assessment in a bit on how hard it is to add this to our pipeline. 
Best case, we open-source this before they do!

---------- Forwarded message ----------
From: James Irwin 
Date: Mon, Dec 15, 2014 at 7:45 PM
Subject: Re: Internationalization tools for khan-exercises and Perseus
To: Jamie Alexandre
Cc: Ben Eater, Aron Asor, Richard Tibbles

Hi Jamie,

Sorry for delayed response.  Unfortunately the scripts to build the
translated version of the html files are pretty intertwined with the
rest of our compiler and it is not easy to remove.  I have attached
exercises/babel.py and kake/translate_exercises.py which uses
kake/translate_javascript.  We would like to refactor and open source
this at some point, but it's a substantial amount of work so we can't do
it soon.  With some reworking you should be able to use the
translate() function in translate_exercises to create the files for
the other languages.

In terms of translating perseus questions you have the right idea.  It
also has some trickiness though.  I've added our
assessment_items/models.py file which has a
traverse_natural_language_parts() method and an
assessment_items/i18n.py file which has a function
translate_serialized_assessment_item that will translate the fully
parsed item.data.  It also translates decimals automatically in case
translators have not done so.

Hope this helps and you can get it working for all languages.  Let me
know if you have any further questions here.

Best,
James

End of Email History

aronasorman commented 9 years ago

To expound on this, there need to be two translation systems, one for khan-exercises and another for Perseus. The difference stems from how the questions are stored.

For khan-exercises, the questions are stored in the static html files. My guess is that the way we wanna translate here is to replace the text statically and write that to the zip, or to a folder and then zipped up.

For Perseus, exercise questions are stored inside ka-lite/data/khan/assessmentitems.json. It's a JSON map, with each item's item_data containing the data for the question. You'll want to create a function that takes in the English assessmentitems.json and outputs a new JSON file that has been translated.

I'm not exactly sure what James' py files contain, but it should provide some insight on how their i18n pipeline works. We might wanna start with tackling the Perseus exercises, since that might be easier, and then open another issue for khan-exercises.
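The Perseus half of this plan could look roughly like the sketch below. It is not the code that ended up in KA Lite. It assumes each entry's `item_data` is already a parsed dict with a `question` dict and a `hints` list (if it is stored as a JSON string, it would need `json.loads` first). The function names and the `translate` callback are hypothetical; in KA Lite the callback would presumably be Django's gettext.

```python
import copy
import json


def translate_item_data(item_data, translate):
    """Return a translated copy of one parsed item_data dict."""
    translated = copy.deepcopy(item_data)
    question = translated.get("question", {})
    if "content" in question:
        question["content"] = translate(question["content"])
    for hint in translated.get("hints", []):
        if "content" in hint:
            hint["content"] = translate(hint["content"])
    return translated


def translate_assessment_items(in_path, out_path, translate):
    """Read the English item map, translate each item, write a new JSON file."""
    with open(in_path, encoding="utf-8") as f:
        items = json.load(f)
    translated = {
        item_id: dict(
            item, item_data=translate_item_data(item["item_data"], translate)
        )
        for item_id, item in items.items()
    }
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as the snowman readable.
        json.dump(translated, f, ensure_ascii=False)
    return translated
```

Strings the callback cannot translate should be returned unchanged, so that untranslated items still render in English.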

jesumer commented 9 years ago

Hi @aronasorman,

I have run the languagepackdownload command with "de" as the language, and it looks like something is wrong with the languagepackdownload process. As you can see, I have already downloaded the de language pack from https://learningequality.org to my local machine (see the screenshot). screen shot 2015-01-08 at 5 38 12 pm

The process should then have completed. As I understand it, it should have added a "de" folder at ka-lite/kalite/i18n/static/khan-exercises (see the screenshot). screen shot 2015-01-08 at 5 37 19 pm

aronasorman commented 9 years ago

Did you set German as your default language?

aronasorman commented 9 years ago

It should've been added into the ka-lite/locale directory. Since there are no translated exercises for German, there will be no de folder in khan-exercises.

jesumer commented 9 years ago

Hi @aronasorman,

Ah I see. Yeah I set German which is "de".

jesumer commented 9 years ago

Hi @aronasorman,

My updates: Cyril helped me understand how the "languagepackdownload" command works on the local central server and how the "update_language" command works on the local distributed server. On the distributed side, we spent some time figuring out how the "select language pack" dropdown works and why it was empty. We then figured out that it uses the static folder: the static folder should have a data folder containing the "language_pack_availability.json" file, which is used to populate the dropdown menu. The populate_installable_lang_pack_dd() JS function populates the dropdown from that JSON. I have not yet finished verifying and tracing this. I have also not yet finished running "python manage.py update_language_packs --no-dubbed --no-ka-trans --no-srts --no-exercises"; it is currently downloading. I will continue with the issue after that. Thank you.

jesumer commented 9 years ago

Findings with Cyril.

We looked into the contentload command on the distributed server and update_language_packs on the central server. We found that the JSON files (exercises.json, topics.json, etc.) in kalite/data/khan on the distributed server come from the Khan Academy API, which is not the case on the central server.

I searched for how assessment.json is produced and ended up at kalite/distributed/static/perseus/get_all_items.py, where I experimented with the URLs it uses, for example by changing the language code at the end.

We propose to create the app which is

This will be discussed with Cyril for further details about the proposal.

cpauya commented 9 years ago

Hi @aronasorman - as per our last talk with @jesumer, here are the things we need to do for this issue:

  1. At the central - use update_language_pack to download Khan Academy strings from Crowdin to build the "deutsch language pack".
  2. At the distributed - use languagepackdownload -l de to download the language pack
  3. Check which of the exercise templates / scripts need to be i18nized so that the translated strings are loaded.

jesumer commented 9 years ago

Hi All,

As of yesterday, I was able to replicate the issue and determine which strings in the Perseus exercise need to be internationalized, as seen in the screenshot. screen shot 2015-01-22 at 9 22 09 am

And here is where I plan to get the strings from the assessment item: screen shot 2015-01-22 at 9 19 38 am

The problem now is that the assessment item contains special characters, which I think are important. Here is the sample item JSON data: screen shot 2015-01-22 at 9 18 33 am

Cyril suggested putting "gettext()" into the template, so I am trying to figure out where that happens and where to find that template.

cpauya commented 9 years ago

@jesumer Let's take a look at the scripts that @jamalex forwarded as @aronasorman suggested.

Let's ditch the gettext() javascript suggestion I made earlier and see if:

  1. We can find how the Khan devs did it on their scripts;
  2. Integrate those into our codebase so we show the translated strings on the exercises.

jesumer commented 9 years ago

Hi All,

I was able to use the i18n.py script from James at Khan Academy. I noticed that it imports these modules:

  1. import api.jsonify
    • Used as jsonify.as_serializable(assessment_item) in the is_fully_translated() function.
  2. from intl import i18n
    • Used as i18n._(text) in the _maybe_translate() function, and as i18n.request_language_decimal_format() and i18n.format_decimal() in the _translate_number() function.
  3. import intl.regexps
    • Used as regexps.NO_NEED_TO_TRANSLATE.match(text).
  4. import intl.request
    • Used as request.locale_for_mo() and request.jipt_locale_for_mo().

Some of these functions are important, like i18n._(text), i18n.request_language_decimal_format() and i18n.format_decimal(). I need to know where I can get them. Thank you.

jesumer commented 9 years ago

Updates:

I pulled the latest develop branch and successfully ran the Perseus exercises. In assessment.json, I tried wrapping the whole thing like _(ASSESSMENTITEMS), but that doesn't work and raises errors because of special characters inside such as '\r\n', '\r', and '\n'. So I have now started writing code to make the assessment JSON translatable by wrapping the individual strings in the _() function.

Jesumer

jesumer commented 9 years ago

Updates and findings:

I looked for sample German strings that have been approved on Crowdin: screen shot 2015-02-02 at 4 10 22 pm and the same for French: screen shot 2015-02-02 at 4 42 33 pm

I set the language to German, and this is the output on our site: screen shot 2015-02-02 at 4 11 58 pm And for French: screen shot 2015-02-02 at 4 42 15 pm

I also tried looking the strings up in the Python shell. Here is the German sample: screen shot 2015-02-02 at 5 19 52 pm

And the French: screen shot 2015-02-02 at 5 15 38 pm

Then, after running "languagepackdownload" on the distributed server, I searched for the strings (e.g. "Create a picture graph to show how many teeth each student has lost.") in the mo files located at ka-lite/locale/de/LC_MESSAGES/ and ka-lite/locale/fr/LC_MESSAGES/, but the strings are not there. Is there anything I missed? Based on my understanding, the mo files from the central server should contain all the translated strings for the language we set. I also have Python scripts that wrap the untranslated strings from our assessment items in gettext or _().

Best, Jesumer
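A check like the one described here can be scripted with the standard library's gettext module instead of searching the .mo file by hand. This is only a sketch: the helper names are hypothetical, and the .mo file name in the commented usage is an assumption.

```python
import gettext


def load_catalog(mo_path):
    """Load a compiled .mo file with the standard library's gettext module."""
    with open(mo_path, "rb") as f:
        return gettext.GNUTranslations(f)


def has_translation(translations, msgid):
    """True if looking up msgid yields something other than msgid itself.

    gettext falls back to returning the msgid unchanged when there is no
    entry, so this also reports False when a translation happens to be
    identical to its source string.
    """
    return translations.gettext(msgid) != msgid


# Usage (the file name is an assumption):
# translations = load_catalog("ka-lite/locale/de/LC_MESSAGES/django.mo")
# has_translation(translations, "Create a picture graph to show ...")
```

gettext matches the entire msgid, including markup and newlines, so a fragment of a longer source string reports False even when the full string is translated.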

cpauya commented 9 years ago

Hi @jesumer - you must copy the whole untranslated string from Crowdin (not just a portion) and use that in your Python shell. Example based on your link above:

https://crowdin.com/translate/khanacademy/27617/enus-de

**Create a picture graph to show how many seeds Johnny Appleseed planted in each location.**\n\nLocation | Apple seeds \n- | :-: | -\nCharleston | $2$ \nLouisville | $6$ \nRichmond | $5$  \nSpringfield | $6$  \n\n![](https://ka-perseus-images.s3.amazonaws.com/a5b68232872a3e078a942f9c298e815b2a92f4e9.png)\n\n[[☃ plotter 1]]\n

cpauya commented 9 years ago

@jesumer I have just filed a related issue at https://github.com/fle-internal/ka-lite-central/issues/225 on the central repo.

Please verify if that indeed affects/fixes this issue.

jesumer commented 9 years ago

Hi,

We have already downloaded the translated po files on the central server, and we can replicate this in our distributed console and get the string translated. Here is the screenshot: screen shot 2015-02-04 at 9 00 33 pm

These are the po files from Crowdin, searched for the translated string ("Telling time without labels"): screen shot 2015-02-04 at 9 01 50 pm

The problem now is that we can't find the string ("Telling time without labels") in the browser on the distributed server. I think the assessment items JSON is the problem.

Here is the topic tree screen shot 2015-02-04 at 9 06 51 pm

And here is the result at the browser screen shot 2015-02-04 at 9 06 59 pm

Jesumer

jesumer commented 9 years ago

Hi,

I'm working on the multiple hints.

Jesumer

jesumer commented 9 years ago

Hi,

I'm done with the multiple hints.

Jesumer

jesumer commented 9 years ago

Hi all,

I have already refactored my script at get_assessment_item_data. We can now use it. Thanks

Jesumer

jesumer commented 9 years ago

Hi,

After running a Perseus exercise with the default language set to de, here is an example of my script working in the browser: screen shot 2015-02-05 at 2 48 20 pm

Notice the Answer pane on the right. It shows a string translated into German.

(Note that the Perseus exercises change randomly each time the page is refreshed, so just find any Perseus exercise that already has translated strings and test it in the browser.)

Jesumer

alani1 commented 9 years ago

DE is still working on the review and approval for geo; I recommend you use the early-math exercises for testing, since there you'll already have 95% of the strings translated and approved.

arceduardvincent commented 9 years ago

Hi @jesumer, I will review this issue and the PR you made so that this issue can be closed.

aronasorman commented 9 years ago

HOLD UP! No one close this issue yet.

MCGallaspy commented 9 years ago

@aronasorman what remains before closing this issue?

aronasorman commented 9 years ago

I just merged it to make it easier to test the dummy language packs. Once the tests are fixed for that PR we can merge that, and everyone can test i18n.

aronasorman commented 9 years ago

With both dummy language packs and perseus exercises merged in, we can now properly test this issue.

I created a dummy language pack by running: bin/kalite manage create_dummy_language_pack

I then switched to Esperanto (the name for the dummy language), and got this: screen shot 2015-03-18 at 6 04 20 pm

So the interface is partially translated. However, when I open the exercises, I don't see any translated strings: screen shot 2015-03-18 at 6 03 52 pm

aronasorman commented 9 years ago

It seems like the strings aren't in the fetched po file in the first place.

aronasorman commented 9 years ago

Ah, seems like there's no en language in Khan Academy, so it might not be a good base language pack after all.

aronasorman commented 9 years ago

String is in django.po, but it's not getting substituted in the exercise: screen shot 2015-03-18 at 6 25 05 pm

aronasorman commented 9 years ago

Works in the terminal though: screen shot 2015-03-18 at 6 26 41 pm

aronasorman commented 9 years ago

Turns out I just needed to refetch the assessment items. I'm now getting this: screen shot 2015-03-19 at 9 18 17 am

Notice the accented instructions on the right side, but unaccented strings on the question area.

aronasorman commented 9 years ago

Hints are translated: screen shot 2015-03-19 at 9 19 39 am

aronasorman commented 9 years ago

Ok, I think I found the issue. Django's gettext can't seem to find the translated string, even though they're in the po file and they look exactly the same.

See this entry in the po file, with accents and all: screen shot 2015-03-19 at 9 34 31 am

But when I go to pdb, ugettext can't find it: screen shot 2015-03-19 at 9 34 17 am

jamalex commented 9 years ago

Looks like it could be an issue of how the pieces are being chunked up? Are we using that code that KA sent us?

aronasorman commented 9 years ago

@jamalex Yeah, all the exercise-related strings are from Khan Academy. They might be using a different system for localization?

jamalex commented 9 years ago

Right, the strings are from them. But they shared code with us that parsed item_data into the translatable chunks or something, right? The po file strings don't just contain the entire contents of each item_data field, do they?

jamalex commented 9 years ago

Looks like we just do: answerarea_content = _(answerarea_content)

Is that what KA does?

rtibbles commented 9 years ago

How would that work when we have changed the URLs, btw?

jamalex commented 9 years ago

> How would that work when we have changed the URLs, btw?

Magic! https://github.com/learningequality/ka-lite/pull/3342/files#diff-7f19457f4d6f7fd73961c0d5d92a1dd5R132

aronasorman commented 9 years ago

I think the issue here is how python's gettext finds strings, as polib can find the exact same strings (and thus fetch the translations) while _ can't:

screen shot 2015-03-19 at 10 18 42 am

aronasorman commented 9 years ago

@jamalex we didn't get too much use out of KA's files, as we needed (i think) 5 more modules. I believe it's too tied to their code.

aronasorman commented 9 years ago

Took a bit of digging, but I finally found a dict that maps a string to its translated counterpart, as read by python's gettext:

from django.utils import translation
translation.activate('eo') # "eo" is the code for the DEBUG language.
catalog = translation.trans_real.catalog()._catalog

catalog is what you're looking for.

aronasorman commented 9 years ago

So, it turns out to be an encoding issue.

Django reads in PO files as UTF-8 encoded byte strings:

'**Which choices represent  the number $550$ ?**\n\n[[\xe2\x98\x83 radio 1]]'

However, we read in assessment items as unicode:

u'**Which choices represent  the number $550$ ?**\n\n[[\u2603 radio 1]]'
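The mismatch can be reproduced without Django. The sketch below is written for Python 3, where `bytes` and `str` stand in for Python 2's `str` and `unicode`, and it uses a plain dict as a hypothetical stand-in for the gettext catalog.

```python
# Key as it comes out of the PO file: UTF-8 encoded bytes.
po_msgid = b'**Which choices represent  the number $550$ ?**\n\n[[\xe2\x98\x83 radio 1]]'
# Key as it comes out of the assessment items JSON: decoded text.
item_text = '**Which choices represent  the number $550$ ?**\n\n[[\u2603 radio 1]]'

# Hypothetical stand-in for the catalog, keyed by the byte-string msgid.
catalog = {po_msgid: b'(translated question)'}

# The two values are the same text in different representations...
same_text = po_msgid.decode('utf-8') == item_text
# ...but a lookup with the decoded form misses the byte-string key.
found = item_text in catalog

print(same_text, found)
```

The strings look identical when printed, which is why the miss is easy to overlook in a debugger.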

The solution is to encode the untranslated strings as UTF-8 byte strings before passing them to the gettext function:

        question_content = _(item_data['question']['content'].encode('utf-8'))

aronasorman commented 9 years ago

And thus we get: screen shot 2015-03-19 at 11 26 54 am

So now the problem is the accenting module translating KA's pseudo-markup, breaking the exercise code.
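One way around this would be for the accenting step to skip anything that looks like markup. The sketch below is hypothetical: the `accent` function and the token pattern are stand-ins rather than KA Lite's dummy language pack code, and the pattern only covers the markup visible in this thread (`[[☃ widget n]]` placeholders, `$...$` math and `![](...)` images).

```python
import re

# Spans to leave alone: widget placeholders, inline math, image links.
PROTECTED = re.compile(
    r"(\[\[\u2603 [^\]]*\]\]|\$[^$]*\$|!\[[^\]]*\]\([^)]*\))"
)

# Hypothetical accent table: each plain vowel maps to an accented one.
ACCENTS = str.maketrans("aeiouAEIOU", "àéîöûÀÉÎÖÛ")


def accent(text):
    """Hypothetical stand-in for the accenting step."""
    return text.translate(ACCENTS)


def accent_preserving_markup(text):
    """Accent plain text while leaving protected markup spans untouched."""
    parts = PROTECTED.split(text)
    # re.split with one capturing group puts the matched spans at odd indexes.
    return "".join(
        part if i % 2 else accent(part) for i, part in enumerate(parts)
    )
```

Splitting on a single capturing group keeps the protected spans in the result list, so the text can be rebuilt in order without tracking offsets.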

rtibbles commented 9 years ago

Wooooot!

aronasorman commented 9 years ago

So I got this specific exercise translated:

screen shot 2015-03-19 at 12 38 30 pm

However, it looks like I'm gonna have to go through each type of exercise, find their structure, and translate them.

aronasorman commented 9 years ago

Just created a function that will automatically translate all types of questions \o/.

While waiting on #3342, I'll work on getting the other text translated. These are most likely related to backbone.js + djangojs issues.
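That function isn't shown in the thread. A generic version could be sketched as a recursive walk over the parsed item data that translates every string stored under a `content` key, whatever widget it belongs to. The key name, the function name and the `translate` callback are assumptions; the `.encode('utf-8')` step from the fix above applies to Python 2 and is left out of this Python 3 sketch.

```python
def translate_contents(node, translate, key="content"):
    """Recursively translate every string found under `key` in nested JSON data.

    Returns a new structure; the input is left unmodified.
    """
    if isinstance(node, dict):
        return {
            k: (
                translate(v)
                if k == key and isinstance(v, str)
                else translate_contents(v, translate, key)
            )
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [translate_contents(v, translate, key) for v in node]
    # Numbers, booleans, None and strings under other keys pass through as-is.
    return node
```

Because the walk does not depend on widget names, new exercise types would be covered without extra code, at the cost of translating any `content` string even where a widget might use that key for something other than display text.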