Khan Academy - Githubissues

Popolechien commented 3 years ago

Website URL: https://www.khanacademy.org/
License: CC-by-nd-sa
Desired ZIM Title: Khan Academy
Desired ZIM Description: Build a deep, solid understanding in math, science, and more.
Desired ZIM Icon –png (URL or attach one): https://yt3.ggpht.com/a-/AN66SAxtrcGVSpJ8pDKArK8XL2OdgdntPDkH_L6FMA=s900-mo-c-c0xffffffff-rj-k-no
Language (ISO 639-3): There are a lot of languages available, the scraper will have to be able to differentiate.
Desired Main Page (homepage, if different from website URL): tbd

iharshit009 commented 3 years ago

Hi, I would like to work on this issue. Could you please elaborate on it, how do I need to proceed?

saumyaborwankar commented 3 years ago

Hello everyone, I got a hold of what needs to be done. Let me know if this is the right direction. So I have familiarized myself with the codebase for https://github.com/openzim/youtube Does Kiwix wants to do the same for Khan Academy videos?

kelson42 commented 3 years ago

The plan is to prepare the zim with kolibri2zim, therefore the challenge is to build a kolibri sushi chef recipe able to import the videos in all languages in kolibri.

saumyaborwankar commented 3 years ago

So just for my understanding, the plan is to make recipe like https://github.com/learningequality/sushi-chef-pradigi which can sync all the videos of khan academy to kolibri and kolibri2zim will make the zim file for the required course?

kelson42 commented 3 years ago

@saumyaborwankar Exact

rgaudin commented 3 years ago

No, as clearly stated on the gsoc page, the ultimate goal is to produce ZIM files for all Khan Academy languages.

Obviously, we don't want to start from scratch if we can leverage an existing project. We know there are offline packages of Khan Academy but we don't know them correctly. The first step is thus to find about all of those and document their strengths and limits ; including whether those support all KA languages or not (in the past, different languages were based on different platforms but that might have changed). We know that Kolibri have a few Khan Academy channels and we already have a kolibri2zim tool (although very new) but at this stage, we're not locked into the Kolibri scenario.

shashwat1002 commented 3 years ago

I had a few questions:

Is the organization of the zimfile to mimic the organization of KhanAcadamy completely? I ask this question because, as someone residing in India, I see the content for school organized by grades and not by topics specific to the Indian curricula and I'm not sure whether that is the case for users from other countries. However, the default organization of content is also visible to me.
Am I to assume that practice problems and such aren't required to be covered in this?
The Zimfile should have both the text modules and the videos right?

rgaudin commented 3 years ago

@shashwat1002, I have no strong opinion on this. I believe ZIM files should mimic the online version if possible. So Indian version might be by grades and other language by topics. But as long as it's easily accessible, I don't think that matters much.

No we want to cover as much as possible. ZIMs are offline and stateless so we can't save progression for instance. Even cookies are to be limited as users frequently share browsers. But we do want to support everything that can be with a reasonable effort.

Absolutely, videos are to be included as well.

TheFenrisLycaon commented 3 years ago

Is this still open to work?

rgaudin commented 3 years ago

Yes

RafayGhafoor commented 3 years ago

@rgaudin, What's the requirement to get participated in this project for GSoC, any mini-task that needs to be done?

TheFenrisLycaon commented 3 years ago

I'd like to take up this on for GSoC. Let me know how do we proceed.

rgaudin commented 3 years ago

Just follow the GSOC procedure for now

oshin94 commented 3 years ago

I too am interested to work on this.. Additionally, I would like to contribute to the Python Scrapper for NPTEL/SWAYAM. Both NPTEL and SWAYAM are Indian Massive open online course platform where courses are taught by the finest teachers from some of the best colleges/Universities in India. Having an offline version of such content will definitely help people in rural areas gain access to education and knowledge.

rgaudin commented 3 years ago

@imnitishng would you mind sharing the content of your GSOC proposal here? I think it's valuable information. If you could copy/paste here the main part, that'd be great.

Here's how I see it coming from there:

We should test the current output of the KA channels already. Please share the channel IDs of those and I'll run one or two with kolibri2zim in the zimfarm. Knowing the current status seems essential.
your proposal is not clear whether on the methodology. Are you proposing to start a sushi chef from scratch? I'd advise against that at this stage as LE's one seems to include a lot of work already. They do mention the CSV method so maybe we should work off that.
Could you list the limits/issues that LE's sushi-chef has regarding our objective?
Finally, as you mentioned, an important part is having the exercises working. When I started the kolibri2zim thing I couldn't easily extract a standalone version of perseus to bundle in kolibri2zim. Perseus is working in kolibri but it's intricate in its UI. Given perseus is abandonned, it's important we tackle this first.

So, my suggestion:

[ ] post proposal here
[ ] share channels IDs
[ ] generate current-status ZIM files for testing
[ ] add exercise support to kolibri2zim https://github.com/openzim/kolibri2zim/issues/6
[ ] assess what needs to be done

@Popolechien we need a list of deliverables for this ticket. As pointed in the proposal, LE's version doesn't support all languages and there are some specificity (regional curriculum). We need to know exactly which ones to build and if not supported by LE's version, we should understand why.

imnitishng commented 3 years ago

Full Proposal

Current State

Khan academy is currently offered in 37 languages (listed on their official website sharing the same format as english version) and 6 languages not listed on official website with different website format like this.
Kolibri makes use of its ricecooker library to convert content from various sources to Kolibri content channels. Integration scripts for Khan academy are available here sushi-chef-khan-academy. These scripts use the Khan academy API to fetch information using a Projection attribute to fetch specific attributes.

Problems

The problem with this approach is that Khan API is deprecated and there is no proper documentation available for the endpoints. Moreover, running the recipe does not work anymore since there have been some changes in the APIs and the attributes taken as input have changed a lot, rendering the whole script outdated as of now. This outdated script leads us to another problem which is outdated content on Kolibri, as far as I have observed through manual checks, there has been a lot of content missing, outdated and unorganized in Kolibri channels for different Khan Academy languages. Courses that are published on Kolibri have either missing videos or topics or reorganized content that has been moved under a different topic.
Kolibri does not offer content in every language that Khan Academy is available in, the languages offered in official catalog from LE are listed below

Proposal

Khan Academy has these TSV files hosted on Google Cloud they are provided by to their community partners, mostly used by translation teams to know which videos and exercises are translated, they seem to get updated almost every 4-5 days for each language, we would not have to worry about outdated content and they are available for every language.

The TSV (Tab-separated values) consists of rows that represent an entity from Khan Academy’s website. These entities can belong to one of the following values

Domain
Course 
Unit 
Lesson
Video
Exercise
Article
Interactive
Talkthrough
Challenge
Project

The TSV files are going to be our source of all the information, so the idea is to parse the TSV files into a topic tree. Such a format will ensure that every element will be downloaded and stored in the order that they are present on the source.

The tree would start from a dummy empty Node which expands downwards storing the information about different entities from the TSV files, all the cleaning and sanity checks for the input values will be performed in the tree-building process to avoid future problems. Once we have our tree we will start recursively iterating the tree in a Depth-First or Top-bottom fashion. For every node we encounter, we will run the respective functions to fetch all the relevant data from the nodes and render the HTML ultimately adding it to the ZIM file. Sample Tree structure:

Domain, Course, Unit, Lesson - are simple categorizing entities that will consist of just names
Video - Video Node would consist of information from the TSV file The videos will be fetched from Khan Academy servers URLs provided, in case if it is not available we also have a youtube mirror available that can be used to download the videos or the translated video using the youtube ID.
Article - An article is a theoretical document that is made up of several perseus(Khan/perseus) components, these components can be simple text blocks, exercises within articles, fancily formatted text, code blocks, and much more. The general solution to scraping all of this data is through extracting their respective perseus JSON mapping. A sample reference article for English and Czech language. We can extract the relevant information to replicate the article from the text response once a request is made to an article URL listed in the TSV data using simple REGEX statements on the response. A visual representation of the extracted JSON for perseus can be fed to Perseus javascript to render the article. A sample article from JSON can be seen on the tool provided by Khan Academy here.
Exercises - The way exercises are served on Khan Academy’s website is through their renderer Khan/perseus, which provides various standard widgets like Graphs, Videos, Interactive Plots, Multiple choices, Rendering Math, etc to create exercises using simple JSON syntax. Fortunately, the data extracted from TSV files already contains the necessary JSON formatting to build the concerning exercise. This makes the job a lot easier, we will only need to parse the information and feed it into the HTML files after an optional image compression step, everything else will be handled by the Javascript libraries and dependencies we will install to add support for perseus.

imnitishng commented 3 years ago

Here are the Khan Academy channels available on Kolibri Studio

NAME - ID - RESOURCE COUNT - SIZE
Khan Academy (en - English - Standardized Test Preparation) - 6616efc8aa604a308c8f5d18b00a1ce3 - 2,539  10 GB
Khan Academy (en - English - US curriculum) - c9d7f950ab6b5a1199e3d6c10d7f0103 - 9,971  54 GB
Khan Academy (en - English - CBSE India Curriculum) - 2fd54ca47a8f59c99fcefaaa3894c19e - 11,066 61 GB
Khan Academy (es - Español) - c1f2b7e6ac9f56a2bb44fa7a48b66dce - 6,550  27 GB
Khan Academy (my - Burmese) - 2b608c6fd4c35c34b7387e3dd7b53265 - 185    236 MB
Khan Academy (bg - Bulgarian) - 09ee940e106953a2b6716e1020a0ce3f - 6,018    25 GB
Khan Academy (gu - Gujrati) - 5357e52581c3567da4f56d56badfeac7 - 2,740  5 GB
Khan Academy (it - Italiano) - 801a5f02942055698918edcff6494185 - 1,296 4 GB
Khan Academy (fr - Français) - 878ec2e6f88c5c268b1be6f202833cd4 - 4,347 13 GB
Khan Academy (bn - Bengali) - a03496a6de095e7ba9d24291a487c78d - 1,536  2 GB
Khan Academy (hi - Hindi) - a53592c972a8594e9b695aa127493ff6 - 1,366    2 GB
Khan Academy (ki - Kiswahili) - ec164fee25ee526296e68f7c10b1e169 - 1,421    2 GB
Khan Academy (zh - Chinese) - ec599e77f9ad58028975e8a26e6f1821 - 1,557  7 GB
Khan Academy (km - Khmer) - f5b71417b1f657fca4d1aaecd23e4067 - 506  948 MB
Khan Academy (pt - Português (Brasil)) - 2ac071c4672354f2aa78953448f81e50 - 6,815   27 GB
Khan Academy (pt-pt - Português (Portugal)) - c3231d844f8d5bb1b4cbc6a7ddd91eb7 - 1,445  1 GB

imnitishng commented 3 years ago

your proposal is not clear whether on the methodology. Are you proposing to start a sushi chef from scratch? I'd advise against that at this stage as LE's one seems to include a lot of work already. They do mention the CSV method so maybe we should work off that.

Initially I thought of creating a separate scraper for KA just like youtube so the proposal is based on that approach. I do plan to work on top of what's already done in LE sushi chef, we can borrow a lot from there. I'd like to ask whether we should create a new scraper for Khan Academy or work on creating a recipe for kolibri and improve kolibri2zim to support Khan academy and all other content offered by different channels. Creating a recipe and importing it to kolibri then scraping that to ZIM files feels like an extra step but yes it does serve a broader purpose of supporting other kolibri channels.

Could you list the limits/issues that LE's sushi-chef has regarding our objective?

The problem with this approach is that Khan API is deprecated and there is no proper documentation available for the endpoints. Moreover, running the recipe does not work anymore since there have been some changes in the APIs and the attributes taken as input have changed a lot, rendering the whole script outdated as of now. This outdated script leads us to another problem which is outdated content on Kolibri, as far as I have observed through manual checks, there has been a lot of content missing, outdated and unorganized in Kolibri channels for different Khan Academy languages. Courses that are published on Kolibri have either missing videos or topics or reorganized content that has been moved under a different topic.
Kolibri does not offer content in every language that Khan Academy is available in, the languages offered in official catalog from LE are listed above.
As mentioned above, LE's sushi chef does not work anymore because of undocumented internal Khan API changes, in the following image I had to omit and correct some fields used in the recipe to get a valid response from the API.
However the branch sushi-chef-khan-academy/tsvkhan seems to work well and it utilizes the TSV files. I did run into some exceptions while testing it but I believe it should work fine after some fixes. This version of recipe supports 28 languages.

When I started the kolibri2zim thing I couldn't easily extract a standalone version of perseus to bundle in kolibri2zim. Perseus is working in kolibri but it's intricate in its UI. Given perseus is abandonned, it's important we tackle this first.

Yes I do understand that this will require most of our focus. I'd like to get some starting points on this in the mean time I'll start exploring perseus and try figure out a way to make it work standalone.

rgaudin commented 3 years ago

Thank you @imnitishng for this lengthy response.

I've created a recipe for KA-fr first. There seem to be a minor issue that I'll fix and run it again.

Regarding the scraper, the less we have to maintain in the long term, the better. So unless there are issues with the sushi-chef preventing us from achieving what we want and that can't be overcome easily, I'd say we use it. The tsvkhan branch of course. The double step is not an issue. We can bundle both steps into a single container later on.

Regarding the language issue, am I right to understand that the Kolibri Studio platform doesn't allow setting language codes for all the languages KA is available in? Is that it?

So I'd advise you start by looking into the exercise issue: https://github.com/openzim/kolibri2zim/issues/37

jameelkaisar commented 3 years ago

I was not aware of TSV files. My approach (Prototype) was to directly scrape the content from KA website (very naive approach compared to @imnitishng 's approach).

TSV files contain most of the required information. I am of the opinion we should also add some additional information like comments (maybe at a later stage).

Perseus and TSV files are a good starting point. I think we should start by creating a renderer for extracting information from TSV files. Then we can start working on openzim/kolibri2zim#37.

rgaudin commented 3 years ago

TSV files contain most of the required information. I am of the opinion we should also add some additional information like comments (maybe at a later stage).

Yeah we shall start with core functionalities first.

I think we should start by creating a renderer for extracting information from TSV files. Then we can start working on openzim/kolibri2zim#37.

What do you mean by renderer? My understanding (although I have not tested it) is that the sushi-chef (tsv branch) already creates the Kolibri tree and channel from those. We already have the kolibri to Zim renderer (very basic, doesn't support exercises).

It's important we secure exercises first. Yet since you are both willing to work on this, one of you can work on the perseus issue and the other one look at what needs to be fixed/improved with the sushi-chef TSV branch.

I've forked the LE's repo into https://github.com/openzim/khan-chef.

Zimfarm recipe is now running. Issue was due known yet undeployed fixed issue (https://github.com/openzim/python-scraperlib/commit/0ca222ff8bea8b5ba474c501203f7ee00ed508b1)

imnitishng commented 3 years ago

Hi @rgaudin Now that we have exercise nodes support in kolibri2zim https://github.com/openzim/kolibri2zim/pull/38, we can move forward with this.

Although we haven't discussed this much, but based on our past discussion here, I'll share my understanding of work that needs to be done to achieve all the project objectives.

Supporting multiple languages

sushi-chef-khan-academy now uses TSV files as the only way to build the kolibri content, hence we can bundle this in a container alongwith kolibri2zim to generate the ZIM files for different languages, but it takes a lot of time to build these (due to youtube API for subtitles, thumbnails etc). So maybe I'll have modify it a bit to run faster locally (for testing purposes), or use a language with less content to fetch.
Once we have validated that sushi-chef works for our use case, we'll be creating kolibri channels out of them. So since currently kolibri2zim uses channels from https://studio.learningequality.org, I wanted to ask where we would be storing the channels that we create.
After this is finalized we can move on to creating the ZIM files using these channels and integrating the scraper to zimfarm which seems straightforward, I just have to get familiar with zimfarm a bit more.

Moreover, we have KA articles that also use perseus rendering and they aren't supported in both sushi-chef-khan-academy and kolibri2zim, so this can be another objective for the project as well.

Let me know if you have anything more to add to this.

rgaudin commented 3 years ago

Indeed, sushi-chef merged the tsvkhan branch into master. We should do the same on our fork.

I don't think bundling both tools into one is pertinent now: the sushi-chef is probably fragile and will need some fixes so it's unlikely this will go smoothly all the way through a working ZIM file. The first objective is thus probably your next one: using the sushi-chef to build a kolibri channel. We can surely use the smallest available language for dev.

Our goal is to upload our channels to the official kolibri studio. We have a 30GB account there with 7GB used ATM. If we can create a good quality khan channel, we can surely request a storage upgrade to upload those that we target: EN, FR and AR.

Once we have the chef able to produce EN and FR, we'll have to look into making it work with AR which is not in the list of supported languages. I have no idea why, as the TSV do contain AR files.

Then, we'll need to integrate those articles. I understand those are quite important but we need to fit into the Kolibri DB structure. Putting them as html5 nodes would require duplicating the super-large perseus reader for each (to comply with kolibri's package policy). Putting them as exercise node is probably more realistic but will require tweaks to our exercise node handling. We should probably also check the implications of marking a node exercise at kolibri level.

So to recap:

test/fix sushi-chef TSV
upload kolibri channels for EN and FR
integrate articles into sushi-chef and kolibri2zim
tweak sushi-chef for AR and upload to studio

You can start with 1. We can run it on an online worker for large languages when you are satisfied with the sushi-chef and share the output folder.

imnitishng commented 3 years ago

Hi @rgaudin, I went through khan-chef in detail. There were a few minor issues I found that are fixed in the PR https://github.com/openzim/khan-chef/pull/1, I'll mention the issues here

The request logic seemed broken so every request was made 5 times regardless of the response status (fail or pass)
We will need a Youtube API key (or I can use mine) for the online worker to run, if the API key is not provided we get the subtitle and other metadata using youtube-dl which in turn uses https://api.proxyscrape.com/?request=getproxies to get proxies that is very slow, so I just removed the parameter that uses proxy, which means we will need YT API key as an environment variable
After we have fetched and parsed the data from TSV files, we create the JSON tree and then upload it to Kolibri Studio, now the channels in studio have a unique ID, this is generated using a combination UUID. uuid.uuid5(uuid.uuid5(uuid.NAMESPACE_DNS, self.source_domain), self.source_id), the source domain was always khanacademy.org which caused UUID collision with the existing khan academy channels, hence during the upload we run into permission issues. So I changed this to genrate a different UUID for our kolibri-zim channels.

Apart from this things look stable based on my manual testing on ja and fr languages. Obviously it takes a lot of time to complete the full run so I ran it on a subset of content that extracts ~10-15 video nodes and ~30 exercise nodes and uploads the to kolibri studio, the testing isn't ideal but yes we can now move to running this on the online worker in debug mode and look for any issues we run into when we try to extract full data. I am not sure which "online worker" we will be running this on, but yes I'll be happy to run and monitor the worker so that we can fix it (if any issue arise) or move on to step 3 if everything goes as planned.

rgaudin commented 3 years ago

@imnitishng thanks for all this. I wasn't notified of the PR. Looked at it and merged it. It's indeed better now. Thanks for the fix.

We have a few Youtube API keys on the Zimfarm already. We'll use them whenn we launch the full run on the zimfarm. From your comment, it seems that we are ready to test a full run at least to make sure it can complete a language like French. Upload will probably fail due to storage, but we should have an estimate of the required disk space. I will launch such a run on one of the Zimfarm workers (that's what I meant).

rgaudin commented 1 year ago

We have one ZIM for Khan EN (in dev, misnamed -fr) but fixed recipe and recipes for FR and ES have been created and will be tried soon.

Closing this ; improvements will be followed-up elsewhere.

openzim / zim-requests

Khan Academy #327

Current State

Problems

Proposal

Supporting multiple languages