openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Khan Academy #327

Closed Popolechien closed 1 year ago

Popolechien commented 3 years ago

See also https://github.com/rand-net/khan-dl

iharshit009 commented 3 years ago

Hi, I would like to work on this issue. Could you please elaborate on it, how do I need to proceed?

saumyaborwankar commented 3 years ago

Hello everyone, I think I have a handle on what needs to be done. Let me know if this is the right direction. I have familiarized myself with the codebase for https://github.com/openzim/youtube. Does Kiwix want to do the same for Khan Academy videos?

kelson42 commented 3 years ago

The plan is to prepare the ZIM with kolibri2zim. The challenge is therefore to build a Kolibri sushi-chef recipe able to import the videos in all languages into Kolibri.

saumyaborwankar commented 3 years ago

So just for my understanding, the plan is to make a recipe like https://github.com/learningequality/sushi-chef-pradigi which can sync all the Khan Academy videos to Kolibri, and kolibri2zim will then make the ZIM file for the required course?

kelson42 commented 3 years ago

@saumyaborwankar Exact

rgaudin commented 3 years ago

No, as clearly stated on the gsoc page, the ultimate goal is to produce ZIM files for all Khan Academy languages.

Obviously, we don't want to start from scratch if we can leverage an existing project. We know there are offline packages of Khan Academy but we don't know them well. The first step is thus to find out about all of those and document their strengths and limits, including whether they support all KA languages or not (in the past, different languages were based on different platforms, but that might have changed). We know that Kolibri has a few Khan Academy channels and we already have a kolibri2zim tool (although very new), but at this stage we're not locked into the Kolibri scenario.

shashwat1002 commented 3 years ago

I had a few questions:

rgaudin commented 3 years ago

@shashwat1002, I have no strong opinion on this. I believe ZIM files should mimic the online version if possible. So the Indian version might be organized by grades and other languages by topics. But as long as it's easily accessible, I don't think that matters much.

No, we want to cover as much as possible. ZIMs are offline and stateless, so we can't save progression, for instance. Even cookies are to be limited as users frequently share browsers. But we do want to support everything that can be supported with reasonable effort.

Absolutely, videos are to be included as well.

TheFenrisLycaon commented 3 years ago

Is this still open to work?

rgaudin commented 3 years ago

Yes

RafayGhafoor commented 3 years ago

@rgaudin, what's the requirement to participate in this project for GSoC? Is there any mini-task that needs to be done?

TheFenrisLycaon commented 3 years ago

I'd like to take this on for GSoC. Let me know how we should proceed.

rgaudin commented 3 years ago

Just follow the GSOC procedure for now

oshin94 commented 3 years ago

I too am interested in working on this. Additionally, I would like to contribute to the Python scraper for NPTEL/SWAYAM. Both NPTEL and SWAYAM are Indian massive open online course platforms where courses are taught by the finest teachers from some of the best colleges and universities in India. Having an offline version of such content will definitely help people in rural areas gain access to education and knowledge.

rgaudin commented 3 years ago

@imnitishng would you mind sharing the content of your GSOC proposal here? I think it's valuable information. If you could copy/paste here the main part, that'd be great.

Here's how I see it coming from there:

So, my suggestion:

@Popolechien we need a list of deliverables for this ticket. As pointed out in the proposal, LE's version doesn't support all languages and there are some specificities (regional curricula). We need to know exactly which ones to build and, if not supported by LE's version, we should understand why.

imnitishng commented 3 years ago

Full Proposal

Current State

Problems

Proposal

Khan Academy hosts TSV files on Google Cloud that it provides to its community partners. They are mostly used by translation teams to know which videos and exercises are translated. They seem to be updated almost every 4-5 days for each language, so we would not have to worry about outdated content, and they are available for every language.

Each TSV (tab-separated values) file consists of rows that each represent an entity from Khan Academy's website. An entity can be of one of the following kinds:

Domain
Course 
Unit 
Lesson
Video
Exercise
Article
Interactive
Talkthrough
Challenge
Project

The TSV files are going to be our source of all the information, so the idea is to parse the TSV files into a topic tree. Such a format will ensure that every element will be downloaded and stored in the order that they are present on the source.

The tree would start from a dummy empty node and expand downwards, storing the information about the different entities from the TSV files. All the cleaning and sanity checks on the input values will be performed during the tree-building process to avoid problems later. Once we have our tree, we will recursively traverse it in a depth-first (top-down) fashion. For every node we encounter, we will run the respective functions to fetch all the relevant data for that node and render the HTML, ultimately adding it to the ZIM file. Sample tree structure:

[image: sample tree structure]
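The tree-building and traversal steps described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the proposal's actual code: the column names `kind` and `title` are hypothetical (the real TSV headers are not given in this thread), and the hierarchy is assumed to follow row order, with each Domain/Course/Unit/Lesson row opening a container for the rows after it.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Assumed nesting depth of the container kinds listed above.
CONTAINER_DEPTH = {"Domain": 1, "Course": 2, "Unit": 3, "Lesson": 4}
LEAF_KINDS = {
    "Video", "Exercise", "Article", "Interactive",
    "Talkthrough", "Challenge", "Project",
}


@dataclass
class Node:
    kind: str
    title: str
    children: List["Node"] = field(default_factory=list)


def build_tree(tsv_text: str) -> Node:
    """Build a topic tree from TSV text with hypothetical `kind`/`title` columns."""
    root = Node("Root", "")  # dummy empty node at the top of the tree
    stack: List[Tuple[int, Node]] = [(0, root)]  # currently open containers
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        kind = (row.get("kind") or "").strip()
        title = (row.get("title") or "").strip()
        if not title:
            continue  # sanity check: skip rows without a usable title
        if kind in CONTAINER_DEPTH:
            depth = CONTAINER_DEPTH[kind]
            # Close every open container at the same or a deeper level.
            while stack[-1][0] >= depth:
                stack.pop()
            node = Node(kind, title)
            stack[-1][1].children.append(node)
            stack.append((depth, node))
        elif kind in LEAF_KINDS:
            stack[-1][1].children.append(Node(kind, title))
        # Rows of any other kind are ignored.
    return root


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs depth-first, in source order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# Made-up sample rows, for illustration only.
SAMPLE = (
    "kind\ttitle\n"
    "Domain\tMath\n"
    "Course\tAlgebra\n"
    "Unit\tLinear equations\n"
    "Lesson\tIntro\n"
    "Video\tWhat is a variable?\n"
    "Exercise\tSolve for x\n"
    "Course\tGeometry\n"
)

tree = build_tree(SAMPLE)
for depth, node in walk(tree):
    if node.kind != "Root":
        print("  " * (depth - 1) + f"{node.kind}: {node.title}")
```

In a real scraper the `print` call would be replaced by the per-kind fetch and render functions; the point here is only that a depth-first walk visits every entity in the order it appears in the source.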

imnitishng commented 3 years ago

Here are the Khan Academy channels available on Kolibri Studio

NAME - ID - RESOURCE COUNT - SIZE
Khan Academy (en - English - Standardized Test Preparation) - 6616efc8aa604a308c8f5d18b00a1ce3 - 2,539 - 10 GB
Khan Academy (en - English - US curriculum) - c9d7f950ab6b5a1199e3d6c10d7f0103 - 9,971 - 54 GB
Khan Academy (en - English - CBSE India Curriculum) - 2fd54ca47a8f59c99fcefaaa3894c19e - 11,066 - 61 GB
Khan Academy (es - Español) - c1f2b7e6ac9f56a2bb44fa7a48b66dce - 6,550 - 27 GB
Khan Academy (my - Burmese) - 2b608c6fd4c35c34b7387e3dd7b53265 - 185 - 236 MB
Khan Academy (bg - Bulgarian) - 09ee940e106953a2b6716e1020a0ce3f - 6,018 - 25 GB
Khan Academy (gu - Gujrati) - 5357e52581c3567da4f56d56badfeac7 - 2,740 - 5 GB
Khan Academy (it - Italiano) - 801a5f02942055698918edcff6494185 - 1,296 - 4 GB
Khan Academy (fr - Français) - 878ec2e6f88c5c268b1be6f202833cd4 - 4,347 - 13 GB
Khan Academy (bn - Bengali) - a03496a6de095e7ba9d24291a487c78d - 1,536 - 2 GB
Khan Academy (hi - Hindi) - a53592c972a8594e9b695aa127493ff6 - 1,366 - 2 GB
Khan Academy (ki - Kiswahili) - ec164fee25ee526296e68f7c10b1e169 - 1,421 - 2 GB
Khan Academy (zh - Chinese) - ec599e77f9ad58028975e8a26e6f1821 - 1,557 - 7 GB
Khan Academy (km - Khmer) - f5b71417b1f657fca4d1aaecd23e4067 - 506 - 948 MB
Khan Academy (pt - Português (Brasil)) - 2ac071c4672354f2aa78953448f81e50 - 6,815 - 27 GB
Khan Academy (pt-pt - Português (Portugal)) - c3231d844f8d5bb1b4cbc6a7ddd91eb7 - 1,445 - 1 GB

imnitishng commented 3 years ago

Your proposal is not clear on the methodology. Are you proposing to start a sushi chef from scratch? I'd advise against that at this stage as LE's one seems to include a lot of work already. They do mention the CSV method so maybe we should work off that.

Initially I thought of creating a separate scraper for KA, just like the YouTube one, so the proposal is based on that approach. I do plan to build on what's already done in LE's sushi chef; we can borrow a lot from there. I'd like to ask whether we should create a new scraper for Khan Academy, or create a recipe for Kolibri and improve kolibri2zim to support Khan Academy and all the other content offered by different channels. Creating a recipe, importing it into Kolibri and then scraping that to ZIM files feels like an extra step, but it does serve the broader purpose of supporting other Kolibri channels.

Could you list the limits/issues that LE's sushi-chef has regarding our objective?

When I started the kolibri2zim thing I couldn't easily extract a standalone version of perseus to bundle in kolibri2zim. Perseus is working in Kolibri but it's intricately tied to its UI. Given perseus is abandoned, it's important we tackle this first.

Yes, I do understand that this will require most of our focus. I'd like to get some starting points on this. In the meantime, I'll start exploring perseus and try to figure out a way to make it work standalone.

rgaudin commented 3 years ago

Thank you @imnitishng for this lengthy response.

I've created a recipe for KA-fr first. There seems to be a minor issue that I'll fix before running it again.

Regarding the scraper, the less we have to maintain in the long term, the better. So unless there are issues with the sushi-chef preventing us from achieving what we want and that can't be overcome easily, I'd say we use it. The tsvkhan branch of course. The double step is not an issue. We can bundle both steps into a single container later on.

Regarding the language issue, am I right to understand that the Kolibri Studio platform doesn't allow setting language codes for all the languages KA is available in? Is that it?

So I'd advise you start by looking into the exercise issue: https://github.com/openzim/kolibri2zim/issues/37

jameelkaisar commented 3 years ago

I was not aware of the TSV files. My approach (prototype) was to scrape the content directly from the KA website (a very naive approach compared to @imnitishng's).

TSV files contain most of the required information. I am of the opinion we should also add some additional information like comments (maybe at a later stage).

Perseus and TSV files are a good starting point. I think we should start by creating a renderer for extracting information from TSV files. Then we can start working on openzim/kolibri2zim#37.

rgaudin commented 3 years ago

TSV files contain most of the required information. I am of the opinion we should also add some additional information like comments (maybe at a later stage).

Yeah we shall start with core functionalities first.

I think we should start by creating a renderer for extracting information from TSV files. Then we can start working on openzim/kolibri2zim#37.

What do you mean by renderer? My understanding (although I have not tested it) is that the sushi-chef (tsv branch) already creates the Kolibri tree and channel from those. We already have the kolibri to Zim renderer (very basic, doesn't support exercises).

It's important we secure exercises first. Yet since you are both willing to work on this, one of you can work on the perseus issue and the other one look at what needs to be fixed/improved with the sushi-chef TSV branch.

I've forked the LE's repo into https://github.com/openzim/khan-chef.

The Zimfarm recipe is now running. The issue was due to a known bug that was already fixed but not yet deployed (https://github.com/openzim/python-scraperlib/commit/0ca222ff8bea8b5ba474c501203f7ee00ed508b1)

imnitishng commented 3 years ago

Hi @rgaudin Now that we have exercise nodes support in kolibri2zim https://github.com/openzim/kolibri2zim/pull/38, we can move forward with this.

Although we haven't discussed this much, based on our past discussion here, I'll share my understanding of the work that needs to be done to achieve all the project objectives.

Supporting multiple languages

Moreover, we have KA articles that also use perseus rendering, and they aren't supported in either sushi-chef-khan-academy or kolibri2zim, so this can be another objective for the project.

Let me know if you have anything more to add to this.

rgaudin commented 3 years ago

Indeed, sushi-chef merged the tsvkhan branch into master. We should do the same on our fork.

I don't think bundling both tools into one is pertinent now: the sushi-chef is probably fragile and will need some fixes so it's unlikely this will go smoothly all the way through a working ZIM file. The first objective is thus probably your next one: using the sushi-chef to build a kolibri channel. We can surely use the smallest available language for dev.

Our goal is to upload our channels to the official kolibri studio. We have a 30GB account there with 7GB used ATM. If we can create a good quality khan channel, we can surely request a storage upgrade to upload those that we target: EN, FR and AR.
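As a rough back-of-the-envelope check of that storage constraint, the arithmetic can be sketched as below. The channel sizes are the ones listed earlier in this thread for Learning Equality's existing channels (assuming the US curriculum for EN); channels built by our own chef may differ, and no size is known for AR, so it is left out.

```python
# Figures quoted in this thread; treat them as estimates, not measurements.
QUOTA_GB = 30
USED_GB = 7
# Sizes of LE's existing channels (assumption: ours would be comparable).
CHANNEL_SIZES_GB = {"EN (US curriculum)": 54, "FR": 13}

free_gb = QUOTA_GB - USED_GB
needed_gb = sum(CHANNEL_SIZES_GB.values())
shortfall_gb = max(0, needed_gb - free_gb)

for name, size in CHANNEL_SIZES_GB.items():
    verdict = "fits" if size <= free_gb else "does not fit"
    print(f"{name}: {size} GB, {verdict} in the {free_gb} GB currently free")
print(f"Both channels: {needed_gb} GB needed, {shortfall_gb} GB short")
```

Under these assumptions FR alone would fit in the free space, while EN would not, which is consistent with needing a storage upgrade before uploading all targets.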

Once we have the chef able to produce EN and FR, we'll have to look into making it work with AR, which is not in the list of supported languages. I have no idea why, as the TSV files do include AR.

Then, we'll need to integrate those articles. I understand those are quite important but we need to fit into the Kolibri DB structure. Putting them as html5 nodes would require duplicating the super-large perseus reader for each (to comply with Kolibri's package policy). Putting them as exercise nodes is probably more realistic but will require tweaks to our exercise node handling. We should probably also check the implications of marking a node as an exercise at the Kolibri level.

So to recap:

  1. test/fix sushi-chef TSV
  2. upload kolibri channels for EN and FR
  3. integrate articles into sushi-chef and kolibri2zim
  4. tweak sushi-chef for AR and upload to studio

You can start with 1. We can run it on an online worker for large languages when you are satisfied with the sushi-chef and share the output folder.

imnitishng commented 3 years ago

Hi @rgaudin, I went through khan-chef in detail. There were a few minor issues I found that are fixed in the PR https://github.com/openzim/khan-chef/pull/1, I'll mention the issues here

Apart from this, things look stable based on my manual testing on the ja and fr languages. A full run obviously takes a lot of time, so I ran it on a subset of content that extracts ~10-15 video nodes and ~30 exercise nodes and uploads them to Kolibri Studio. The testing isn't ideal, but we can now move to running this on the online worker in debug mode and look for any issues we run into when extracting the full data. I am not sure which "online worker" we will be running this on, but I'll be happy to run and monitor the worker so that we can fix any issues that arise, or move on to step 3 if everything goes as planned.
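For anyone wanting to reproduce that kind of reduced test run, one possible way to cap the number of nodes per kind is sketched below. This is purely illustrative and not how khan-chef does it: the nested-dict tree shape, the `prune` helper and the kind names are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Optional


def prune(node: dict, limits: Dict[str, int],
          seen: Optional[Counter] = None) -> Optional[dict]:
    """Return a copy of `node` keeping at most limits[kind] nodes of each capped kind.

    Containers left without any children after pruning are dropped (None).
    """
    seen = Counter() if seen is None else seen
    kind = node["kind"]
    if kind in limits:  # capped leaf kind, e.g. Video or Exercise
        if seen[kind] >= limits[kind]:
            return None
        seen[kind] += 1
        return dict(node)
    children = node.get("children")
    if children is None:
        return dict(node)  # uncapped leaf, kept as is
    kept = []
    for child in children:
        pruned = prune(child, limits, seen)
        if pruned is not None:
            kept.append(pruned)
    if not kept:
        return None
    return {**node, "children": kept}


# Made-up tree, for illustration only.
full_tree = {
    "kind": "Topic", "title": "root", "children": [
        {"kind": "Topic", "title": "A", "children": [
            {"kind": "Video", "title": "v1"},
            {"kind": "Video", "title": "v2"},
            {"kind": "Exercise", "title": "e1"},
        ]},
        {"kind": "Topic", "title": "B", "children": [
            {"kind": "Video", "title": "v3"},
            {"kind": "Exercise", "title": "e2"},
        ]},
    ],
}

small_tree = prune(full_tree, {"Video": 2, "Exercise": 1})
print([topic["title"] for topic in small_tree["children"]])
```

The caps are applied in traversal order, so the subset is simply the first few nodes of each kind; a more representative sample would need a different selection rule.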

rgaudin commented 3 years ago

@imnitishng thanks for all this. I wasn't notified of the PR. Looked at it and merged it. It's indeed better now. Thanks for the fix.

We have a few YouTube API keys on the Zimfarm already. We'll use them when we launch the full run on the Zimfarm. From your comment, it seems that we are ready to test a full run, at least to make sure it can complete a language like French. The upload will probably fail due to storage, but we should get an estimate of the required disk space. I will launch such a run on one of the Zimfarm workers (that's what I meant).

rgaudin commented 1 year ago

We have one ZIM for Khan EN (in dev, misnamed -fr); the recipe has been fixed, and recipes for FR and ES have been created and will be tried soon.

Closing this; improvements will be followed up elsewhere.