Hi, I would like to work on this issue. Could you please elaborate on it and on how I should proceed?
Hello everyone, I've got a hold of what needs to be done; let me know if this is the right direction. I have familiarized myself with the codebase of https://github.com/openzim/youtube. Does Kiwix want to do the same for Khan Academy videos?
The plan is to prepare the ZIM with kolibri2zim; the challenge is therefore to build a Kolibri sushi chef recipe able to import the videos in all languages into Kolibri.
So just for my understanding: the plan is to make a recipe like https://github.com/learningequality/sushi-chef-pradigi that can sync all the Khan Academy videos to Kolibri, and kolibri2zim will then make the ZIM file for the required course?
@saumyaborwankar Exactly.
No, as clearly stated on the gsoc page, the ultimate goal is to produce ZIM files for all Khan Academy languages.
Obviously, we don't want to start from scratch if we can leverage an existing project. We know there are offline packages of Khan Academy, but we don't know them well. The first step is thus to find out about all of those and document their strengths and limits, including whether they support all KA languages or not (in the past, different languages were based on different platforms, but that might have changed). We know that Kolibri has a few Khan Academy channels and we already have a kolibri2zim tool (although very new), but at this stage we're not locked into the Kolibri scenario.
I had a few questions:
@shashwat1002, I have no strong opinion on this. I believe ZIM files should mimic the online version where possible, so the Indian version might be organized by grade and other languages by topic. But as long as it's easily accessible, I don't think it matters much.
No, we want to cover as much as possible. ZIMs are offline and stateless, so we can't save progression, for instance. Even cookies are to be limited, as users frequently share browsers. But we do want to support everything that can be supported with reasonable effort.
Absolutely, videos are to be included as well.
Is this still open to work on?
Yes
@rgaudin, what are the requirements to participate in this project for GSoC? Is there any mini-task that needs to be done?
I'd like to take this on for GSoC. Let me know how we should proceed.
Just follow the GSoC procedure for now.
I too am interested in working on this. Additionally, I would like to contribute a Python scraper for NPTEL/SWAYAM. Both NPTEL and SWAYAM are Indian massive open online course platforms where courses are taught by the finest teachers from some of the best colleges and universities in India. Having an offline version of such content will definitely help people in rural areas gain access to education and knowledge.
@imnitishng would you mind sharing the content of your GSoC proposal here? I think it's valuable information. If you could copy/paste the main part here, that'd be great.
Here's how I see it coming from there:
So, my suggestion:
@Popolechien we need a list of deliverables for this ticket. As pointed out in the proposal, LE's version doesn't support all languages and there are some specificities (regional curricula). We need to know exactly which ones to build, and if they're not supported by LE's version, we should understand why.
Khan Academy hosts TSV files on Google Cloud that are provided to its community partners, mostly used by translation teams to track which videos and exercises are translated. They seem to get updated roughly every 4-5 days for each language, so we would not have to worry about outdated content, and they are available for every language.
Each TSV (tab-separated values) file consists of rows, each representing an entity from Khan Academy's website. An entity can be one of the following types:
Domain
Course
Unit
Lesson
Video
Exercise
Article
Interactive
Talkthrough
Challenge
Project
The TSV files are going to be our source of all the information, so the idea is to parse them into a topic tree. Such a format ensures that every element is downloaded and stored in the order in which it appears on the source site.
The tree would start from a dummy empty node which expands downwards, storing the information about the different entities from the TSV files. All the cleaning and sanity checks on the input values will be performed during the tree-building process to avoid future problems. Once we have our tree, we will iterate it recursively in a depth-first (top-down) fashion. For every node we encounter, we will run the respective functions to fetch all the relevant data, render the HTML, and ultimately add it to the ZIM file. Sample tree structure: see the sketch below.
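A minimal Python sketch of the build-then-walk idea (the `kind`, `slug`, and `parent_slug` column names and the dummy root are illustrative assumptions, not KA's actual TSV schema):

```python
import csv
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entity from a TSV row (Domain, Course, Unit, Lesson, Video, ...)."""
    kind: str
    slug: str
    title: str
    children: list["Node"] = field(default_factory=list)

def build_tree(tsv_path: str) -> Node:
    """Parse the TSV into a topic tree hanging off a dummy root node."""
    root = Node(kind="root", slug="", title="")
    by_slug = {"": root}
    with open(tsv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # cleaning and sanity checks on input values happen here,
            # at tree-building time
            node = Node(row["kind"], row["slug"], row["title"].strip())
            by_slug[node.slug] = node
            # attach to its parent, falling back to root if the parent is unknown
            by_slug.get(row.get("parent_slug", ""), root).children.append(node)
    return root

def walk(node: Node, depth: int = 0) -> None:
    """Depth-first (top-down) traversal preserving the source ordering."""
    if node.kind != "root":
        # in the real scraper: fetch data, render HTML, add to the ZIM
        print("  " * depth + f"{node.kind}: {node.title}")
    for child in node.children:
        walk(child, depth + 1)
```

Running `walk(build_tree("khan_fr.tsv"))` (file name purely illustrative) would print the indented topic tree; in the real scraper the `print` call would be replaced by the per-node fetch/render/add-to-ZIM step.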
Here are the Khan Academy channels available on Kolibri Studio
| Name | ID | Resource count | Size |
| --- | --- | --- | --- |
| Khan Academy (en - English - Standardized Test Preparation) | 6616efc8aa604a308c8f5d18b00a1ce3 | 2,539 | 10 GB |
| Khan Academy (en - English - US curriculum) | c9d7f950ab6b5a1199e3d6c10d7f0103 | 9,971 | 54 GB |
| Khan Academy (en - English - CBSE India Curriculum) | 2fd54ca47a8f59c99fcefaaa3894c19e | 11,066 | 61 GB |
| Khan Academy (es - Español) | c1f2b7e6ac9f56a2bb44fa7a48b66dce | 6,550 | 27 GB |
| Khan Academy (my - Burmese) | 2b608c6fd4c35c34b7387e3dd7b53265 | 185 | 236 MB |
| Khan Academy (bg - Bulgarian) | 09ee940e106953a2b6716e1020a0ce3f | 6,018 | 25 GB |
| Khan Academy (gu - Gujrati) | 5357e52581c3567da4f56d56badfeac7 | 2,740 | 5 GB |
| Khan Academy (it - Italiano) | 801a5f02942055698918edcff6494185 | 1,296 | 4 GB |
| Khan Academy (fr - Français) | 878ec2e6f88c5c268b1be6f202833cd4 | 4,347 | 13 GB |
| Khan Academy (bn - Bengali) | a03496a6de095e7ba9d24291a487c78d | 1,536 | 2 GB |
| Khan Academy (hi - Hindi) | a53592c972a8594e9b695aa127493ff6 | 1,366 | 2 GB |
| Khan Academy (ki - Kiswahili) | ec164fee25ee526296e68f7c10b1e169 | 1,421 | 2 GB |
| Khan Academy (zh - Chinese) | ec599e77f9ad58028975e8a26e6f1821 | 1,557 | 7 GB |
| Khan Academy (km - Khmer) | f5b71417b1f657fca4d1aaecd23e4067 | 506 | 948 MB |
| Khan Academy (pt - Português (Brasil)) | 2ac071c4672354f2aa78953448f81e50 | 6,815 | 27 GB |
| Khan Academy (pt-pt - Português (Portugal)) | c3231d844f8d5bb1b4cbc6a7ddd91eb7 | 1,445 | 1 GB |
Your proposal is not clear on the methodology: are you proposing to start a sushi chef from scratch? I'd advise against that at this stage, as LE's one seems to include a lot of work already. They do mention the TSV method, so maybe we should work off that.
Initially I thought of creating a separate scraper for KA, just like the YouTube one, so the proposal is based on that approach. I do plan to work on top of what's already done in LE's sushi chef; we can borrow a lot from there. I'd like to ask whether we should create a new scraper for Khan Academy, or work on creating a recipe for Kolibri and improve kolibri2zim to support Khan Academy and all other content offered by different channels. Creating a recipe, importing it into Kolibri, and then scraping that into ZIM files feels like an extra step, but it does serve the broader purpose of supporting other Kolibri channels.
Could you list the limits/issues that LE's sushi-chef has regarding our objective?
When I started kolibri2zim, I couldn't easily extract a standalone version of perseus to bundle in kolibri2zim. Perseus works in Kolibri, but it is intricately tied to Kolibri's UI. Given that perseus is abandoned, it's important we tackle this first.
Yes, I do understand that this will require most of our focus. I'd like to get some starting points on this; in the meantime I'll start exploring perseus and try to figure out a way to make it work standalone.
Thank you @imnitishng for this lengthy response.
I've created a recipe for KA-fr first. There seems to be a minor issue that I'll fix before running it again.
Regarding the scraper, the less we have to maintain in the long term, the better. So unless there are issues with the sushi-chef that prevent us from achieving what we want and can't be overcome easily, I'd say we use it (the tsvkhan branch, of course). The double step is not an issue: we can bundle both steps into a single container later on.
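To make the single-container idea concrete, the entrypoint could simply chain the two commands. This is only a sketch: the `STUDIO_TOKEN` variable, the `lang=` chef option, and the kolibri2zim `--channel-id`/`--name` flags are assumptions to check against each tool's documentation.

```python
#!/usr/bin/env python3
"""Sketch of a single-container entrypoint chaining the chef and kolibri2zim."""
import os
import subprocess

def build_zim(lang: str, channel_id: str) -> None:
    # Step 1: run the sushi-chef to (re)build and upload the Kolibri channel.
    subprocess.run(
        ["python", "sushichef.py",
         "--token", os.environ["STUDIO_TOKEN"],  # Kolibri Studio API token (assumed name)
         f"lang={lang}"],
        check=True,
    )
    # Step 2: render the published channel into a ZIM file.
    subprocess.run(
        ["kolibri2zim", "--channel-id", channel_id,
         "--name", f"khanacademy_{lang}"],
        check=True,
    )

if __name__ == "__main__":
    # FR channel ID taken from the Kolibri Studio list above
    build_zim("fr", "878ec2e6f88c5c268b1be6f202833cd4")
```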
Regarding the language issue, am I right to understand that the Kolibri Studio platform doesn't allow setting language codes for all the languages KA is available in? Is that it?
So I'd advise you start by looking into the exercise issue: https://github.com/openzim/kolibri2zim/issues/37
I was not aware of the TSV files. My approach (prototype) was to directly scrape the content from the KA website (a very naive approach compared to @imnitishng's).
TSV files contain most of the required information. I am of the opinion we should also add some additional information like comments (maybe at a later stage).
Perseus and TSV files are a good starting point. I think we should start by creating a renderer for extracting information from TSV files. Then we can start working on openzim/kolibri2zim#37.
> TSV files contain most of the required information. I am of the opinion we should also add some additional information like comments (maybe at a later stage).
Yeah, we shall start with the core functionality first.
> I think we should start by creating a renderer for extracting information from TSV files. Then we can start working on openzim/kolibri2zim#37.
What do you mean by renderer? My understanding (although I have not tested it) is that the sushi-chef (tsv branch) already creates the Kolibri tree and channel from those files. We already have the Kolibri-to-ZIM renderer, although it's very basic and doesn't support exercises.
It's important we secure exercises first. Since you are both willing to work on this, one of you can work on the perseus issue while the other looks at what needs to be fixed/improved in the sushi-chef TSV branch.
I've forked the LE's repo into https://github.com/openzim/khan-chef.
The Zimfarm recipe is now running. The issue was due to a known but not-yet-deployed fix (https://github.com/openzim/python-scraperlib/commit/0ca222ff8bea8b5ba474c501203f7ee00ed508b1).
Hi @rgaudin
Now that we have exercise node support in kolibri2zim (https://github.com/openzim/kolibri2zim/pull/38), we can move forward with this.
Although we haven't discussed this much, based on our past discussions here, I'll share my understanding of the work that needs to be done to achieve all the project objectives:
- The `sushi-chef` produces `kolibri` content, hence we can bundle this in a container along with `kolibri2zim` to generate the ZIM files for different languages. But it takes a lot of time to build these (due to the YouTube API for subtitles, thumbnails, etc.), so maybe I'll have to modify it a bit to run faster locally (for testing purposes), or use a language with less content to fetch.
- Once the `sushi-chef` works for our use case, we'll be creating `kolibri` channels out of the TSVs. Since `kolibri2zim` currently uses channels from https://studio.learningequality.org, I wanted to ask where we would be storing the channels that we create.
- Moreover, we have KA articles that also use `perseus` rendering, and they aren't supported by either `sushi-chef-khan-academy` or `kolibri2zim`, so this can be another objective for the project as well.
Let me know if you have anything more to add to this.
Indeed, sushi-chef merged the tsvkhan branch into master. We should do the same on our fork.
I don't think bundling both tools into one is pertinent now: the sushi-chef is probably fragile and will need some fixes, so it's unlikely this will go smoothly all the way through to a working ZIM file. The first objective is thus probably your next one: using the sushi-chef to build a Kolibri channel. We can surely use the smallest available language for dev.
Our goal is to upload our channels to the official kolibri studio. We have a 30GB account there with 7GB used ATM. If we can create a good quality khan channel, we can surely request a storage upgrade to upload those that we target: EN, FR and AR.
Once we have the chef able to produce EN and FR, we'll have to look into making it work with AR, which is not in the list of supported languages. I have no idea why, as the TSVs do contain AR files.
Then, we'll need to integrate those articles. I understand those are quite important, but we need to fit them into the Kolibri DB structure. Putting them as `html5` nodes would require duplicating the super-large perseus reader for each (to comply with Kolibri's package policy). Putting them as `exercise` nodes is probably more realistic but will require tweaks to our exercise node handling. We should probably also check the implications of marking a node `exercise` at the Kolibri level.
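As a rough illustration of the `exercise` route, the chef could wrap each article's perseus payload as a single-question exercise using ricecooker's classes. This is a hedged sketch: verify the constructor signatures against the ricecooker version in use, and the URL pattern is made up.

```python
from ricecooker.classes.nodes import ExerciseNode
from ricecooker.classes.questions import PerseusQuestion
from le_utils.constants import licenses

def article_as_exercise(article_id: str, title: str, perseus_json: str) -> ExerciseNode:
    """Wrap a KA article (already in perseus format) as an exercise node, so
    Kolibri's bundled perseus renderer displays it without shipping a
    duplicated html5 reader per article."""
    node = ExerciseNode(
        source_id=f"article-{article_id}",
        title=title,
        license=licenses.CC_BY_NC_SA,
        copyright_holder="Khan Academy",
    )
    # The article body rides along as one "question" with nothing to grade.
    node.add_question(PerseusQuestion(
        id=f"{article_id}-body",
        raw_data=perseus_json,
        source_url=f"https://khanacademy.org/a/{article_id}",  # illustrative URL
    ))
    return node
```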
So to recap:
1. Use the sushi-chef to build a Kolibri channel, working off the smallest available language for dev.
2. Upload our channels to the official Kolibri Studio, requesting a storage upgrade if needed.
3. Make the chef work with AR, which is not in its list of supported languages.
4. Integrate the articles, most likely as `exercise` nodes.
You can start with 1. We can run it on an online worker for large languages when you are satisfied with the sushi-chef and share the output folder.
Hi @rgaudin, I went through khan-chef in detail. I found a few minor issues, which are fixed in the PR https://github.com/openzim/khan-chef/pull/1. I'll list the issues here:
- The chef uses `youtube-dl`, which in turn uses https://api.proxyscrape.com/?request=getproxies to fetch proxies, and that is very slow. So I removed the parameter that enables the proxy, which means we will need a YT API key as an environment variable.
- Channel UUIDs are computed as `uuid.uuid5(uuid.uuid5(uuid.NAMESPACE_DNS, self.source_domain), self.source_id)`, and the source domain was always `khanacademy.org`, which caused UUID collisions with the existing Khan Academy channels; hence during upload we ran into permission issues. So I changed this to generate a different UUID for our `kolibri-zim` channels.

Apart from this, things look stable based on my manual testing on the `ja` and `fr` languages. Obviously it takes a lot of time to complete a full run, so I ran it on a subset of content that extracts ~10-15 video nodes and ~30 exercise nodes and uploads them to Kolibri Studio. The testing isn't ideal, but we can now move to running this on the online worker in debug mode and watch for any issues when we try to extract the full data.
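For reference, here is why the collision happened and how the fix works: `uuid5` is deterministic, so reusing `khanacademy.org` as the namespace domain reproduces the official channels' IDs. The alternative domain and source_id below are just illustrative values.

```python
import uuid

SOURCE_ID = "khan-academy-fr"  # illustrative source_id

# Original scheme: namespace derived from khanacademy.org reproduces the
# official channels' IDs, hence the Studio permission errors on upload.
official_ns = uuid.uuid5(uuid.NAMESPACE_DNS, "khanacademy.org")
official_id = uuid.uuid5(official_ns, SOURCE_ID)

# Fixed scheme: any other domain yields a disjoint, still-deterministic ID space.
our_ns = uuid.uuid5(uuid.NAMESPACE_DNS, "zim.khanacademy.org")  # illustrative domain
our_id = uuid.uuid5(our_ns, SOURCE_ID)

assert official_id != our_id
```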
I am not sure which "online worker" we will be running this on, but I'll be happy to run and monitor the worker so that we can fix it (if any issues arise) or move on to step 3 if everything goes as planned.
@imnitishng thanks for all this. I wasn't notified of the PR. Looked at it and merged it. It's indeed better now. Thanks for the fix.
We have a few YouTube API keys on the Zimfarm already. We'll use them when we launch the full run on the Zimfarm. From your comment, it seems we are ready to test a full run, at least to make sure it can complete a language like French. The upload will probably fail due to storage, but we should get an estimate of the required disk space. I will launch such a run on one of the Zimfarm workers (that's what I meant by "online worker").
We have one ZIM for Khan EN (in dev, misnamed `-fr`), but a fixed recipe, plus recipes for FR and ES, have been created and will be tried soon.
Closing this; improvements will be followed up elsewhere.
See also https://github.com/rand-net/khan-dl