openzim / freecodecamp

FreeCodeCamp.org scraper (to ZIM)
GNU General Public License v3.0

freeCodeCamp builder #5

Closed mdp closed 1 year ago

mdp commented 1 year ago

This is now working and includes:

mdp commented 1 year ago

Thanks for the feedback. Most of this is pretty easy to address but I have a question around the structuring of the client app and content.

Currently we have a Vue.js client that takes in a URL fragment, which it maps to course content and pulls in the background on a change. With regard to the fragment, can we fix this with query params? Something like index.html?course=english%2Fregular-expressions%2Fextract-matches. Would it then be possible to use a redirect? Unfortunately, Vue routing doesn't really provide an option for building out a single-page application for each Challenge ("english/regular-expressions/extract-matches/index.html"), which would be pretty redundant; it offers either a hash mode or an HTML5 mode (which requires a catch-all route to the SPA index.html - https://router.vuejs.org/guide/essentials/history-mode.html#html5-mode). I'm pretty sure using a catch-all is not an option in Libzim.

I'll start on the rest of the feedback today.

rgaudin commented 1 year ago

> I'm pretty sure using a catch-all is not an option in Libzim.

Indeed it wouldn't work. The query-string instead of the fragment would suffer from the same limitation.

I don't see any way to do it using a ZIM redirect because those are ZIM-entry to ZIM-entry redirects and no request context is sent to the target.

I think the easiest solution, which is not the prettiest but is one we've followed in the past, is to create stub ZIM entries for each course (or curriculum – what did you have in mind?). Each entry would be a minimal HTML page that simply declares a redirect (a client-side one this time, via meta refresh or JS) to the fragment-based URL of the content.

This would allow those entries to be served by the Suggestion feature as well as the Random one. We could even, as a later optimization, give the text of the corresponding Markdown to Xapian so that those entries are returned by the full-text engine.
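
As a rough illustration of the stub-entry idea, here is a minimal Python sketch that builds such a redirect page. The function name `stub_redirect_html`, the `index.html#/<course path>` URL layout, and the assumption that each stub is stored at a ZIM path equal to its course path are all hypothetical; the actual scraper may lay things out differently.

```python
from html import escape
from urllib.parse import quote


def stub_redirect_html(course_path: str, title: str, spa_entry: str = "index.html") -> str:
    """Return a minimal HTML page that forwards to the fragment-based SPA URL."""
    # Assumption: the stub lives at `course_path`, so climb back to the ZIM root.
    prefix = "../" * course_path.count("/")
    # Assumption: the SPA reads the course from a `#/<course_path>` fragment.
    target = escape(f"{prefix}{spa_entry}#/{quote(course_path)}", quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        # Client-side redirect via meta refresh; the link is a manual fallback.
        f'<meta http-equiv="refresh" content="0; url={target}">\n'
        "</head><body>\n"
        f'<a href="{target}">{escape(title)}</a>\n'
        "</body></html>\n"
    )


page = stub_redirect_html("english/regular-expressions/extract-matches", "Extract Matches")
```

Because each stub carries a real title, it gives the Suggestion and Random features something to point at while the Vue app still handles the actual rendering.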

mdp commented 1 year ago

I think this is in decent shape for another review when you get the chance @rgaudin

Fixed/Added:

Outstanding items:

Thanks

rgaudin commented 1 year ago

> • FCC has chinese and chinese-traditional, which would normally map to zh-cn and zh-tw respectively, but not sure about this with 3-letter ISO codes.

I have zero knowledge of the Chinese languages, so this may be completely off, but from Wikipedia and Ethnologue I'd map chinese=cmn and chinese-traditional=lzh.

So:

# Maps 3-letter ISO language codes to FCC curriculum language names
FCCLangMap = {
    "cmn": "chinese",
    "lzh": "chinese-traditional",
    # ...
}
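
A small sketch of how such a map might be consumed, assuming a hypothetical helper `fcc_lang_folder` that is not part of the PR. Only the two entries proposed above are included; the rest of the map is elided in the thread and is not guessed here.

```python
# Only the two entries proposed in the thread; other languages are omitted.
FCCLangMap = {
    "cmn": "chinese",
    "lzh": "chinese-traditional",
}


def fcc_lang_folder(iso_code: str) -> str:
    """Return the FCC curriculum folder name for a 3-letter language code."""
    try:
        return FCCLangMap[iso_code.lower()]
    except KeyError:
        # Fail loudly rather than silently scraping the wrong language.
        raise ValueError(f"Unsupported language code: {iso_code!r}") from None


folder = fcc_lang_folder("cmn")
```

Raising on an unknown code is one possible design choice; falling back to a default language would be another.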

kelson42 commented 1 year ago

@mdp I guess it would be better to have a lock file for JS deps. Otherwise there is no guarantee that we get the same result when running the scraper twice.

mdp commented 1 year ago

> @mdp I guess it would be better to have a lock file for JS deps. Otherwise there is no guarantee that we get the same result when running the scraper twice.

I'm using 'yarn' with the --frozen-lockfile flag in the latest Dockerfile, but let me know if that's not working. Unless something else is happening at another level, the Vue application being built with Vite should be reproducible as well. I verified it by deleting the dependencies, reinstalling from the lockfile, and doing a shasum on the resulting build, and it came out the same.
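
For anyone wanting to repeat that check in a scripted way, here is a minimal Python sketch that hashes a whole build directory. It is an equivalent of the manual shasum step, not what was actually run; the helper name `tree_digest` is hypothetical, and the build output location is whatever your Vite configuration uses.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Return one SHA-256 digest covering every file path and content under root."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        digest.update(rel.encode("utf-8") + b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Calling `tree_digest` on the build output before and after a clean reinstall from the lockfile should give the same value if the build is reproducible.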

mdp commented 1 year ago

@rgaudin I think there's just the remaining question about is_front and being able to search for Challenges, but all the other items should be resolved. Thanks for all your help on this. It's been several years since I've done Python (and then very little), so sorry for the less-than-Pythonic code.

rgaudin commented 1 year ago

Just saw another glitch: in python-for-everybody, a number of challenges like networking-text-processing have titles that include a colon (:), which breaks the YAML parsing.

See /fcc/curriculum/english/scientific-computing-with-python/python-for-everybody/networking-text-processing.md:

---
id: 5e7b9f0c0b6c005b0e76f074
title: 'Networking: Text Processing'
challengeType: 11
videoId: Pv_pJgVu8WI
bilibiliIds:
  aid: 804442498
  bvid: BV16y4y1j7WW
  cid: 377329124
dashedName: networking-text-processing
---

The rendered title is 'Networking
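
One plausible cause, shown here as a sketch only and not taken from the scraper's code, is splitting each front-matter line on every colon. The helpers `naive_value` and `safer_value` below are hypothetical and use only the standard library.

```python
def naive_value(line: str) -> str:
    """Split on every colon and keep the second piece (drops text after a later colon)."""
    return line.split(":")[1].strip()


def safer_value(line: str) -> str:
    """Split on the first colon only, then drop one pair of matching surrounding quotes."""
    _, _, value = line.partition(":")
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    return value


line = "title: 'Networking: Text Processing'"
print(naive_value(line))  # 'Networking
print(safer_value(line))  # Networking: Text Processing
```

The naive version reproduces the truncated title seen above. A full YAML parser (for example PyYAML's `yaml.safe_load`) would handle quoting and escapes more completely than this hand-rolled split.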

benoit74 commented 1 year ago

I would like to thank you @mdp, this is really great work I think. The last mile seems almost here now.

I especially appreciate the fact that this scraper proves it is totally feasible and easy to host a Vite application (Vue.js here, but that probably does not matter much) inside a ZIM. We briefly discussed it with @kelson42 and @rgaudin a few days ago for the UI enhancements of kolibri, for which I was tempted to migrate to Vue.js for simpler/smaller UI code once the UI becomes a bit dynamic/complex. I believe that you gave a good answer to my questions ^^

mdp commented 1 year ago

Sorry, just now getting back to this. Tackling the current build break, which looks like a minor directory issue.

mdp commented 1 year ago

@rgaudin With regards to the colon issue I need to play around with Python's YAML parsing and figure out if there's an easy fix on this one. I'll try to have this one fixed today as well as the tests.

rgaudin commented 1 year ago

> @rgaudin With regards to the colon issue I need to play around with Python's YAML parsing and figure out if there's an easy fix on this one. I'll try to have this one fixed today as well as the tests.

Yeah I want to merge this now (as soon as I fix the tests) and open a ticket for the colon issue. Does that work for you?

mdp commented 1 year ago

I just pushed a build fix. Merge sounds good. I'll start on the colon issue now.