openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Coding for tomorrow incomplete file #215

Open RavanJAltaie opened 10 months ago

RavanJAltaie commented 10 months ago

The Coding for Tomorrow recipe ran successfully, but the file in the dev library is incomplete: none of the internal links are clickable. https://farm.openzim.org/recipes/codingfortomorrow_de_all

https://dev.library.kiwix.org/viewer#codingfortomorrow_de_all_2023-08/A/coding-for-tomorrow.de/downloads/

Can you check please?

rgaudin commented 10 months ago

Clicking on a download link, you get an error message:

Sorry, the url https://coding-for-tomorrow.de/wp-content/uploads/2020/11/Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf is not found on this server

As you can see, this URL doesn't share the same prefix as the URL of the recipe (https://coding-for-tomorrow.de/wp-content is not within https://coding-for-tomorrow.de/downloads/). You need to change the scope to allow scraping such URLs.
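To make the prefix mismatch concrete, here is a minimal Python sketch. It assumes the prefix rule is "everything up to the last slash of the recipe URL"; the helper names prefix_of and in_prefix_scope are hypothetical and this is not the crawler's actual code.

```python
from urllib.parse import urlsplit

# Hypothetical helper: assumed prefix rule, i.e. the seed URL cut after its last "/".
def prefix_of(seed_url: str) -> str:
    parts = urlsplit(seed_url)
    path = parts.path[: parts.path.rfind("/") + 1] or "/"
    return f"{parts.scheme}://{parts.netloc}{path}"

# Hypothetical helper: a URL is in scope only if it starts with the seed's prefix.
def in_prefix_scope(seed_url: str, candidate: str) -> bool:
    return candidate.startswith(prefix_of(seed_url))

SEED = "https://coding-for-tomorrow.de/downloads/"
PDF = (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf"
)

print(prefix_of(SEED))             # https://coding-for-tomorrow.de/downloads/
print(in_prefix_scope(SEED, PDF))  # False: /wp-content/ is outside /downloads/
```

Under this assumption, the PDF is rejected because its path starts with /wp-content/ rather than /downloads/, which matches the "not found" error above.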

RavanJAltaie commented 9 months ago

@rgaudin I tried changing the scope to Any, page, and prefix, and the resulting file is still the same. Could you please let me know which value I should use for the scope parameter?

rgaudin commented 9 months ago

That's exactly why some documentation is needed. All those scopes have different effects.

You haven't tested Any and that's good. I'd strongly advise against it as it would crawl anything. prefix is the default and page is somewhat similar.

I advise you to try host (it will grab anything under coding-for-tomorrow.de) and see how that goes. I think custom is often appropriate, but it requires specifying includes and excludes, which is very tedious.
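As a rough illustration of how these scopes differ, the sketch below models prefix, host and any in Python. It is a simplified model built only from the descriptions in this thread, not the browsertrix-crawler implementation; the in_scope function and the EXTERNAL URL are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical, simplified model of three scope types (assumption, not crawler code).
def in_scope(scope: str, seed_url: str, candidate: str) -> bool:
    seed, cand = urlsplit(seed_url), urlsplit(candidate)
    if scope == "any":
        # Accepts every URL, which is why it can crawl far beyond the target site.
        return True
    if scope == "host":
        # Accepts anything served from the same host as the seed.
        return cand.hostname == seed.hostname
    if scope == "prefix":
        # Accepts only URLs under the seed's directory on the same host.
        base = seed.path[: seed.path.rfind("/") + 1] or "/"
        return cand.hostname == seed.hostname and cand.path.startswith(base)
    raise ValueError(f"unsupported scope in this sketch: {scope}")

SEED = "https://coding-for-tomorrow.de/downloads/"
PDF = (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf"
)
EXTERNAL = "https://example.org/some-page"  # hypothetical off-site link

for scope in ("prefix", "host", "any"):
    print(scope, in_scope(scope, SEED, PDF), in_scope(scope, SEED, EXTERNAL))
# prefix False False
# host True False
# any True True
```

In this model, host is the narrowest of the three scopes that still accepts the PDF, while any also follows the off-site link.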

There's no documentation on those scopes; the code is at https://github.com/webrecorder/browsertrix-crawler/blob/165a9787af8a7dce6b0acb5f91e6803ef525fd5b/util/seeds.js#L75

RavanJAltaie commented 7 months ago

I tried changing the scopes. With host, the website was scraped, but without the needed projects. Can you check please? https://farm.openzim.org/recipes/codingfortomorrow_de_all

I disabled the recipe and marked the resulting file for deletion.

benoit74 commented 7 months ago

Now that the configured URL is https://coding-for-tomorrow.de, what did you expect from changing the scope from the default (prefix) to host?

I don't get what you expected by making this change.

That being said, I analyzed the issue a bit:

All that being said, as you can see, a developer would need significant effort to improve the scraping of this website, and I'm not even sure it would succeed (there is at least a significant chance that things like the YouTube videos will not be available).

@Popolechien what are your views on this, do you think this is worth the effort?

Popolechien commented 7 months ago

It's in German, not a core target audience. We can drop it I think.

RavanJAltaie commented 5 months ago

The related issue is marked as upstream: https://github.com/openzim/zim-requests/issues/460

benoit74 commented 5 months ago

Let's keep this issue open. I doubt we will make any progress in the coming months due to lack of resources, but the ZIM request is legitimate and I've identified a potential solution. We should fix this at some point: it is neither impossible nor an immense effort, just not a priority for now.