Open RavanJAltaie opened 10 months ago
Clicking on a download link, you get an error message:
Sorry, the url https://coding-for-tomorrow.de/wp-content/uploads/2020/11/Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf is not found on this server
As you can see, this URL doesn't share the same prefix as the URL of the recipe (https://coding-for-tomorrow.de/wp-content is not within https://coding-for-tomorrow.de/downloads/). You need to change the scope to allow scraping such URLs.
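To illustrate the point about prefixes, here is a minimal Python sketch of a prefix check. It is an assumption about how such a check behaves, not the crawler's actual code: `prefix_of` and `in_prefix_scope` are hypothetical helper names, and the seed URL is the recipe URL quoted above.

```python
from urllib.parse import urlsplit


def prefix_of(seed: str) -> str:
    """Return the 'directory' part of the seed URL (everything up to the last '/')."""
    parts = urlsplit(seed)
    path = parts.path or "/"  # a bare host has an empty path, treat it as "/"
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_prefix_scope(seed: str, url: str) -> bool:
    """Hypothetical check: the URL is in scope if it starts with the seed's directory."""
    return url.startswith(prefix_of(seed))


seed = "https://coding-for-tomorrow.de/downloads/"
pdf = (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf"
)

print(prefix_of(seed))             # the directory the seed lives in
print(in_prefix_scope(seed, pdf))  # whether the PDF falls under that directory
```

Under these assumptions the PDF is rejected, because its path starts with /wp-content/ rather than /downloads/.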
@rgaudin I tried changing the scope to Any, page, and prefix, and the resulting file is still the same. Could you please tell me which value I should use for the scope parameter?
That's exactly why some documentation is needed. All those scopes have different effects.
You haven't tested Any and that's good. I'd strongly advise against it as it would crawl anything. prefix is the default and page is somewhat similar.
I advise you try with host (will grab anything under coding-for-tomorrow.de) and see how that goes. I think custom is often appropriate, but it requires specifying includes and excludes, which is very tedious.
There's no documentation on those scopes; the code is at https://github.com/webrecorder/browsertrix-crawler/blob/165a9787af8a7dce6b0acb5f91e6803ef525fd5b/util/seeds.js#L75
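For the custom scope mentioned above, the idea of includes and excludes can be sketched roughly as follows. This is only an illustration in Python, assuming both are lists of regular expressions and that excludes take precedence; the linked seeds.js is the authority on the real behaviour. The function name and all patterns below are hypothetical.

```python
import re


def in_custom_scope(url: str, includes: list[str], excludes: list[str]) -> bool:
    """Hypothetical check: reject on any exclude match, else accept on any include match."""
    if any(re.search(pattern, url) for pattern in excludes):
        return False
    return any(re.search(pattern, url) for pattern in includes)


# Hypothetical patterns: allow the downloads pages and the uploaded files,
# but skip ZIP archives.
includes = [
    r"^https://coding-for-tomorrow\.de/downloads/",
    r"^https://coding-for-tomorrow\.de/wp-content/uploads/",
]
excludes = [r"\.zip$"]

pdf = (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf"
)
print(in_custom_scope(pdf, includes, excludes))
```

Every directory that hosts needed content has to be listed by hand, which is where the tedium comes from.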
I tried changing the scope. With host, the website was scraped, but without the needed projects. Can you check please? https://farm.openzim.org/recipes/codingfortomorrow_de_all
I disabled the recipe and marked the resulting file for deletion.
Now that the configured URL is https://coding-for-tomorrow.de, what did you expect by changing the scope from the default (prefix) to host?
prefix scope will retrieve everything in the same directory, so everything in https://coding-for-tomorrow.de
host scope will retrieve everything on the same host, so everything in https://coding-for-tomorrow.de
I don't get what you expected by making this change.
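The comparison can be made concrete with a small Python sketch. It assumes simplified rules (prefix means "starts with the seed's directory", host means "same host as the seed") and is not the crawler's implementation; the helper names and the YouTube URL are hypothetical.

```python
from urllib.parse import urlsplit


def in_prefix_scope(seed: str, url: str) -> bool:
    """Hypothetical prefix rule: URL starts with the seed's directory."""
    parts = urlsplit(seed)
    path = parts.path or "/"  # a bare host has an empty path, treat it as "/"
    directory = path[: path.rfind("/") + 1]
    return url.startswith(f"{parts.scheme}://{parts.netloc}{directory}")


def in_host_scope(seed: str, url: str) -> bool:
    """Hypothetical host rule: URL is served from the same host as the seed."""
    return urlsplit(url).netloc == urlsplit(seed).netloc


seed = "https://coding-for-tomorrow.de"
urls = [
    "https://coding-for-tomorrow.de/downloads/",
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf",
    "https://www.youtube.com/watch?v=example",  # hypothetical external video URL
]

for url in urls:
    print(in_prefix_scope(seed, url), in_host_scope(seed, url), url)
```

With the seed at the site root, both rules accept the same pages on coding-for-tomorrow.de and both reject the external URL, so switching between them changes nothing in this sketch.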
That being said, I analyzed the issue a bit:
All that being said, as you can see, a developer would need to invest significant effort to improve the scraping of this website, and I'm not even sure it would succeed (at the very least, there is a significant chance that content like the YouTube videos will not be available).
@Popolechien what are your views on this, do you think this is worth the effort?
It's in German, not a core target audience. We can drop it I think.
The related issue is marked as upstream: https://github.com/openzim/zim-requests/issues/460
Let's keep this issue open. I doubt we will make any progress in the coming months due to lack of resources, but the ZIM request is legitimate and I've identified a potential solution. We should fix this at some point: it is neither impossible nor an immense effort, just not a priority for now.
The Coding For Tomorrow recipe ran successfully, but the file in the dev library is incomplete: none of the internal links are clickable. https://farm.openzim.org/recipes/codingfortomorrow_de_all
https://dev.library.kiwix.org/viewer#codingfortomorrow_de_all_2023-08/A/coding-for-tomorrow.de/downloads/
Can you check please?