Closed benoit74 closed 5 months ago
Is the scope correctly set? Because that video is from YouTube rather than kiwix.org, navigation to it might be blocked.
Might be because this is not a regular <video /> embed but an <iframe />. I believe browsertrix considers those resources (and thus not subject to scoping) but it's worth checking if there's no request to YT.
I tried many scopes, including a custom one with both youtube.com and kiwix.org domains included.
Might be the <iframe /> which is the issue, you are right. Or the fact that one has to click on the button to make the iframe appear and load the iframe into the DOM (before that, the video URL is only in the data-video attribute of an img).
How do you wanna check if there is no request to YT? I already checked in the WARCs and there is no request to YT.
I didn't realize a click was needed to create the iframe on DOM. That's definitely the issue. This is not standard YT behavior and certainly not handled in browsertrix. We need to emulate that click…
As for network requests, all requests go through pywb (set as proxy). Maybe there's a flag/env for pywb to print requests?
If that's useful enough for debugging, we could also imagine embedding a script in zimit that just conditionally prints/records requests and forwards them to pywb. We'd set it as the proxy.
We need to emulate that click…
How do you do that? With a custom behavior?
Maybe there's a flag/env for pywb to print requests?
How would that be different from WARCs content? It is already quite straightforward to display all requests stored in WARC files, so if there is no difference I would rather add a warc2zim flag to display all requests found in WARCs while processing them.
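To make the "is there any request to YT?" check concrete, here is a minimal sketch that filters a list of request URLs (for example, the WARC-Target-URI values read from the WARC records) down to those pointing at YouTube-related hosts. The function names and the host list are assumptions made for illustration, not part of warc2zim or zimit, and the list is probably not exhaustive.

```typescript
// Hosts commonly involved in serving YouTube embeds.
// Assumption: this list is illustrative and likely incomplete.
const YOUTUBE_HOST_SUFFIXES = [
  "youtube.com",
  "youtube-nocookie.com",
  "youtu.be",
  "ytimg.com",
  "googlevideo.com",
];

// Extract the lowercase host from an absolute URI, or null if there is none
// (e.g. "urn:uuid:..." record identifiers).
function hostOf(uri: string): string | null {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]*)/i.exec(uri);
  if (!match) return null;
  const authority = match[1];
  // Drop any "user:pass@" prefix, then any ":port" suffix.
  const hostPort = authority.slice(authority.lastIndexOf("@") + 1);
  return hostPort.replace(/:\d+$/, "").toLowerCase();
}

// True when the host is one of the listed domains or a subdomain of one.
function isYouTubeHost(host: string): boolean {
  return YOUTUBE_HOST_SUFFIXES.some((suffix) => {
    const dotted = "." + suffix;
    return host === suffix || host.slice(-dotted.length) === dotted;
  });
}

// Keep only the URIs whose host looks YouTube-related.
function findYouTubeRequests(targetUris: string[]): string[] {
  return targetUris.filter((uri) => {
    const host = hostOf(uri);
    return host !== null && isYouTubeHost(host);
  });
}
```

An empty result for a crawl of the page would match what was observed in the WARCs here: the embed was never requested at all, rather than requested and then dropped.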
How do you do that? With a custom behavior?
I don't know.
Yeah this looks like a pretty classic case for a custom behavior! We have a new Tutorial on how to create them, it'd be great to see if it's useful and get any feedback on it :) https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md
This is not a scraper issue, so closing this. We have to develop the custom behavior if we really want to get the video into the ZIM; that's a "content team" issue then.
I'm a bit puzzled: what is "special" about the kiwix.org website? Standard CMS + standard video platform!
I reopen the issue just to be sure I get it right.
It's not standard: the video is not on the page, it's an iframe that is injected into a popup that is displayed upon clicking a button.
Hence the need for a custom behavior to simulate the user click on the button.
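For illustration, a minimal sketch of what such a custom behavior could look like. The class shape (static `id`, `isMatch`, `init`, and an async generator `run`) follows my reading of the browsertrix-behaviors tutorial and should be checked against it. The `img[data-video]` selector is an assumption based on the description above, and `clickVideoTriggers`, `KiwixVideoPopupBehavior`, the 2-second wait, and the yielded object shape are hypothetical choices, not an existing API.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface ClickableLike {
  click(): void;
}
interface RootLike {
  querySelectorAll(selector: string): ArrayLike<ClickableLike>;
}

// Assumption: the element that opens the popup is the img carrying data-video.
const VIDEO_TRIGGER_SELECTOR = "img[data-video]";

// Click every matching trigger under `root` and return how many were clicked.
function clickVideoTriggers(
  root: RootLike,
  selector: string = VIDEO_TRIGGER_SELECTOR,
): number {
  const triggers = root.querySelectorAll(selector);
  let clicked = 0;
  for (let i = 0; i < triggers.length; i++) {
    triggers[i].click();
    clicked++;
  }
  return clicked;
}

// Hypothetical custom behavior; the static/instance members mirror the shape
// described in the tutorial, as far as I understand it.
class KiwixVideoPopupBehavior {
  static id = "Kiwix video popup";

  // Only run on kiwix.org pages (returns false outside a browser).
  static isMatch(): boolean {
    const loc = (globalThis as any).location;
    return !!loc && /(^|\.)kiwix\.org$/.test(String(loc.hostname));
  }

  static init() {
    return {};
  }

  async *run(_ctx: any) {
    const clicked = clickVideoTriggers((globalThis as any).document);
    // Arbitrary pause so the injected iframe can start loading.
    await new Promise<void>((resolve) => {
      (globalThis as any).setTimeout(resolve, 2000);
    });
    // Assumption: the yielded value is only used for logging.
    yield { msg: "clicked " + clicked + " video trigger(s)" };
  }
}
```

The click logic is kept in a separate function so it can be tried against a fake root object before wiring it into the crawler. Whether the YouTube iframe is then captured still depends on scoping, so the WARC should be checked again afterwards.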
Shall we close this again? (There is nothing to do on the scraper side; it is just a customization needed for this particular site, which can be done without modifying the scraper at all. Custom behaviors are just a JS file that can be injected on the CLI.)
Actually it's still closed!
Zimit version: 1.6.2 (not yet released, just to have the fix for --depth 0 + crawler 0.12.2)
While doing a ZIM of https://kiwix.org, the Youtube video on the home page is not present in the ZIM
How to reproduce:
Activating all behaviors does not help:
I had a look at the WARCs content and the request to Youtube was not made.
Running only the crawler with official 0.12.2 image does not help (Youtube video is still not in the WARC):
I'm going to open an upstream ticket