openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Video on kiwix.org homepage is not retrieved #247

Closed benoit74 closed 5 months ago

benoit74 commented 1 year ago

Zimit version: 1.6.2 (not yet released, just to have the fix for --depth 0 + crawler 0.12.2)

While doing a ZIM of https://kiwix.org, the Youtube video on the home page is not present in the ZIM

How to reproduce:

zimit --url="https://kiwix.org/fr/" --depth 0 --keep --name kiwix_org 

Activating all behaviors does not help:

zimit --url="https://kiwix.org/fr/" --depth 0 --keep --name kiwix_org --behaviors autoscroll,autoplay,autofetch,siteSpecific 

I had a look at the WARCs content and the request to Youtube was not made.

Running only the crawler with official 0.12.2 image does not help (Youtube video is still not in the WARC):

crawl --depth 0 --url https://kiwix.org/fr/ --cwd /output/.tmph919m5n3

I'm going to open an upstream ticket

Jaifroid commented 12 months ago

Is the scope correctly set? Because that video is from YouTube rather than kiwix.org, navigation to it might be blocked.

rgaudin commented 12 months ago

Might be because this is not a regular <video /> embed but an <iframe />. I believe browsertrix considers those resources (and thus not subject to scoping) but it's worth checking if there's no request to YT.

benoit74 commented 12 months ago

I tried many scopes, including a custom one with both youtube.com and kiwix.org domains included. Might be the <iframe /> which is the issue, you are right. Or the fact that one has to click on the button to make the iframe appear and load the iframe into the DOM (before that, the video URL is only in the data-video attribute of an img). How do you wanna check if there is no request to YT? I already checked in the WARCs and there is no request to YT.

rgaudin commented 12 months ago

I didn't realize a click was needed to create the iframe on DOM. That's definitely the issue. This is not standard YT behavior and certainly not handled in browsertrix. We need to emulate that click…

As for network requests, all requests go through pywb (set as proxy). Maybe there's a flag/env for pywb to print requests?

If that's useful enough for debugging, we could also imagine embedding a script in zimit that just conditionally prints/records requests and forwards them to pywb. We'd set it as the proxy.

benoit74 commented 12 months ago

> We need to emulate that click…

How do you do that? With a custom behavior?

> Maybe there's a flag/env for pywb to print requests?

How would that be different from WARCs content? It is already quite straightforward to display all requests stored in WARC files, so if there is no difference I would rather add a warc2zim flag to display all requests found in WARCs while processing them.
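To illustrate the kind of check meant here, below is a minimal sketch of listing the request targets found in a WARC. It assumes the WARC has already been decompressed to plain text, and it only scans for `WARC-Target-URI` header lines rather than parsing records properly, so it is a debugging aid and not what warc2zim does. The function names `listTargetUris` and `hasRequestTo` are made up for this example.

```typescript
// Collect the WARC-Target-URI header value of every record found in an
// already-decompressed WARC. This is a plain line scan, not a real WARC
// parser: a payload line that happens to start with the same header name
// would be picked up as well.
function listTargetUris(warcText: string): string[] {
  const uris: string[] = [];
  for (const line of warcText.split(/\r?\n/)) {
    const match = /^WARC-Target-URI:\s*(\S+)/i.exec(line);
    const value = match?.[1];
    if (value !== undefined) {
      // Some writers wrap the URI in angle brackets; strip them if present.
      uris.push(value.replace(/^<|>$/g, ""));
    }
  }
  return uris;
}

// True if at least one recorded URI points at `domain` or one of its subdomains.
function hasRequestTo(uris: string[], domain: string): boolean {
  return uris.some((uri) => {
    try {
      const host = new URL(uri).hostname;
      return host === domain || host.endsWith("." + domain);
    } catch {
      // Not a parseable absolute URL (e.g. a urn: without a host); ignore it.
      return false;
    }
  });
}
```

With the crawl above, `hasRequestTo(listTargetUris(text), "youtube.com")` would be expected to return `false`, which matches what I saw when inspecting the WARCs by hand.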

rgaudin commented 12 months ago

> How do you do that? With a custom behavior?

I don't know.

tw4l commented 11 months ago

> > How do you do that? With a custom behavior?
>
> I don't know.

Yeah this looks like a pretty classic case for a custom behavior! We have a new Tutorial on how to create them, it'd be great to see if it's useful and get any feedback on it :) https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md

benoit74 commented 5 months ago

This is not a scraper issue, so I'm closing this. We have to develop the custom behavior if we really want the video to make it into the ZIM; that's a "content team" issue then.

kelson42 commented 5 months ago

I'm a bit puzzled: what is "special" about the kiwix.org website? Standard CMS + standard video platform!

kelson42 commented 5 months ago

I reopen the issue just to be sure I get it right.

rgaudin commented 5 months ago

It's not standard: the video is not on the page, it's an iframe that is injected in a popup that is displayed upon click on a button.

benoit74 commented 5 months ago

Hence the need for a custom behavior to simulate the user click on the button.

Shall we close this again? (There is nothing to do on the scraper side; it is just a customization needed for this particular site, which can be done without modifying the scraper at all. Custom behaviors are just a JS file that can be injected on the CLI.)
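For whoever picks this up, here is a rough sketch of what such a behavior could look like. It loosely follows the class shape described in the browsertrix-behaviors tutorial linked above (a static `id`, `isMatch()`, `init()` and an async generator `run()`), but I have not run it against the crawler. The class name, the `img[data-video]` selector, the 5 second wait and the shape of the yielded status objects are all assumptions to be checked against the real page markup and the tutorial.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface ClickableLike {
  click(): void;
}
interface QueryRootLike {
  querySelector(selector: string): ClickableLike | null;
}

// Hypothetical selector: the thread only says the video URL sits in the
// data-video attribute of an img; the real markup may need something else.
const VIDEO_TRIGGER_SELECTOR = "img[data-video]";

// Click the element that makes the page inject the video iframe.
// Returns true if a trigger was found and clicked, false otherwise.
function clickVideoTrigger(
  root: QueryRootLike,
  selector: string = VIDEO_TRIGGER_SELECTOR,
): boolean {
  const trigger = root.querySelector(selector);
  if (trigger === null) {
    return false;
  }
  trigger.click();
  return true;
}

class KiwixVideoClickBehavior {
  // Name under which the behavior would show up in crawler logs (assumed).
  static id = "KiwixVideoClick";

  // Only run on kiwix.org pages; returns false when there is no browser location.
  static isMatch(): boolean {
    const loc = (globalThis as any).location;
    return !!loc && /(^|\.)kiwix\.org$/.test(String(loc.hostname));
  }

  static init() {
    return { state: {} };
  }

  async *run(_ctx: unknown): AsyncGenerator<{ msg: string }> {
    const doc = (globalThis as any).document as QueryRootLike;
    const clicked = clickVideoTrigger(doc);
    yield { msg: clicked ? "clicked video trigger" : "video trigger not found" };
    // Leave the injected iframe some time to load so its requests can be
    // recorded (the 5 s value is arbitrary).
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
```

The click logic is kept in a separate `clickVideoTrigger` function so it can be tried in the browser console on https://kiwix.org/fr/ before wiring it into the crawler.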

kelson42 commented 5 months ago

Actually it's still closed!