openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

[zimit2] Dynamically generated content derived from JSON not always showing in zimit2 version of MDN #181

Closed Jaifroid closed 5 months ago

Jaifroid commented 6 months ago

Comparing the zimit2 run of MDN (https://dev.library.kiwix.org/viewer#developer.mozilla.org_en_all_2024-02/) with last month's zimit classic run (https://library.kiwix.org/viewer#developer.mozilla.org_en_all_2024-01), some of the dynamically generated information, specifically the browser compatibility tables at the end of each JS API article, is missing in the zimit2 version. Please see the screenshot below. Although the screenshot uses the PWA, I have verified the same issue on Kiwix Serve, and you can check it by comparing these two links and scrolling to the bottom of each page: zimit1 showOpenFilePicker article -> zimit2 showOpenFilePicker article.

image

Looking at the browser console, the error seems to be caused by this JSON.parse() exception. Possibly a Wombat-generated URL that is not correctly interpreted?

image

Jaifroid commented 6 months ago

This issue is probably related to #160.

Jaifroid commented 6 months ago

I've found a number of pages where the Browser Compatibility tables do show correctly, e.g. RegExp JavaScript, and equally a number of pages where they don't, another example being Performance Web API. So this must depend on whether the JSON(P) data contain something that the Python rewriter misses, or on whether the JSON data are static or contain dynamic content (i.e. JSONP).

It seems a bit random. I guess we can't rule out incomplete scraping without more data points.

benoit74 commented 6 months ago

I've investigated a bit more, and what I've found seems to indicate that this is more a JS issue than a JSON issue (BCD below means "Browser Compatibility Data").

I've reused the showOpenFilePicker (not working) and the RegExp (working) examples and I've compared live with zimit2 URLs.

It looks like the BCD has been scraped properly and is present inside the ZIM in both cases, but for some reason there is a JS error for showOpenFilePicker in zimit2 which prevents the BCD from being displayed (the JSON file containing the BCD is not even fetched).

Jaifroid commented 6 months ago

OK, thanks for the deep dive. Possibly the JSON.parse() error in the console (which, based on random sampling, seems to occur whenever a page is affected) is a symptom rather than a cause. It's trying to parse something but finds invalid JSON data (including variables) where it was expecting properly formatted JSON.

Speculation EDIT: THIS IS NOT THE CASE HERE, SEE NEXT COMMENT:

Maybe the JSON.parse() routine has received an HTML document (possibly a 302) in response to whatever JSON request was made. This is something I saw a lot when making my own custom reader (without the Replay system) for zimit1 back in the day: I would request an image and get HTML instead (because I was being sent a 302 or a 404 page), which complicated the code a lot, as I had to follow the redirects before serving the content. The same happens with PDFs in some cases: ready.gov ZIMs, for example, contain some "links" to PDFs that in fact return HTML documents redirecting to where the PDF file actually is. If you don't follow that in the backend, you get garbage or a false "not found" (see https://github.com/openzim/warc2zim/issues/178#issuecomment-1937914479).

As I say, it's speculation, and may well not be the specific case here. As they say, eliminate the easy explanations first.
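To make the redirect scenario above concrete, here is a minimal sketch of a backend that follows redirects before handing content to a JSON parser. It is only an illustration: the `ARCHIVE` dictionary, its paths, and `fetch_following_redirects` are hypothetical names invented for this example, not anything from warc2zim or a Kiwix reader.

```python
import json

# Hypothetical in-memory stand-in for archived responses:
# path -> (status, headers, body). Paths are invented for illustration.
ARCHIVE = {
    "/data/old.json": (302, {"Location": "/data/new.json"}, "<html>Moved</html>"),
    "/data/new.json": (200, {"Content-Type": "application/json"}, '{"ok": true}'),
}

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(path, max_hops=5):
    """Return (status, headers, body), following redirects up to max_hops."""
    for _ in range(max_hops):
        status, headers, body = ARCHIVE[path]
        if status in REDIRECT_CODES and "Location" in headers:
            path = headers["Location"]
            continue
        return status, headers, body
    raise RuntimeError("too many redirects")


# Parsing the raw 302 body would fail, because it is HTML rather than JSON.
# Following the redirect first yields the real JSON payload.
status, headers, body = fetch_following_redirects("/data/old.json")
data = json.loads(body)
```

Without the redirect step, `json.loads` would be handed the HTML body of the 302 entry and raise `json.JSONDecodeError`, which is the kind of failure speculated about above.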

Jaifroid commented 6 months ago

OK, here's what's actually happening. It's rather curious:

  1. The affected articles contain a <script> block with id hydration (see first screenshot). This block is used to populate at least the framework if not the content of the BCD tables. Note, as you can see from the screenshot, that Wombat has added a bunch of variables at the top of the script block.
  2. Part of a script is in charge of reading the hydration block as JSON (screenshot 2 below). It gets the hydration element, grabs its textContent, and parses it as JSON. Unfortunately, it has also grabbed the Wombat code as textContent, and attempting to parse this as JSON naturally fails.
  3. Further in the same script, we find the message that appears in the browser "BCD tables only load in the browser" (screenshot 3). As the file is minified, it is difficult to work out the precise relationship between the JSON.parse() error and this function deciding it cannot show the BCD tables.
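The failure described in steps 1 and 2 can be reproduced in isolation. The sketch below uses Python's `json.loads` as a stand-in for the page's `JSON.parse(element.textContent)`. The preamble string is a made-up placeholder, not the actual code Wombat injects, and the JSON payload is invented for illustration.

```python
import json

# Placeholder for the variables prepended to the script block.
# This is NOT the real injected code, only something of the same shape.
INJECTED_PREAMBLE = "var injected_by_rewriter = function(name) { return self[name]; };\n"

# Invented example of what a hydration block might contain.
HYDRATION_JSON = '{"doc": {"title": "Window: showOpenFilePicker() method"}}'


def parse_hydration(text_content):
    """Mimic JSON.parse(element.textContent); return None if parsing fails."""
    try:
        return json.loads(text_content)
    except json.JSONDecodeError:
        return None


clean = parse_hydration(HYDRATION_JSON)                       # parses fine
broken = parse_hydration(INJECTED_PREAMBLE + HYDRATION_JSON)  # None: not valid JSON
```

Any JavaScript placed ahead of the JSON payload makes the whole text content invalid JSON, so the parse fails regardless of what the payload itself contains.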

It would be interesting to compare why the same procedure doesn't cause problems in the Replay-controlled version (zimit1). Surely Wombat should patch .textContent to ignore its own script in both versions. It seems that sometimes this happens and sometimes it doesn't, though it doesn't look like an obvious race condition since it always seems to affect the same documents. This doesn't seem like it has anything to do with the Service Worker.
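One way a rewriter could avoid polluting such blocks is to inject only into scripts that will actually be executed as JavaScript. The sketch below assumes the hydration block declares a non-JavaScript `type` (for example `application/json`), which is not confirmed in this thread. The function names are hypothetical, and this is not claimed to be how Wombat or warc2zim behaves.

```python
from typing import Optional

# Script types treated as executable JavaScript in this sketch (an assumption,
# deliberately not exhaustive).
JS_TYPES = {"", "text/javascript", "application/javascript", "module"}


def should_inject_preamble(script_type: Optional[str]) -> bool:
    """Return True only for script blocks that will run as JavaScript."""
    normalized = (script_type or "").strip().lower()
    return normalized in JS_TYPES


def rewrite_script(script_type: Optional[str], content: str, preamble: str) -> str:
    """Prepend the preamble to executable scripts; leave data blocks untouched."""
    if should_inject_preamble(script_type):
        return preamble + content
    return content


# A data block keeps its exact content, so parsing its text as JSON still works.
untouched = rewrite_script("application/json", '{"a": 1}', "var x = 1;\n")
rewritten = rewrite_script(None, "console.log(1);", "var x = 1;\n")
```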

image

image

image

Jaifroid commented 6 months ago

Further info: the Browser Compatibility Data are interpreted in the following webpack chunked file in the January ZIM:

C/A/developer.mozilla.org/static/js/browser-compatibility-table.c203ab60.chunk.js

The February zimit2 ZIM contains C/developer.mozilla.org/static/js/browser-compatibility-table.2fc46158.chunk.js, but neither it nor anything like it is called from the ZIM for the Window.showOpenFilePicker page.
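Since the chunk hash differs between the two ZIMs, a check for "it or anything like it" has to match on the chunk name rather than the full filename. The following is a rough sketch of such a check. The helper names are hypothetical, and it assumes the hash is always eight hexadecimal characters, as in the two paths quoted above.

```python
import re

# Matches "<name>.<8 hex chars>.chunk.js" at the end of a path.
CHUNK_RE = re.compile(r"(?P<name>[^/]+)\.(?P<hash>[0-9a-f]{8})\.chunk\.js$")


def chunk_name(path):
    """Return the chunk name without its hash, or None if the path does not match."""
    match = CHUNK_RE.search(path)
    return match.group("name") if match else None


def page_references_chunk(html, name):
    """Return True if the HTML mentions any hashed variant of the named chunk."""
    pattern = re.compile(re.escape(name) + r"\.[0-9a-f]{8}\.chunk\.js")
    return bool(pattern.search(html))


january = "C/A/developer.mozilla.org/static/js/browser-compatibility-table.c203ab60.chunk.js"
february = "C/developer.mozilla.org/static/js/browser-compatibility-table.2fc46158.chunk.js"

# Both paths reduce to the same chunk name despite different hashes and prefixes.
same_chunk = chunk_name(january) == chunk_name(february)
```

Running `page_references_chunk` over the HTML and script content of an affected page would show whether any variant of the chunk is referenced at all.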

kelson42 commented 5 months ago

Fixed by https://github.com/openzim/warc2zim/pull/192