openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Insecure (`http:`) landing page causing issues with `developer.mozilla.org` scrapes #730

Closed Jaifroid closed 5 months ago

Jaifroid commented 1 year ago

The last good scrape of this useful and popular site was in February (developer.mozilla.org_en_all_2023-02.zim). There is a more recent scrape on the dev download server from September (developer.mozilla.org_en_all_2023-09.zim), but the landing page of this ZIM fails to load.

I've looked into the reason for this, and it appears that the landing page was captured under an insecure URL (http://developer.mozilla.org/en-US/, note the http: protocol). A WARC header record (C/H/developer.mozilla.org/en-US/, see header content below) shows the server answering with a 301 redirect to a secure URL with an explicit port, https://developer.mozilla.org:443/en-US/ (note the :443 port). However, the corresponding page has not been scraped, most likely because the explicit port puts it out of scope.

The scrape could possibly be fixed by ensuring the starting URL is https:, though that may be a simplistic assumption.

WARC/1.0
WARC-Type: response
WARC-Record-ID: 
WARC-Target-URI: http://developer.mozilla.org/en-US/
WARC-Date: 2023-09-04T20:10:54Z
WARC-IP-Address: 34.111.97.67
Content-Type: application/http; msgtype=response
Content-Length: 206
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:SJ7PZ62BHV4C53EVRWCEKPXZVBBQDDSV

HTTP/1.1 301 Moved Permanently
Cache-Control: private
Location: https://developer.mozilla.org:443/en-US/
Content-Length: 0
Date: Mon, 04 Sep 2023 20:10:54 GMT
Content-Type: text/html; charset=UTF-8
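To illustrate why the `:443` in the `Location` header matters, here is a minimal Python sketch showing that the redirect target and the expected landing URL differ as strings but coincide once a scheme's default port is dropped. The helper `strip_default_port` is hypothetical, written for this illustration only; it is not taken from Zimit or any crawler.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports per scheme; an explicit default port is redundant in a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url without its port when that port is the scheme's default.

    Hypothetical helper for illustration. Userinfo (user:pass@) is not
    preserved, which is acceptable for the URLs discussed here.
    """
    parts = urlsplit(url)
    if parts.port is None or parts.port != DEFAULT_PORTS.get(parts.scheme):
        return url
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literals need their brackets restored
        host = f"[{host}]"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


redirect_target = "https://developer.mozilla.org:443/en-US/"  # Location header
expected_page = "https://developer.mozilla.org/en-US/"

print(redirect_target == expected_page)                      # False
print(strip_default_port(redirect_target) == expected_page)  # True
```

A scope check that compares raw URL strings would treat the two as different pages, which is consistent with the redirect target being skipped.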

Pinging @rgaudin because I believe this ZIM is one of the Zimit test-case ZIMs.

rgaudin commented 1 year ago

They definitely have difficulties running a website 😉 https://github.com/openzim/zimit/issues/171

rgaudin commented 1 year ago

@Jaifroid the starting URL is already https: https://developer.mozilla.org

Jaifroid commented 1 year ago

Yes, this is strange. The C/A/index.html of this ZIM (referenced by W/mainPage) is shown below. As you can see, it contains a script that sets window.mainUrl to a secure page, https://developer.mozilla.org/en-US/, which is what the reader attempts to load. But the corresponding page, C/A/developer.mozilla.org/en-US/, is not in the ZIM. Instead, there is a header, C/H/developer.mozilla.org/en-US/, which contains the 301 Moved Permanently WARC record above: a record of a visit to an insecure page (http:) that the server redirected to a secure page and port.

Would allowing https://developer.mozilla.org:443/en-US/ (with the port) as in scope potentially fix the issue? I realize it would be a workaround...

<html>
<head>
<meta charset="utf-8"/>
<script>
window.mainUrl = "https://developer.mozilla.org/en-US/";
</script>
<script src="./load.js"></script>
</head>
<body>
<div id="loading" style="width: 100%; text-align: center">
    <svg width="38" height="38" viewBox="0 0 38 38" xmlns="http://www.w3.org/2000/svg" stroke="#fff">
        <g fill="none" fill-rule="evenodd">
            <g transform="translate(1 1)" stroke-width="2">
                <path d="M36 18c0-9.94-8.06-18-18-18" stroke="gray">
                    <animateTransform
                        attributeName="transform"
                        type="rotate"
                        from="0 18 18"
                        to="360 18 18"
                        dur="1s"
                        repeatCount="indefinite"/>
                </path>
            </g>
        </g>
    </svg>
</div>
<div id="error"></div>
</body>
</html>
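
The lookup described in this comment can be sketched in Python. This assumes, based only on the paths quoted in this thread, that a page is stored under `C/A/<host><path>` and its WARC headers under `C/H/<host><path>`, with the scheme dropped. The function names and the `entries` set are hypothetical stand-ins for a real ZIM reader.

```python
from urllib.parse import urlsplit


def zim_paths_for(url: str) -> dict:
    """Map a URL to the content and header paths assumed from this thread."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path  # scheme is dropped; a port, if any, stays
    return {"content": "C/A/" + key, "header": "C/H/" + key}


def diagnose(main_url: str, entries: set) -> str:
    """Classify what the archive holds for the URL the reader tries to load."""
    paths = zim_paths_for(main_url)
    if paths["content"] in entries:
        return "ok"
    if paths["header"] in entries:
        return "header-only"  # e.g. only a redirect record was captured
    return "missing"


# Entries as reported for the 2023-09 ZIM in this issue.
entries = {"C/A/index.html", "C/H/developer.mozilla.org/en-US/"}
print(diagnose("https://developer.mozilla.org/en-US/", entries))  # header-only
```

The "header-only" case matches the symptom reported here: a redirect record exists for the landing URL, but no page content does.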

rgaudin commented 1 year ago

> Would allowing https://developer.mozilla.org:443/en-US/ (with the port) as in scope potentially fix the issue? I realize it would be a workaround...

That's what I think but I need to test the regexp locally.

It's a satisfactory solution given the website is misbehaving using unusual methods.
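
As a rough idea of what such a regexp might look like when tested locally, here is a minimal Python sketch. The pattern and the `in_scope` helper are assumptions made for illustration; they are not the expression or configuration actually used for this recipe.

```python
import re

# Hypothetical scope pattern: the https origin, with an optional explicit
# default port, followed by the /en-US/ prefix.
SCOPE_PATTERN = re.compile(r"^https://developer\.mozilla\.org(?::443)?/en-US/")


def in_scope(url: str) -> bool:
    """Return True when url falls under the assumed crawl scope."""
    return SCOPE_PATTERN.match(url) is not None


candidates = [
    "https://developer.mozilla.org/en-US/",
    "https://developer.mozilla.org:443/en-US/",   # the redirect target
    "http://developer.mozilla.org/en-US/",        # insecure variant
    "https://developer.mozilla.org:8443/en-US/",  # non-default port
]
for url in candidates:
    print(url, in_scope(url))
```

Making the port group optional keeps the usual URLs in scope while also admitting the `:443` form the server redirects to. The insecure and non-default-port variants stay out of scope.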

kelson42 commented 5 months ago

@Jaifroid Is this ticket still valid with the latest dev version? https://dev.library.kiwix.org/#lang=&q=Mdn

Jaifroid commented 5 months ago

Ah no, this was fixed a long time ago... Both for Zimit1 and for Zimit2.