kiwix / overview

https://kiwix.org

zimit v2. [libzim/libkiwix/warc2zim part] #95

Closed mgautierfr closed 10 months ago

mgautierfr commented 1 year ago

This ticket lists what needs to be done to take the PR https://github.com/openzim/warc2zim/pull/113 from a POC to a real feature.


Specification

Improve the current specification to support warc2zim's requirements.

Following openzim/warc2zim#113, we need to evolve the current kiwix/zim format.

The ZIM file format itself (the binary way to store content) will not evolve; we will evolve the "kiwix" format (what we store and how we interpret it). Although it is the "kiwix" format that evolves, this is still a low-level change, and we may have to change libzim itself (both at reading and creation time) to support this new format.

The main change is :

Aliasing

A WARC file contains revisit records: entries that need to be served with the content of another entry. The current POC uses the H namespace to store redirects that need to be handled as aliases. We can do the same. Or we can do it the way hard links are done: two (or more) entries are content entries and point to the same content (blob/cluster id or redirect id).

Using "hard links" would require adapting the libzim creator side, but no change at all would be needed in the specification or on the reading side. (However, zim-check would need to be adapted, as it would report duplicate content.)
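As a rough illustration of the "hard link" idea, the sketch below models a directory as a list of entries that each point at a (cluster, blob) pair, and groups the paths that share the same pair. The `dirents` table, its field names and the `findAliases` helper are hypothetical and only meant to show why a checker would see such entries as pointing to the same content; this is not libzim's API.

```javascript
// Hypothetical dirent table: each content entry points at a (cluster, blob) pair.
const dirents = [
  { path: "example.org/logo.png", cluster: 3, blob: 7 },
  { path: "cdn.example.org/logo.png", cluster: 3, blob: 7 }, // alias ("hard link")
  { path: "example.org/index.html", cluster: 1, blob: 0 },
];

// Group paths by the blob they point to; groups with more than one path are aliases.
function findAliases(entries) {
  const byBlob = new Map();
  for (const { path, cluster, blob } of entries) {
    const key = `${cluster}:${blob}`;
    byBlob.set(key, [...(byBlob.get(key) ?? []), path]);
  }
  return [...byBlob.values()].filter((paths) => paths.length > 1);
}
```

In this model a checker could tell an intentional alias (same pointer) from accidentally duplicated content (same bytes stored twice), which is the kind of adaptation zim-check might need.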

Fuzzy Matching

Fuzzy matching is a way to transform a (potentially not fixed) url into a fixed, known one.

There are two parts to fuzzy matching:

On the specification side, we need to define how we store the fuzzy matching rules used at reading time. We also need to define who applies them (we need access to the query string: is it libkiwix making several requests to libzim? Or libzim doing the transformation, in which case we need to pass it the query string?)
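To make the idea concrete, here is a minimal sketch of what a stored rule and its application could look like. The rule shape (`match`, `replace`), the `applyFuzzyRules` helper and the `video.example.com` URL are assumptions made for illustration; they are not the format used by wabac.js or warc2zim.

```javascript
// Hypothetical rule list: each rule maps a family of variable URLs to one fixed URL.
const fuzzyRules = [
  {
    // Keep only the video id; drop volatile parameters such as timestamps or tokens.
    match: /^video\.example\.com\/watch\?(?:.*&)?v=([\w-]+).*$/,
    replace: "video.example.com/watch?v=$1",
  },
];

// Return the fixed, known URL for `url`, or `url` itself if no rule applies.
function applyFuzzyRules(url, rules = fuzzyRules) {
  for (const { match, replace } of rules) {
    if (match.test(url)) {
      return url.replace(match, replace);
    }
  }
  return url;
}
```

Whoever applies the rules (libkiwix or libzim) needs the full URL including the query string, since that is usually where the variable part lives.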

Implementation

Warc2zim

Once libzim/libkiwix provide the needed features, we need to adapt warc2zim.

Common url schema

We need to define where (using which url) we store our entries. I suggest:

This way, "origin host" URLs are the same as in a "non-zimit" ZIM file. We also remove the A/H sub-directory, which is a relic of namespaces.

Implementation

Other projects:

Open questions :

mgautierfr commented 1 year ago

As @rgaudin mentions in https://github.com/kiwix/kiwix-android/issues/3485#issuecomment-1727968808 (I had totally missed the behavior explained there), we have to handle external links properly.

Static rewriting

We are (will be) rewriting all links (<scheme>://host.tld/path) to /host.tld/path (made relative to the path of the current content). So all external links are now internal.

The only way I see to avoid that (if we want to avoid it) is to parse the WARC a first time to learn all the entries, and then do the (classic) handling of content but rewrite only links to existing entries (and keep the other links as external links).
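The two-pass idea can be sketched as follows: a first pass collects every recorded "host/path", and the second pass rewrites a link only when its target is in that set. The `knownEntries` set and the `rewriteLink` helper are hypothetical, and the sketch produces root-relative paths rather than paths relative to the current content, to keep it short.

```javascript
// First pass (assumed done elsewhere): every "host/path" found in the WARC.
const knownEntries = new Set(["kiwix.org/en/about-us/", "cdn.example.com/app.js"]);

// Second pass: rewrite a link only when its target exists in the archive.
function rewriteLink(href, entries) {
  let target;
  try {
    target = new URL(href);
  } catch {
    return href; // relative or malformed link: left untouched in this sketch
  }
  const key = target.host + target.pathname + target.search;
  // Known entry: make it internal. Unknown entry: keep it as an external link.
  return entries.has(key) ? "/" + key : href;
}
```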

Dynamic rewrite

We do the same as for static rewriting. But here we are in the browser, and checking whether the entry exists before rewriting means at least one request. It may be better to rewrite everything, make the request to the server, and let the server handle it.

Server handling

If we have a request for a non-existent entry and the path is /host.tld/path, we may want to create a redirect to <scheme>://host.tld/path and send it back to the server. (If we do so, we may not need the complex static rewrite, as we will handle it here.)
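A minimal sketch of that server-side fallback, under loud assumptions: `entries` is an in-memory stand-in for the ZIM lookup, `handleRequest` is a hypothetical helper, and the scheme defaults to `https` because the rewritten path no longer carries it (how to recover the original scheme is left open here).

```javascript
// Hypothetical in-memory stand-in for the ZIM entry lookup.
const entries = new Set(["kiwix.org/index.html"]);

// Decide how to answer a request for `path` (e.g. "/host.tld/some/page").
function handleRequest(path, scheme = "https") {
  const key = path.replace(/^\//, "");
  if (entries.has(key)) {
    return { status: 200, entry: key };
  }
  // Only redirect when the first path segment looks like a host name (contains a dot).
  const firstSegment = key.split("/")[0];
  if (firstSegment.includes(".")) {
    return { status: 302, location: `${scheme}://${key}` };
  }
  return { status: 404 };
}
```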

Questions :

In fact, if we only accept links navigating to other websites and assume they can only appear in HTML pages, we can simply do the "complex" static rewrite and we are good.

Jaifroid commented 12 months ago

@mgautierfr Thank you for documenting your thinking so carefully. I've belatedly read through it.

There's a lot here, but three things stood out for me:

1. Common URL schema

You propose:

From the work I've done with warc2zim, I'm really not sure this is a valid distinction. I have noticed that some ZIMs contain valid resources to a wide range of sites. And if you think about it, this is necessary given that a page may be grabbing its JS from a CDN, or images from another domain owned by the company, and especially for video which is almost always from a different domain but is often embedded in a page, and may be first party (or may be YouTube).

It could get very difficult to decide what is first-party and what is third-party, and I think having a rigid distinction like that could break some sites.

An example: a recent Mozilla Developer Network scrape contains not only pages from MDN, but also several older MDN pages from archive.org that are linked to and scraped and displayed offline in the ZIM inside an archive.org frame! Now, that may be a mistake by the person who launched that scrape, but in other cases it won't be a mistake. I'm not sure the distinction holds. It might be better to design a more flexible format upfront that allows arbitrary numbers of domains to be stored. Currently this is actually quite logical. The domain name is included in the ZIM URL, like C/A/iep.utm.edu, without any distinction or hierarchy about what is first-party and what isn't. Look at the variation here at the beginning of the URL index of the Internet Encyclopedia of Philosophy (several different domains recorded):

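[Image: beginning of the URL index of the Internet Encyclopedia of Philosophy ZIM, showing several different domains]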

2. Usefulness of Headers (pseudo H namespace)

My custom implementation in the PWA is designed mostly to make largely static resources readable (though it can rewrite most links in CSS and JS scripts, just not those that are constructed highly dynamically at run-time unless I'm lucky). Although I mostly ignore the headers, I found that sometimes they are needed. The main use case was to find a redirected resource. Sometimes that information is in the initial response body, but sometimes the server has only sent a redirect header, and there is no Response body. So, I have a recursive lookup: if a requested resource is not found at C/A/some.web.site/some_resource.html?very&cool&one (and there is no response body I can parse), then I launch a lookup for C/H/some.web.site/some_resource.html?very&cool&one, and look in the header for a moved permanently redirect, and follow it if necessary. If the header lookup fails to yield a resource, then I can know that we're dealing with an external resource link that wasn't scraped. In that case, I throw up an external link dialogue box for the user to decide if they want to leave the app and open the link in a browser.
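The recursive lookup described above can be sketched like this. The `zim` map is a hypothetical in-memory stand-in for the archive (C/A/ for bodies, C/H/ for recorded headers), and `lookup` is an illustrative helper, not the actual Kiwix PWA code.

```javascript
// Hypothetical in-memory ZIM: C/A/ holds response bodies, C/H/ holds recorded headers.
const zim = new Map([
  ["C/H/some.web.site/old.html", { status: 301, location: "some.web.site/new.html" }],
  ["C/A/some.web.site/new.html", "<html>new page</html>"],
]);

function lookup(url, depth = 0) {
  if (depth > 5) return { kind: "loop", url }; // guard against redirect loops
  const body = zim.get("C/A/" + url);
  if (body !== undefined) return { kind: "content", url, body };
  // No body found: look in the headers for a redirect and follow it.
  const header = zim.get("C/H/" + url);
  if (header && header.location) return lookup(header.location, depth + 1);
  // Neither body nor redirect: the resource was not scraped, so treat it as external.
  return { kind: "external", url };
}
```

A caller would show the external link dialogue when `kind` is `"external"`.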

Now, while redirect may be the main use case, there are several other reasons to use the headers in more dynamic situations. The Service Worker has the logic that deals with this. I found 18 references to a function response.headers.get() in wabac.js, dealing with these situations, which gives an idea of the contexts in which they are needed. Note that Headers can either be of type "response" or of type "request". There are many more references to response headers (what is mostly stored in H/ namespace) than request headers, though there are some. I focus on response headers here:


// 1. REDIRECTS:

const status = Number(response.headers.get("x-redirect-status") || response.status);
const statusText = response.headers.get("x-redirect-statusText") || response.statusText;

// 2. MIME TYPES / transfer encoding

mime = response.headers.get("Content-Type") || "";
const encoding = response.headers.get("content-encoding");
const te = response.headers.get("transfer-encoding");

// 3. COOKIES

let presetCookie = response.headers.get("x-wabac-preset-cookie") || "";
const setCookie = response.headers.get("Set-Cookie");

// 4. COMPENSATING FOR SW RUNNING IN AN EXTENSION (**could be important for Kiwix JS!**)

// necessary as service worker seem to not be allowed to return a redirect in some circumstances (eg. in extension)
    if ((request.destination === "video" || request.destination === "audio") && request.mode !== "navigate") {
      while (response && (response.status >= 301 && response.status < 400)) {
        const newUrl = new URL(response.headers.get("location"), url);

// 5. FORMS and UPLOADS

// ... in a series of functions dealing with forms and posting content / authorizations
const lengthHeader = response.headers.get('x-ipfs-datasize') || response.headers.get('Content-Length') 
// ... function dealing with uploading files
return response.headers.get('Location')

// 6. FETCH AND RANGE REQUESTS

// ... In the Fetch Range Loaders class
this.canLoadOnDemand = ((response.status === 206) || response.headers.get("Accept-Ranges") === "bytes");
// ... Getting content length of range requests
this.length = Number(response.headers.get("Content-Length"));
let range = response.headers.get("Content-Range");
// ... In the Remote WARC proxy class (there are some comments here referring to bugs in Kiwix Serve!)
let { headers, encodedUrl, date, status, statusText, hasPayload } = headersData;
      if (reqHeaders.has("Range")) {
        const range = reqHeaders.get("Range");
        // ensure uppercase range to avoid bug in kiwix-serve
        reqHeaders = {"Range": range};
      }

// 7. AJAX REQUESTS

try {
      if (this.allowRewrittenCache && !range) {
        const response = await self.caches.match(request);
        if (response && !!response.headers.get(IS_AJAX_HEADER) === isAjax) {
          return response;
        }
      }
    }

My conclusion about Headers

I found the seven broad categories above where Response Headers are needed (and there is some code for Request Headers too). So, ISTM that to deal with the huge variety of situations in which we may have things such as range requests (especially for streaming data), or AJAX or Fetch requests, and the fact that WARC can intercept these and record the responses, it would be risky to ditch the capacity for storing and using the Headers.


3. Video BLOBs or streams of requests and responses?

You ask above whether video is stored (effectively) as BLOBs or as streams (chunks). I think the point of the WARC format is that it could be either. I don't think the fact that the Android app reads BLOBs from the ZIM in a normal (non-WARC ZIM) is relevant. If the Service Worker is doing its job correctly, it will bypass that. All the Service Worker is doing is effectively intercepting requests and providing responses (yes, it has a lot of logic to do transformations, but basically it is just doing what all Service Workers do: there is an event listener on the Fetch event, and the SW does event.respondWith( [Response with Data] )). WARC is just a recorder of Requests and Responses.

So, my experience is that in MOST cases of YouTube videos (the ones I have implemented in the PWA), there is an identifiable MP4 BLOB (after fuzzy URL transformation / reduction). But of course YouTube COULD simply stream video chunks, and have some complex JS reader that recombines them only when the right authentication response has been sent to the server. The WARC format doesn't care about this. It will merely record the authentication response sent to the server and the encoded chunks received, and the piece of JS that recombines the chunks will be happy. And, I think, Kiwix Android will also be happy because it's not reading the video in the way it would read video from a Wikimedia ZIM file. The webview is just making a request, and the response is elicited from the ZIM by the Service Worker's transformation functions, and these are sent back to the WebView, which has a JS player, and all is good (maybe!).

In any case, I don't think it's safe to assume we'll always have a BLOB to play rather than a stream. We need to design Zimit 2.0 in a way that is flexible and future-proof, which means that multimedia content is also just a set of requests and responses.

mgautierfr commented 11 months ago

1. Common URL schema

I think you misunderstood the URL schema. When we scrape (or convert a WARC of) a website (e.g. http://kiwix.org), we know that the main domain is kiwix.org. So when we need to store an entry with a URL:

We can still store any content from any website, without any limitation. It is just that we have one domain which is elided from the entry path, and we know this is the "main" domain of the scraped website.

The main purpose is to avoid having the domain visible in the URL from a user's point of view (http://library.kiwix.org/viewer#kiwix/kiwix.org/en/about-us/ vs http://library.kiwix.org/viewer#kiwix/en/about-us/).

2. Usefulness of Headers (pseudo H namespace)

1. REDIRECTS:

const status = Number(response.headers.get("x-redirect-status") || response.status);
const statusText = response.headers.get("x-redirect-statusText") || response.statusText;

We already have a mechanism for redirection. We should use it. If a WARC record contains a redirection response, we must create a redirect entry. No need for headers for that.

2. MIME TYPES / transfer encoding

For MIME types, as for redirects, we can already store them in the ZIM. Encoding is part of the negotiation between the server and the client. We MUST handle it correctly: we cannot return deflated content if the client can't inflate it (even if we scraped it with a client that can).

3. COOKIES

let presetCookie = response.headers.get("x-wabac-preset-cookie") || "";
const setCookie = response.headers.get("Set-Cookie");

That's an interesting point. As it happens, cookies are the next thing I need to make work. So I will see :)

4. COMPENSATING FOR SW RUNNING IN AN EXTENSION (could be important for Kiwix JS!)

You should be able to get this information (redirect) from a classic ZIM file, as we will store classic redirect entries (or aliases, which will mean even less work on your side).

5. FORMS and UPLOADS

// ... in a series of functions dealing with forms and posting content / authorizations
const lengthHeader = response.headers.get('x-ipfs-datasize') || response.headers.get('Content-Length')

I wonder why you need the lengthHeader. By definition, the server doesn't handle POST requests, so it is somewhat useless to send data to the server. (And in warc2zim, we move all the data of a POST request into the entry path's query string: __wb_method=POST&<post_data>)
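As an illustration of moving POST data into the entry path, the sketch below appends the `__wb_method=POST` marker and the form fields to the query string. The `postEntryPath` helper and the exact encoding of the data are assumptions; warc2zim may serialize the body differently.

```javascript
// Build an entry path for a recorded POST request by appending the method
// marker and the form data to the query string (encoding is an assumption).
function postEntryPath(url, postData) {
  const separator = url.includes("?") ? "&" : "?";
  const encoded = new URLSearchParams(postData).toString();
  return `${url}${separator}__wb_method=POST&${encoded}`;
}
```

With this, a POST to `example.org/search` becomes a plain path lookup that a reader can resolve without handling POST at all.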

// ... function dealing with uploading files
return response.headers.get('Location')

This is the same as a redirect.

6. FETCH AND RANGE REQUESTS

Indeed, this is something we have to handle. But we can move this information into the path, as we do for POST data.

7. AJAX REQUESTS

try {
  if (this.allowRewrittenCache && !range) {
    const response = await self.caches.match(request);
    if (response && !!response.headers.get(IS_AJAX_HEADER) === isAjax) {
      return response;
    }
  }
}

What do you do if there is no cached response? Rewrite the content? If so, that will be handled by warc2zim, which has access to the headers. The server never rewrites content.

it would be risky to ditch the capacity for storing and using the Headers.

We never had the capacity to store and use the headers :) So we ditch nothing :) Adding the feature now, without knowing how to use it, is useless. If we find that we have to store and use headers, we will see at that time.

3. Video BLOBs or streams of requests and responses?

I don't think the fact that the Android app reads BLOBs from the ZIM in a normal (non-WARC ZIM) is relevant. If the Service Worker is doing its job correctly, it will bypass that.

Well, the purpose of zimit v2 is to not have a Service Worker. So no one can do its job correctly (or not).

And, I think, Kiwix Android will also be happy because it's not reading the video in the way it would read video from a Wikimedia ZIM file. The webview is just making a request, and the response is elicited from the ZIM by the Service Worker's transformation functions, and these are sent back to the WebView, which has a JS player, and all is good (maybe!).

If I understand the Android behavior correctly, the purpose is to not use the JS player (or the web player) but the "native player". It allows the video to be played directly by Android native code, bypassing all the app/webview/server/libzim code. But to do this, we need contiguous data.

In any case, I don't think it's safe to assume we'll always have a BLOB to play rather than a stream. We need to design Zimit 2.0 in a way that is flexible and future-proof, which means that multimedia content is also just a set of requests and responses.

I agree, but it has an impact on readers that make this assumption. (And it is a valid one, as we didn't have a way to store different ranges of data in different entries, so we always had one entry per content.)


BTW, here is a small teaser of a ZIM created with the dev version of warc2zim. It is without a service worker and should work without fuzzy matching or any fancy stuff. (Not working, at least: cookies, external link handling.)

rgaudin commented 11 months ago

Common URL schema

I'd also prefer a single way to store entries, for the sake of not having to handle two. Maybe this was chosen to have better-looking paths for the main domain. You discussed mostly resources which are indeed frequently on different domains but first parties on various domains are allowed. We don't use it much but it's perfectly valid to have pages on multiple domains (even if not related) and browsertrix makes no distinction. It just has a concept of seedUrls and it's only in warc2zim that we look at initial URL to set the homepage.

@mgautierfr what's the reason for the two entries format?

Usefulness of Headers

Thank you for laying them all out. It's really useful. We've discussed a couple of them as theoretical possibilities but haven't encountered them in reality.

It all looks like it can be gradually introduced back. We should probably set up a bunch of websites that trigger and use some of those use cases so we can have automated tests.

Video BLOBs or streams of requests and responses?

I've said the same thing a few times but lacked an actual use case to back it up. It's very frequent on my own laptop to see non-blobs being transferred ; and there are multiple competing stream technologies. I think this can be somewhat controlled though at scraping time because most platforms support the various client capabilities that are found in the wild.

Each needs to be implemented, though.

mgautierfr commented 11 months ago

You discussed mostly resources which are indeed frequently on different domains but first parties on various domains are allowed. We don't use it much but it's perfectly valid to have pages on multiple domains (even if not related) and browsertrix makes no distinction

It is still allowed with the schema proposed (and implemented for now).

@mgautierfr what's the reason for the two entries format?

Just to have URLs that look like the ones we are used to.

Storing the host in the entry path (<host.tld>/foo/bar) is "just" needed to avoid conflicts between resources with the same path but a different source (kiwix.org/index.html vs wikipedia.org/index.html). But we can elide one (and only one) domain from the path and conflicts are still avoided (index.html vs wikipedia.org/index.html).

http://public.kymeria.fr/KIWIX/zimit2/kiwix_no_main_domain.zim is the same ZIM without the URL simplification.

rgaudin commented 11 months ago

Yep, I saw your comment just after publishing mine.

Jaifroid commented 11 months ago

Thanks for the explanations and reassurances, @mgautierfr. I hope at least that the research on the use cases of headers was useful. I hadn't understood the logic behind the URL proposal -- I see now that it's just a form of abbreviation, and in fact it works just as well without the abbreviation, so it's optional. Presumably the main use case for abbreviated URLs is in browsers accessing a ZIM via Kiwix Serve, because I don't think in any other context users are particularly aware of URLs (and in many contexts, they can't see them at all).

The main reason for POST requests would be to record visits to sites where a POST is used to get a resource without it being in the URL (as POSTing without relying on querystrings is considered more secure). But I imagine this is a bit unlikely for a ZIM, except for google video, which you've already implemented via a separate process.

Congratulations 🎉 on those ZIM samples. I've tested both in Kiwix JS and in Kiwix PWA, and (apart from a small issue with some hyperlinks having a /C/ in them that should be easy to fix in our Service Worker, that comes from differences in our backend) they are working very well: all JS, CSS, etc. is loading correctly on the landing page, and most hyperlinks work fine. That's certainly remarkable!

kelson42 commented 11 months ago

I'm jumping into this very long discussion. I hope I get it right and make a useful comment. That said, I would really prefer to have one ticket per fundamental change. That said:

rgaudin commented 11 months ago

If we do elide one (main) FQDN, like we do today, we can not fully avoid conflict of that kind:

Good point!

my first impression would be to do a static rewriting - like in other scrapers

Attention, this comparison is too simple: we only do this in select scrapers (sotoki, mwoffliner and maybe wikihow) for which we know we're working off a tiny list of basic nodes. This can't be compared with zimit where possibilities are all those offered by HTML and JS. That's why we rely (or will be relying) on Wombat.js

Not sure what you meant with “static rewriting” but if the goal is the same, the implementation is gonna be different (and more complex): an external link has no other property than “not being in the ZIM”. With Wombat running in the client (to intercept calls), the client side must be able to find out whether an entry is in the ZIM or not.

mgautierfr commented 11 months ago

If we do elide one (main) FQDN, like we do today, we can not fully avoid conflict of that kind: www.kiwix.org/www.cloudfare.com/index.html with www.cloudfare.com/index.html. Yes that sounds improbable, but when this will happen... what should be done? I think eliding the main FQDN is a nice feature, but before continuing with it, I would like to be sure AFAP about the handling of all the edge cases... otherwise lets keep it simple and remove this "optimisation".

I agree. Wombat is too complex and sensitive (at least given my knowledge) to play with it too much. I have made the eliding optional and I'm testing without it.
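The eliding rule and the collision raised above can be reproduced with a small sketch. `entryPath` and its `elideMainHost` flag are hypothetical names used only for illustration; they are not warc2zim's implementation.

```javascript
// Compute the entry path for a URL. When `elideMainHost` is true, the main
// host is dropped from the path; every other host is kept as a prefix.
function entryPath(url, mainHost, elideMainHost = true) {
  const { host, pathname, search } = new URL(url);
  const path = pathname.replace(/^\//, "") + search;
  return elideMainHost && host === mainHost ? path : `${host}/${path}`;
}

const main = "www.kiwix.org";
const first = "https://www.kiwix.org/www.cloudfare.com/index.html";
const second = "https://www.cloudfare.com/index.html";

// With eliding, both URLs map to the same entry path.
const collidesWithEliding = entryPath(first, main) === entryPath(second, main);
// Without eliding, the host prefix keeps them apart.
const collidesWithoutEliding =
  entryPath(first, main, false) === entryPath(second, main, false);
```

Making the eliding optional, as described above, corresponds to passing `false` for the last argument: paths get longer, but the ambiguity disappears.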

my first impression would be to do a static rewriting - like in other scrapers

We (will) do static rewriting. In the example ZIM files, all HTML content (or almost: not the HTML of AJAX responses) is statically rewritten. But we need to dynamically rewrite URLs (coming from JS requests) and content (responses to AJAX requests).

Jaifroid commented 11 months ago

@mgautierfr Having finally managed to integrate the Replay system (with Service Worker) in https://github.com/kiwix/kiwix-js/pull/1173, I have a better understanding of the importance of headers. You wrote above:

WARC Headers contain... headers. Apart from a simple parsing to detect revisit, we (POC) currently don't care and it seems to work. We have to investigate this part. Do we need them ? If yes, how ? (Storage in zim file, handling of headers in the routing...)

I've come to realize (belatedly) that while the headers are not (generally) important for looking up assets from the backend / server, they are potentially important instructions to the user's browser about how to deal with those assets. I apologize in advance if that's really obvious to everyone else, but I think in previous discussions we (or at least I) were focusing on how they might help us look up assets directly (the simple revisits you mention), rather than the fact they tell the browser how to deal with retrieved payloads.

Overview

Currently a client accessing a Zimit article via Kiwix Serve will:

  1. look up the Headers (via a C/H/... ZIM url);
  2. then look up the Response body, assuming the header indicates it has a payload (via a C/A/... ZIM url);
  3. if there is no payload, it will create a Response with the Headers and an empty Uint8Array for the payload;
  4. the Service Worker combines the retrieved Header and the retrieved Payload into a single Response that is returned to the browser;
  5. the browser decides what to do with the Response, based on standard instructions in the Header.

Code

The high-level code that does this in the Replay Service Worker is below. I've added some comments to make it quicker to parse (for a human), but the comment about a bug in Kiwix serve, and the if (this.type === "kiwix") block are in the original.

Obviously, there's a lot more going on behind this top-level code, but for me it gives the clearest picture of what is happening, and therefore how / if to emulate use of headers in Zimit 2.0. Again, sorry if this is stating the obvious, but it seems useful to document it, even if only for the benefit of others finding this:

// The main function for getting the resource from the ZIM / server
async getResource(request, prefix) {
    const { url, headers } = request.prepareProxyRequest(prefix);
    let reqHeaders = headers;

    if (this.type === "kiwix") {

      // Get the headers from the ZIM
      // console.debug('Attempting to resolve canonical headers for url', url);
      let headersData = await this.resolveHeaders(url);

      // If we couldn't find the requested header, do some "fuzzy matching"
      if (!headersData) {
        for (const newUrl of fuzzyMatcher.getFuzzyCanonsWithArgs(url)) {
          // A bunch of code that deals with fuzzy matching...
        }
      }

      // If we still can't find the headers, show a Not Found page
      if (!headersData) {
        // use custom error page for navigate events [ORIGINAL COMMENT]
        if (this.notFoundPageUrl && request.mode === "navigate") {
          // Code that deals with the built-in Not Found page
          ....
        }
        return null;
      }

      // Define Header data, some of which we have to construct if missing
      let { headers, encodedUrl, date, status, statusText, hasPayload } = headersData;

      // Deal with Range requests if there is a Range header
      if (reqHeaders.has("Range")) {
        const range = reqHeaders.get("Range");
        // ensure uppercase range to avoid bug in kiwix-serve [ORIGINAL COMMENT]
        reqHeaders = {"Range": range};
      }

      let payload = null;
      let response = null;

      if (hasPayload) {
        // Get the response from the ZIM in the case of Kiwix, if the headers indicated a payload was recorded
        response = await fetch(this.sourceUrl + "A/" + encodedUrl, {headers: reqHeaders});
        if (response.body) {
          payload = new c(response.body.getReader(), false);
        }

        // Deal with partial responses (probably from a Range request above)
        if (response.status === 206) {
          status = 206;
          statusText = "Partial Content";
          headers.set("Content-Length", response.headers.get("Content-Length"));
          headers.set("Content-Range", response.headers.get("Content-Range"));
          headers.set("Accept-Ranges", "bytes");
        }
      }

      // Deal with responses that don't have a payload, i.e. pure headers
      if (!payload) {
        payload = new Uint8Array([]);
      }
      if (!date) {
        date = new Date();
      }
      if (!headers) {
        headers = new Headers();
      }

      const isLive = false;
      const noRW = false;

      // Finally, construct a Response from the collected data and return it to the browser
      return new ArchiveResponse({payload, status, statusText, headers, url, date, noRW, isLive});

    }
  }

Jaifroid commented 11 months ago

@mgautierfr A potentially interesting observation from my work on enabling Replay support in Kiwix JS-family apps. NB, this is not a recommendation to change approach, just an observation that might be useful as a fallback. So, I'm just putting it out there. Feel free to shoot this down! I think your approach is ultimately more universal, as it creates a standard ZIM that existing readers should be able to use without changes to their backends.

I realized belatedly that wabac.js can run as a Web Worker instead of as a Service Worker. When you run it as a Worker, it writes ww init in console instead of sw init, and adjusts itself accordingly. In that mode, you simply postMessage URLs you want to transform to it, and it replies with the ZIM URL once it has transformed it (so long as it's initialized the right way).

Now I wondered if the webview used by Kiwix Desktop can run Web Workers. They're pretty old technology. Even IE11 can run a Web Worker (though obviously not this one, because it uses very advanced JS, lots of async, etc.). And if that's the case, can the Webview catch all Fetch requests within its scope (like what a Service Worker does)? Again, if that's the case, ISTM that there might be a "simple" solution for supporting current Zimit ZIMs. (Well, nothing is ever "simple"...)

Of course, this would also require work in the Kiwix Desktop backend, and probably, like in Kiwix JS, would require hosting your own custom copy of wabac.js (I renamed it replayWorker.js for clarity) and topIframe.html, and also forcing the settings in the Web Worker so it recognizes the path of URLs sent to it. I had to write some code that alters the internal state of the Web Worker's configuration, and if necessary changes it each time there's a request, so that I could support multi-ZIM requests coming in arbitrarily. There's probably a standard way of doing it, but there's no API documentation, and much of how it initializes itself is pretty obscure (almost no comments in the code), so I wrote a simple routine that does the job fast.

Caveats:

kelson42 commented 10 months ago

Closing in favour of https://github.com/openzim/zimit/issues/193. See also https://github.com/orgs/openzim/projects/10