Works only on fresh browser

Popolechien commented 2 years ago

Trying to open ncert-audiobooks_en_all_2022-07, I get the following error message: Sorry, the url https://ciet.nic.in/pages.php?id=audiobook&ln=en is not found on this server

BUT I get a different message if I open it from a "Clean" browser (empty cache: firefox or chrome, or incognito mode)

Not Found
The requested URL "/content/ncert-audiobooks_en_all_2022-07/A/ciet.nic.in/pages.php" was not found on this server.

Make a full text search for [pages.php](https://dev.library.kiwix.org/search?content=ncert-audiobooks_en_all_2022-07&pattern=pages.php)

Clicking on the full search link actually returns a few results, and from then on the home page appears no problem.

kelson42 commented 2 years ago

If there is a bug here, it should be a bug in kiwix-serve. We should not have to clean any cache.

Popolechien commented 2 years ago

Same behaviour with IMSMA recipe: see https://dev.library.kiwix.org/content/mwiki_en_all_maxi_2022-08

kelson42 commented 2 years ago

@Popolechien Is the problem that you get a different result after cleaning the cache? Or something else?

Popolechien commented 2 years ago

@kelson42 I would say that the main problem is that the files can't be read outright, and even when I can read it after a while the cache issue appears. Might be two separate problems but I'm not entirely sure.

rgaudin commented 2 years ago

We've seen this already (and I believe from the very beginning of zimit) but never could really pin it down as it only affects some ZIMs and trying to isolate the issue or the cache feature you end up not being able to reproduce it.

Right now, on my main browser, I am affected with the above link, but a clean browser is not and I can access the content. What's difficult is that we don't know and can't control exactly what's cached in the browser: SWs, for instance, are not unregistered automatically. The same goes for the IndexedDB databases that zimit uses.

I had never tried this ZIM before and it was not working on my main browser on first try… so the cause may not be entirely ZIM-specific but deployment-related (since I've accessed many on dev.library.kiwix.org).

@ikreymer your help is requested

kelson42 commented 2 years ago

Indeed, we have had a few tickets around caching of SW-based ZIM files. I would wait until https://github.com/kiwix/libkiwix/issues/650 is implemented before digging in.

rgaudin commented 2 years ago

Ah I forgot to paste the log part of me accessing said ZIM in dev.library, using my main FF browser (not working)

======================
Requesting : 
full_url  : /ncert-audiobooks_en_all_2022-07/
method    : GET (0)
version   : HTTP/1.1
request#  : 128660
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
 - accept-encoding : 'gzip, deflate, br'
 - accept-language : 'en-US,en;q=0.8,fr-FR;q=0.5,fr;q=0.3'
 - dnt : '1'
 - host : 'dev.library.kiwix.org'
 - referer : 'https://dev.library.kiwix.org/content/ncert-audiobooks_en_all_2022-07/A/ciet.nic.in/pages.php?id=audiobook&ln=en'
 - sec-fetch-dest : 'document'
 - sec-fetch-mode : 'navigate'
 - sec-fetch-site : 'same-origin'
 - sec-fetch-user : '?1'
 - upgrade-insecure-requests : '1'
 - user-agent : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0'
 - x-forwarded-for : '196.200.95.159'
 - x-forwarded-host : 'dev.library.kiwix.org'
 - x-forwarded-port : '443'
 - x-forwarded-proto : 'https'
 - x-forwarded-scheme : 'https'
 - x-real-ip : '196.200.95.159'
 - x-request-id : 'b7f69c417569e81448bc900a9bab45e7'
 - x-scheme : 'https'
arguments :
Parsed : 
full_url: /ncert-audiobooks_en_all_2022-07/
url   : /ncert-audiobooks_en_all_2022-07/
acceptEncodingGzip : 1
has_range : 0
is_valid_url : 1
.............
Response :
httpResponseCode : 302
headers :
 - Location: '/content/ncert-audiobooks_en_all_2022-07/'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
Request time : 0.000260s
----------------------
======================
Requesting : 
full_url  : /content/ncert-audiobooks_en_all_2022-07/
method    : GET (0)
version   : HTTP/1.1
request#  : 128661
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
 - accept-encoding : 'gzip, deflate, br'
 - accept-language : 'en-US,en;q=0.8,fr-FR;q=0.5,fr;q=0.3'
 - dnt : '1'
 - host : 'dev.library.kiwix.org'
 - referer : 'https://dev.library.kiwix.org/content/ncert-audiobooks_en_all_2022-07/A/ciet.nic.in/pages.php?id=audiobook&ln=en'
 - sec-fetch-dest : 'document'
 - sec-fetch-mode : 'navigate'
 - sec-fetch-site : 'same-origin'
 - sec-fetch-user : '?1'
 - upgrade-insecure-requests : '1'
 - user-agent : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0'
 - x-forwarded-for : '196.200.95.159'
 - x-forwarded-host : 'dev.library.kiwix.org'
 - x-forwarded-port : '443'
 - x-forwarded-proto : 'https'
 - x-forwarded-scheme : 'https'
 - x-real-ip : '196.200.95.159'
 - x-request-id : '36ae21d84926671f26f90d06f1465010'
 - x-scheme : 'https'
arguments :
Parsed : 
full_url: /content/ncert-audiobooks_en_all_2022-07/
url   : /content/ncert-audiobooks_en_all_2022-07/
acceptEncodingGzip : 1
has_range : 0
is_valid_url : 1
.............
** running handle_content
Response :
httpResponseCode : 302
headers :
 - Location: '/content/ncert-audiobooks_en_all_2022-07/A/index.html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
Request time : 0.000200s
----------------------
======================
Requesting : 
full_url  : /content/ncert-audiobooks_en_all_2022-07/A/undefinedH/ciet.nic.in/pages.php?id=audiobook&ln=en
method    : GET (0)
version   : HTTP/1.1
request#  : 128663
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate, br'
 - accept-language : 'en-US,en;q=0.8,fr-FR;q=0.5,fr;q=0.3'
 - dnt : '1'
 - host : 'dev.library.kiwix.org'
 - referer : 'https://dev.library.kiwix.org/content/ncert-audiobooks_en_all_2022-07/A/sw.js?replayPrefix=&root=ncertaudiobooks_en_all_2022-07'
 - sec-fetch-dest : 'empty'
 - sec-fetch-mode : 'cors'
 - sec-fetch-site : 'same-origin'
 - user-agent : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0'
 - x-forwarded-for : '196.200.95.159'
 - x-forwarded-host : 'dev.library.kiwix.org'
 - x-forwarded-port : '443'
 - x-forwarded-proto : 'https'
 - x-forwarded-scheme : 'https'
 - x-real-ip : '196.200.95.159'
 - x-request-id : '551d3716a6262cf4f76eac7d83c08f58'
 - x-scheme : 'https'
arguments :
Parsed : 
full_url: /content/ncert-audiobooks_en_all_2022-07/A/undefinedH/ciet.nic.in/pages.php?id=audiobook&ln=en
url   : /content/ncert-audiobooks_en_all_2022-07/A/undefinedH/ciet.nic.in/pages.php?id=audiobook&ln=en
acceptEncodingGzip : 1
has_range : 0
is_valid_url : 1
.............
** running handle_content
Failed to find A/undefinedH/ciet.nic.in/pages.php?id=audiobook&ln=en
Response :
httpResponseCode : 404
headers :
 - Content-Type: 'text/html; charset=utf-8'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
 - Content-Encoding: 'gzip'
 - Vary: 'Accept-Encoding'
Request time : 0.003001s
----------------------
======================
Requesting : 
full_url  : /content/ncert-audiobooks_en_all_2022-07/A/undefinedH/ciet.nic.in/pages.php?
method    : GET (0)
version   : HTTP/1.1
request#  : 128667
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate, br'
 - accept-language : 'en-US,en;q=0.8,fr-FR;q=0.5,fr;q=0.3'
 - dnt : '1'
 - host : 'dev.library.kiwix.org'
 - referer : 'https://dev.library.kiwix.org/content/ncert-audiobooks_en_all_2022-07/A/sw.js?replayPrefix=&root=ncertaudiobooks_en_all_2022-07'
 - sec-fetch-dest : 'empty'
 - sec-fetch-mode : 'cors'
 - sec-fetch-site : 'same-origin'
 - user-agent : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0'
 - x-forwarded-for : '196.200.95.159'
 - x-forwarded-host : 'dev.library.kiwix.org'
 - x-forwarded-port : '443'
 - x-forwarded-proto : 'https'
 - x-forwarded-scheme : 'https'
 - x-real-ip : '196.200.95.159'
 - x-request-id : '43eef189d9ebc2b7da61e2fbb1a3a01f'
 - x-scheme : 'https'
arguments :
Parsed : 
full_url: /content/ncert-audiobooks_en_all_2022-07/A/undefinedH/ciet.nic.in/pages.php?
url   : /content/ncert-audiobooks_en_all_2022-07/A/undefinedH/ciet.nic.in/pages.php?
acceptEncodingGzip : 1
has_range : 0
is_valid_url : 1
.............
** running handle_content
Failed to find A/undefinedH/ciet.nic.in/pages.php?
Response :
httpResponseCode : 404
headers :
 - Content-Type: 'text/html; charset=utf-8'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
 - Content-Encoding: 'gzip'
 - Vary: 'Accept-Encoding'
Request time : 0.004009s
----------------------
======================
Requesting : 
full_url  : /content/ncert-audiobooks_en_all_2022-07/A/sw.js
method    : GET (0)
version   : HTTP/1.1
request#  : 128669
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate, br'
 - accept-language : 'en-US,en;q=0.8,fr-FR;q=0.5,fr;q=0.3'
 - cache-control : 'max-age=0'
 - dnt : '1'
 - host : 'dev.library.kiwix.org'
 - if-none-match : '"1660421246748244328/cz"'
 - sec-fetch-dest : 'serviceworker'
 - sec-fetch-mode : 'same-origin'
 - sec-fetch-site : 'same-origin'
 - service-worker : 'script'
 - user-agent : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0'
 - x-forwarded-for : '196.200.95.159'
 - x-forwarded-host : 'dev.library.kiwix.org'
 - x-forwarded-port : '443'
 - x-forwarded-proto : 'https'
 - x-forwarded-scheme : 'https'
 - x-real-ip : '196.200.95.159'
 - x-request-id : '4f63276f460a85bf014b3919aa44b2ee'
 - x-scheme : 'https'
arguments :
 - replayPrefix : 
 - root : ncertaudiobooks_en_all_2022-07
Parsed : 
full_url: /content/ncert-audiobooks_en_all_2022-07/A/sw.js
url   : /content/ncert-audiobooks_en_all_2022-07/A/sw.js
acceptEncodingGzip : 1
has_range : 0
is_valid_url : 1
.............
Response :
httpResponseCode : 304
headers :
 - Vary: 'Accept-Encoding'
 - Access-Control-Allow-Origin: '*'
 - ETag: '"1660421246748244328/cz"'
 - Cache-Control: 'max-age=2723040, public'
Request time : 0.000351s
----------------------

As you can see, some requests are made to an incorrect URL because the domain is undefined
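For readers unfamiliar with how a literal "undefined" ends up in a path like A/undefinedH/...: a minimal sketch, assuming the replay code builds the path by string concatenation from a prefix that was never set. The helper names and the exact prefix + "H/" layout are hypothetical and are not taken from the replayer's source.

```javascript
// Hypothetical helper: builds the path of an "H/" entry for an archived URL.
// The parameter names and layout are assumptions made for illustration only.
function headersPath(prefix, archivedUrl) {
  // String concatenation coerces a missing value to the text "undefined".
  return prefix + "H/" + archivedUrl;
}

// Defensive variant: refuse to build a path when the prefix is missing.
function safeHeadersPath(prefix, archivedUrl) {
  if (typeof prefix !== "string") {
    throw new TypeError("prefix is not set; refusing to build a URL");
  }
  return prefix + "H/" + archivedUrl;
}

// A prefix that was never initialised produces the shape seen in the log.
const broken = headersPath(undefined, "ciet.nic.in/pages.php");
console.log(broken);
```

Failing early, as in safeHeadersPath, would turn a silent 404 into an explicit error that is easier to trace.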

Jaifroid commented 2 years ago

I am seeing cases of this bug (I think) also with https://library.kiwix.org/courses.lumenlearning.com_en_all_2021-03 . It works fine in a fresh Firefox (and also in Kiwix JS PWA), but I cannot load any page from this ZIM when accessing it in Chromium (which has accessed the site before), neither the online version nor served via a local Kiwix Serve:

I don't know for sure if this is the same bug. If it is, then it's clearly not ZIM-specific. I'm sure we can get to the root cause of this. My hunch is it is something to do with the redirection code embedded into every page, but it needs tracing/debugging. Reason for this hunch is that Kiwix JS PWA can read these files fine, but that reader deliberately bypasses the JS redirection to the Service Worker, and has its own algorithm for finding the home page of the ZIM (since it can't use the Service Worker provided in the ZIM for this purpose).

rgaudin commented 2 years ago

Ah ! Interesting clue ; thank you

kelson42 commented 1 year ago

New kiwix-serve 3.4.0, with revamped cache strategy, has been released. Would be good to see if this fixes this bug.

ristein commented 1 year ago

Thanks for the info. Will it be available in Launchpad soon? I don't wanna install manually if it will.

Popolechien commented 1 year ago

@kelson42 I don't know how to provoke the issue (it just appears somewhat randomly) but I've been poking around a good number of times and so far, so good.

rgaudin commented 1 year ago

> @kelson42 I don't know how to provoke the issue (it just appears somewhat randomly) but I've been poking around a good number of times and so far, so good.

I have the same feedback but haven't looked at these files in ages. Now that both library.kiwix.org and dev-library use the new cache, we'll see if we stop getting reports of this.

kelson42 commented 1 year ago

Closing the ticket and crossing fingers this won't reappear.

Jaifroid commented 1 year ago

Well I just visited https://library.kiwix.org/viewer#courses.lumenlearning.com_en_all_2021-03/A/courses.lumenlearning.com/catalog/boundlesscourses in Edge Chromium and got:

But if I open a new InPrivate instance, the landing page shows correctly:

I may well have visited this page in the same browser before, so it may be an old copy of the Service Worker interfering.

kelson42 commented 1 year ago

@Jaifroid Can you please get from the browser the corresponding curl requests so we can better identify what are exactly the differences in the HTTP requests? I think as well you can check in your browser if you don't have an old service worker running?

Actually, I'm not knowledgeable about the lifecycle of service workers. How does that work? How do we ensure that a service worker will be removed or refreshed?

Jaifroid commented 1 year ago

@kelson42 The browser won't run curl directly, as that's not a browser API, but it will make a network request which I can monitor in dev tools.

Regarding the Service Worker: the API is designed to update a Service Worker as soon as the browser detects one byte of difference between its cached copy and the online copy, so if a Service Worker is updated, the update gets pulled by the browser pretty quickly (a few seconds). What sometimes goes wrong is that the Service Worker may cache old copies of files that it needs, so most Service Workers have code to handle deleting the Cache API caches and opening new ones at the same time that the Service Worker itself is updated.
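The cache clean-up pattern described above can be sketched as follows. This is a generic illustration, not the code of the Service Worker shipped in Zimit ZIMs; the cache names and prefix are hypothetical.

```javascript
// Hypothetical names: bump the version whenever the cached assets change.
const CACHE_PREFIX = "replay-assets-";
const CURRENT_CACHE = CACHE_PREFIX + "v2";

// Pure helper: pick the caches that belong to this app but are out of date.
// Caches without our prefix are left alone, because the Cache API is shared
// by everything served from the same origin.
function staleCacheNames(allNames, prefix, currentName) {
  return allNames.filter(
    (name) => name.startsWith(prefix) && name !== currentName
  );
}

// Register the clean-up only when actually running inside a Service Worker.
if (typeof ServiceWorkerGlobalScope !== "undefined" &&
    self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener("activate", (event) => {
    // Keep the worker alive until every stale cache has been deleted.
    event.waitUntil(
      caches.keys().then((names) =>
        Promise.all(
          staleCacheNames(names, CACHE_PREFIX, CURRENT_CACHE)
            .map((name) => caches.delete(name))
        )
      )
    );
  });
}
```

The activate event fires when a new version of the worker takes over, which is why stale caches are usually dropped at that point.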

I'll see what I can find out.

Jaifroid commented 1 year ago

Here are Network requests that fail when accessing this page in a normal browser window (not InPrivate), followed by the successful requests when opening the same URL InPrivate. You can clearly see that the error arises in the Service Worker sw.js, and that this has something to do with a proxy: URI scheme which is not a secure origin and is not supported by the Fetch API.

Jaifroid commented 1 year ago

And when I examine the Service Worker, the error occurs on this line (pretty-printed line):

The circled text is the value of this.sourceUrl, which as you see contains proxy:../.

One very odd thing is that there is no Service Worker running in the InPrivate window (i.e. in the window that successfully loads the page). Or if it's running, I can't find it:

Jaifroid commented 1 year ago

@rgaudin This is puzzling. Has the way the pages are served changed recently? Are we doing some server-side processing using Python Zimit modules, rather than relying on the browser doing all the processing in the Service Worker? If so, it would explain why an old Service Worker is trying to capture proxy: requests (and failing) whereas a completely fresh browser simply loads the URLs transparently, without a Service Worker, because they are being processed in the backend on the server?

Of course I can delete my copy of the Service Worker, but if I do, the opportunity to diagnose what is going wrong may be lost (temporarily?).

rgaudin commented 1 year ago

That's the problem with this ticket ; we've never been able to isolate it exactly because we get varying results trying to limit it to a reduced set of reproducing steps… and because we don't know exactly what the replayer code does.

No, nothing changed. Hard to tell what's happening: getting the raw article from the ZIM ? Viewing online version? Being served by SW without it being visible in the browser… Keep in mind that Firefox doesn't allow SW in private mode for instance which is a pain to debug.

Jaifroid commented 1 year ago

OK thanks. For the avoidance of doubt, these screenshots are all the result of accessing the online version (library.kiwix.org) rather than accessing it using a local Kiwix Serve or any other local code. I'll proceed to delete the Service Worker and see under what conditions it comes back. We know that it must run in some form or other in order to transform absolute links into relative links that the backend can handle. As you know (and for the benefit of others), it's doing a bunch of regular expression transformations and DOM-based transformations on URLs, in combination with the injected wombat.js script. We can see the many regexes that are applied for example in the source code here:

  /** @type {RegExp} */
  this.hostnamePortRe = /^[\w-]+(\.[\w-_]+)+(:\d+)(\/|$)/;

  /** @type {RegExp} */
  this.ipPortRe = /^\d+\.\d+\.\d+\.\d+(:\d+)?(\/|$)/;

  /** @type {RegExp} */
  this.workerBlobRe = /__WB_pmw\(.*?\)\.(?=postMessage\()/g;
  /** @type {RegExp} */
  this.rmCheckThisInjectRe = /_____WB\$wombat\$check\$this\$function_____\(.*?\)/g;

  /** @type {RegExp} */
  this.STYLE_REGEX = /(url\s*\(\s*[\\"']*)([^)'"]+)([\\"']*\s*\))/gi;

  /** @type {RegExp} */
  this.IMPORT_REGEX = /(@import\s*[\\"']*)([^)'";]+)([\\"']*\s*;?)/gi;

  /** @type {RegExp} */
  this.IMPORT_JS_REGEX = /^(import\s*\(['"]+)([^'"]+)(["'])/i;

  /** @type {RegExp} */
  this.no_wombatRe = /WB_wombat_/g;

  /** @type {RegExp} */
  this.srcsetRe = /\s*(\S*\s+[\d.]+[wx]),|(?:\s*,(?:\s+|(?=https?:)))/;

  /** @type {RegExp} */
  this.cookie_path_regex = /\bPath='?"?([^;'"\s]+)/i;

  /** @type {RegExp} */
  this.cookie_domain_regex = /\bDomain=([^;'"\s]+)/i;

  /** @type {RegExp} */
  this.cookie_expires_regex = /\bExpires=([^;'"]+)/gi;

  /** @type {RegExp} */
  this.SetCookieRe = /,(?![|])/;

  /** @type {RegExp} */
  this.IP_RX = /^(\d)+\.(\d)+\.(\d)+\.(\d)+$/;

Some transformation appears to be done on the article's source code (when its request is trapped by the Service Worker), and some is done on each further https:// request that the browser generates after the article is injected in the iframe. Something is going wrong if the article is generating requests to insecure resources like proxy://, or if https:// requests are being transformed server-side into proxy:// requests, because the browser will of course block those and often will generate non-helpful errors as a security measure.
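To make the regex-based rewriting more concrete, here is a minimal sketch that applies the STYLE_REGEX quoted above to a CSS snippet. Only the regular expression comes from the excerpt; the rewriteUrl rule, the prefix and the function names are hypothetical and do not reproduce what wombat.js actually does.

```javascript
// Regular expression copied from the quoted source excerpt.
const STYLE_REGEX = /(url\s*\(\s*[\\"']*)([^)'"]+)([\\"']*\s*\))/gi;

// Hypothetical rewrite rule: https://host/path becomes <prefix>host/path.
function rewriteUrl(url, prefix) {
  const match = /^https?:\/\/(.+)$/i.exec(url);
  // Relative URLs are left untouched in this sketch.
  return match ? prefix + match[1] : url;
}

// Rewrite every url(...) reference found in a CSS string.
function rewriteCss(css, prefix) {
  return css.replace(STYLE_REGEX, (whole, open, url, close) =>
    open + rewriteUrl(url.trim(), prefix) + close
  );
}

const css = 'body { background: url("https://example.com/img/a.png"); }';
console.log(rewriteCss(css, "/content/demo/A/"));
```

The three capture groups keep the opening url(" and the closing ") intact, so only the address in the middle is replaced.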

Jaifroid commented 1 year ago

After some debugging, it looks like an error is being introduced in the load.js script. As you will see if you click on that link, it contains the following object that is sent to the Worker for processing:

It sets the sourceUrl as proxy:../, which is an illegal source. If I change this on-the-fly to '../', then the page loads correctly.

Do you have control over this load.js @rgaudin?

rgaudin commented 1 year ago

Yes https://github.com/openzim/warc2zim/blob/main/src/warc2zim/templates/load.js I believe ; is that it?

Jaifroid commented 1 year ago

Yes, that seems to be the one! I don't make any promises about the fix, I only did a quick test and only on one browser, manually substituting "../" for "proxy:../" on that line on-the-fly. There could be other things causing the issue, or this could introduce an unforeseen issue, but it's the only clue I currently have based on what is causing the exception in a Chromium browser. If you could try editing that file and producing a test scrape, we could at least test the theory empirically.

It looked to me as if "proxy" might have been intended to be a variable, but got hard-coded here. On the other hand someone must know what "proxy" is referring to, whether it had some utility at some time, or was intended to be substituted server-side. What's sure is that a URL such as proxy:../courseslumenlearning.com.... does not use a secure protocol, and it is probably malformed.

There is still the puzzle that this error manifests mostly in Chromium browsers from what I can tell; I couldn't reproduce the error in Firefox. But it may be that Firefox handles the error more gracefully, returning something (inconsequential) from the server, or that the Service Worker doesn't intercept the "proxy:../..." request in Firefox, so ignores it. Your guess is as good as mine.

ikreymer commented 1 year ago

The proxy: prefix was intended to indicate that it needs to use the URL as a proxy, instead of loading from a WACZ/WARC file, which is what the sourceUrl was originally intended for. Seems like there's just a bug somewhere where it's not being interpreted correctly, possibly fixed in a new version of wabac.js. If so, just bumping to latest version may fix this. (If there's a way not to bake the service worker into the ZIM file, that would be even better). Is there a consistent repro for this currently? Loading https://library.kiwix.org/content/courses.lumenlearning.com_en_all_2021-03/ worked for me..

Jaifroid commented 1 year ago

@ikreymer The error seems to manifest after the Service Worker is installed and is controlling the origin, and it only affects the pre-landing page, i.e. the one that loads load.js. It also seems to affect Chromium browsers much more consistently (were you using Firefox?)

I have a consistent repro on Chromium (in my case, Edge Chromium). Ensure you are not in an InPrivate / Incognito session, then:

Visit https://library.kiwix.org/?lang=eng
Click on Boundless Courses (for me, this is currently the second square)
If the page shows correctly, browse to a second page and ensure you have not ended up browsing the original site, i.e. that you haven't ended up on courses.lumenlearning.com (this occasionally happens, e.g. when browsing incognito, and indicates the Service Worker is not controlling the page)
If the page shows "Sorry, the URL https://courses.lumenlearning.com/catalog/boundlesscourses is not in this archive", then you've repro'd.
Either way, go back to the landing page of this ZIM, open DevTools, go to Application -> Service Workers and Unregister the SW for lumenlearning (see screenshot).
Reload the page (Ctrl-R), and the landing page should show correctly, but it is not being controlled by any Service Worker.
Click on the "Go to Welcome Page" button:
Click on the Boundless Courses tile again. I consistently get:

Jaifroid commented 1 year ago

So, @rgaudin, we should try bumping wabac.js first to see if it handles the proxy: prefix before we try other things.

For the record, I have further information on the on-the-fly fix. However it can't be done in load.js, and changing that object doesn't allow the landing page to load. Instead, I changed this.sourceUrl in the line shown here in sw.js (pretty-printed version):

Changing the value to "../" (without proxy:) in this line fixes the site "permanently", because the retrieved source is cached for offline use. However, deleting the Service Worker and reloading without editing that variable causes the bug to return. To show this in action, reliably being reproduced, I made this video showing the effect of the on-the-fly fix (click to enlarge):

reproduce_and_fix_zimit_error

ikreymer commented 1 year ago

@Jaifroid thanks for the detailed repro steps, unfortunately, this just does not repro for me, on mac with Chrome Canary or any other version. Perhaps something's changed in the latest or on different platforms? The SW gets reinstalled on step 10 and everything loads correctly. The proxy: prefix is getting correctly removed before it gets to resolveHeaders, which is what should be happening.. very strange..

Jaifroid commented 1 year ago

And I've just reproduced on Firefox Developer Edition, but not with the above ZIM. I reproduced with https://library.kiwix.org/viewer#internet-encyclopedia-philosophy_en_all_2022-08 . Curiously the exact same URL works fine on Edge Chromium (which displays this error with the Lumen Courses ZIM).

There must be some race condition, probably an issue with async / await code, as it is usually race conditions that give such inconsistent and difficult to repro/debug results. We may have narrowed it down to the code that is intended to catch the proxy: prefix. When I open the lumen courses ZIM on a fresh (Guest) profile (not InPrivate) in Edge Chromium, the variable this.sourceUrl ALREADY contains "../" instead of "proxy:../", and the landing page loads correctly. Same browser, different profile.

I have also noticed this error manifesting, inconsistently, after browsing several Zimit-based ZIM archives on library.kiwix.org. Could there be interference from the different installed Service Workers, which all show as part of the same domain, albeit (hopefully) scoped to the ZIM's directory?
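To check which Service Workers are registered on the current origin and what scope each one controls, a small helper along these lines could be run from the DevTools console. It relies on the standard navigator.serviceWorker.getRegistrations() call; the function names are hypothetical, and the container is passed in as a parameter only so the sketch can also be exercised outside a browser.

```javascript
// Pure helper: reduce registrations to the two fields of interest.
function summarizeRegistrations(registrations) {
  return registrations.map((reg) => ({
    scope: reg.scope,
    // `active` is null while a worker is still installing or waiting.
    script: reg.active ? reg.active.scriptURL : null,
  }));
}

// `container` is normally navigator.serviceWorker.
async function listServiceWorkers(container) {
  const registrations = await container.getRegistrations();
  return summarizeRegistrations(registrations);
}

// In a browser DevTools console:
//   listServiceWorkers(navigator.serviceWorker).then(console.table);
```

Each ZIM served from the same host would be expected to show up as a separate row with its own scope, while any IndexedDB data they create is shared across the whole origin.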

Jaifroid commented 1 year ago

I know this doesn't get to the source of the problem, but an "empirical" solution might be to add a patch to the Service Worker. A simple line:

this.sourceUrl = this.sourceUrl.replace(/^proxy:/, '');

inserted between lines 10275 and 10276 (in pretty-printed version, obviously line numbers would be different in the source) would fix cases where the proxy: hasn't been removed, and would be completely harmless in cases where it has already been removed.
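A quick sketch of why that one-line patch is safe to apply unconditionally: the anchored regular expression only touches a leading proxy: and leaves every other value unchanged, so running it twice gives the same result as running it once. The wrapper function name is hypothetical.

```javascript
// Strip a leading "proxy:" marker if present; otherwise return the input as-is.
function stripProxyPrefix(sourceUrl) {
  return sourceUrl.replace(/^proxy:/, "");
}

console.log(stripProxyPrefix("proxy:../")); // prefix still present: removed
console.log(stripProxyPrefix("../"));       // prefix already removed: unchanged
```

Because of the ^ anchor, a proxy: occurring later in the string is not affected.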

Of course it would be better to fix the race condition (which could even be caused by differences in network access speeds). But unless anyone has a better idea...

Finally, @ikreymer you suggested that it would be better if the Service Worker were not included in the ZIM. However, that implies that each reader/client should have its own copy and its own implementation. There are pros and cons. In an ideal situation, I would see Zimit/WARC reading as something that should be built in to libzim (but it wouldn't be the JS implementation in that case).

ikreymer commented 1 year ago

I think I found the source of the issue, which is due to an unfortunate combination of a service worker per ZIM file, multiple versions, and multiple collections in IndexedDB, and difference between SW and IndexedDB scoping.

The latest version of zimit is using wabac.js 2.12.0 which made a number of changes (https://github.com/webrecorder/wabac.js/pull/68) which included changing where the proxy: prefix is removed.

wabac.js has a concept of collections, and the way we normally use it is so that one SW can control multiple paths. However, the zimit use case instead installs a new service worker at each path, and each one considers itself to be root. But while service workers are scoped by path, IndexedDB is scoped by origin.

The result is that a new IndexedDB entry, marked as root, is created for each ZIM file. Here's an example with two ZIM files, one with 2.12.0 (which keeps the proxy: prefix in the sourceUrl) and one for <2.12.0, which does not.

Were it not for that change, the configs would be identical, but what happens now is that each service worker under /internet-encyclopedia-philosophy_en_all_2022-08/ considers itself to be root, and loads both configs, and the service worker under /courses.lumenlearning.com_en_all_2021-03/ is also considered to be root, and loads both configs. Since they're both root, one config overrides the other, so whichever one gets loaded second is what ends up being used. As a result, when having a zim from old version and one from new version, if the wrong config is loaded, this error will occur.
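The override described above can be modelled with a few lines of JavaScript. This is a simplified simulation under stated assumptions, not wabac.js code: the record shape, the collection names and the loadRootNaively function are all hypothetical, and a plain array stands in for the origin-wide IndexedDB store.

```javascript
// Stand-in for the shared store: one record per ZIM, each flagged as root.
const storedConfigs = [
  { name: "zim-built-before-2.12.0", root: true, sourceUrl: "../" },
  { name: "zim-built-with-2.12.0", root: true, sourceUrl: "proxy:../" },
];

// Simplified model of the problem: every root record replaces the previous
// one, so the result depends only on the order in which records are loaded.
function loadRootNaively(configs) {
  let root = null;
  for (const config of configs) {
    if (config.root) root = config; // last one wins
  }
  return root;
}

const loadedInOrder = loadRootNaively(storedConfigs);
const loadedReversed = loadRootNaively([...storedConfigs].reverse());
console.log(loadedInOrder.sourceUrl, loadedReversed.sourceUrl);
```

Whichever Service Worker runs this, it may end up with the other ZIM's sourceUrl, which matches the inconsistent behaviour reported earlier in the thread.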

There's a few ways to fix this going forward, most probably making sure that only one root config is loaded, and checking the path.

I'm not sure what to do with existing zim files, though. Probably redoing them with latest wabac.js (after the fix) would be the main option.

As mentioned before, I think having the service worker in the ZIM is a mistake: it should be in the viewers that support this type of replay, so that the system can be updated by updating the viewer, not the ZIM. Otherwise, there is no way to fix something like this if an issue is discovered later, and recreating large ZIMs would be time-consuming. I would strongly advocate for moving to that setup (having wabac.js be part of the viewer, not the ZIM).

kelson42 commented 1 year ago

@ikreymer Thank you very much, if this is fixed in a future version of wabac.js/warc2zim, this is already great.

Jaifroid commented 1 year ago

@ikreymer That's good news! Your explanation is closely consistent with my experience, and experiences reported by others. When I have used a browser to load several different WARC-based ZIM archives from library.kiwix.org I was getting this error with some of the archives, but if I opened a new fresh (guest) browser profile to test a particular ZIM, the error disappeared.

I've now been able to reproduce this error on the Android app with local ZIM archives. I opened an "old" Zimit archive (an MDN Web Docs ZIM from 16th May), and then opened the latest Internet Encyclopedia of Philosophy ZIM (14th August). The latter now displays the exact same error, fully consistent with your explanation above:

Jaifroid commented 1 year ago

Regarding how to fix: the priority is probably to get some code into new ZIM archives that recognizes the situation and can empty the database and reinitialize. It would go a long way towards mitigating the problematic interaction between older ZIM archives and the latest ones. New scrapes can be scheduled monthly of some of the older archives, and so over time the problem will fade away.

Regarding the broader suggestion of having readers responsible for implementing the Replay system, I'd defer to @rgaudin, but just make these observations:

We only have one app capable of running the Replay system directly -- Android --, and presumably it wouldn't be too hard for it to include a local copy of the Replay files and use those instead of those found in the ZIM;
The Kiwix JS implementation is a functional but not perfect custom version based in the reader. My medium-term aim would be to find a way to use the Replay system fully. This would necessarily be an implementation based in the reader, patched to be called from our existing Service Worker, since we cannot technically run both Service Workers for the same scope;
Kiwix Serve aims to be a fully transparent server so that anyone with a browser can access ZIM archives as if they were accessing the original site (also essential for the hotspot functionality), so this one is a bit more problematic, but presumably not impossible.

rgaudin commented 1 year ago

Thank you both very much ; it's a relief to finally understand what's going on.

I think the realistic approach would be:

Fix wabac.js so it only loads one config
Update warc2zim+zimit to use that new version
Redo our collection of zimit-based files. We don't have many but we know some won't work.
Integrate this into the discussion about the next zimit iteration. Is it on the agenda for next hackathon?

kelson42 commented 1 year ago

@ikreymer Any timeline to fix wabac.js?

ikreymer commented 1 year ago

The wabac.js end of it should be fixed (see: https://github.com/webrecorder/wabac.js/pull/104): it removes the proxy: prefix for backwards compatibility, and it turns out it was simple to add a check to avoid overriding the root. I tested new ZIMs alongside old ZIMs that don't require proxy: and the issue seems to be resolved. This will need a new warc2zim with 2.15.0 and a new zimit, so I'm keeping this open until that's released.

kelson42 commented 1 year ago

@ikreymer Great!!! Thank you! @rgaudin Can you please fix the warc2zim part?

ikreymer commented 1 year ago

> There's a few ways to fix this going forward, most probably making sure that only one root config is loaded, and checking the path.

Turns out the fix was actually simpler than I had thought initially: the system was already set up to load the correct root config, passed via &root=<id>: https://github.com/openzim/warc2zim/blob/main/src/warc2zim/templates/load.js#L31

Unfortunately, this was being overridden when the configs were loaded, now fixed with this: https://github.com/webrecorder/wabac.js/pull/104/files#diff-d77b6d5c02f758febcac35d470f4394d5465220a9bcfb051c47fd580e14fdf0dR196

The missing check here is the main issue, combined with changing where the proxy: prefix was being removed.
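The two ingredients of the fix (honouring the root id and tolerating a leftover proxy: prefix) can be sketched as below. This is an illustration under stated assumptions and not the actual wabac.js patch: the record shape, the function names and the example URL are hypothetical; only the root query parameter and the proxy: prefix come from the discussion.

```javascript
// Read the expected root collection id from the Service Worker script URL.
function rootIdFromScriptUrl(scriptUrl) {
  return new URL(scriptUrl).searchParams.get("root");
}

// Only accept a root record whose name matches the requested id, and
// normalise its sourceUrl so older and newer records behave the same.
function loadRoot(configs, rootId) {
  for (const config of configs) {
    if (!config.root) continue;
    if (rootId && config.name !== rootId) continue; // the added check
    return { ...config, sourceUrl: config.sourceUrl.replace(/^proxy:/, "") };
  }
  return null;
}

// Hypothetical records, as they might coexist for one origin.
const configs = [
  { name: "old-zim", root: true, sourceUrl: "../" },
  { name: "new-zim", root: true, sourceUrl: "proxy:../" },
];

const rootId = rootIdFromScriptUrl(
  "https://example.org/content/new-zim/A/sw.js?replayPrefix=&root=new-zim"
);
console.log(loadRoot(configs, rootId));
```

With the id check in place, the order in which records are read no longer matters, since records belonging to other ZIMs are skipped.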

rgaudin commented 1 year ago

Updated warc2zim ; current :dev zimit now uses it. Let's make some ZIMs and test.

openzim / zimit

Works only on fresh browser #154