Consider more reliable and stable detection methods for distinguishing zimit classic and zimit2

kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript

https://www.kiwix.org/

GNU General Public License v3.0


Consider more reliable and stable detection methods for distinguishing zimit classic and zimit2 #1203

Open Jaifroid opened 9 months ago

Jaifroid commented 9 months ago

As suggested here, we could look for both the 'warc2zim' AND 'zimit' strings in the Scraper metadata. We currently look only for 'warc2zim', and that string is not guaranteed to be stable. Then, if '_sw:yes' is not in the tags, it's zimit2. If it is there, then there is a Service Worker, meaning it's zimit classic.

We currently rely on finding 'warc-headers' in the declared MIME type. But it's possible (if currently unlikely) that such headers could be reintroduced if they are needed in future versions of zimit2, so it would be good to have other options as outlined above.
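A minimal sketch of the combined check described above. The function name `detectZimitType`, its parameters, and its return values are hypothetical and not taken from the kiwix-js codebase. It assumes the caller has already read the Scraper metadata, the Tags metadata, and the declared MIME type as plain strings:

```javascript
// Hypothetical helper, not part of kiwix-js. All three inputs are assumed to
// be plain strings already read from the ZIM archive by the caller.
function detectZimitType(scraper, tags, mimeType) {
    var s = String(scraper || '').toLowerCase();
    // Proposed check: both 'warc2zim' AND 'zimit' appear in the Scraper string
    var isZimitFamily = s.indexOf('warc2zim') !== -1 && s.indexOf('zimit') !== -1;
    if (isZimitFamily) {
        // Tags are a semicolon-separated list, e.g. '_ftindex:yes;_sw:yes'
        var tagList = String(tags || '').split(';').map(function (t) {
            return t.trim();
        });
        // '_sw:yes' means a Service Worker is present, i.e. zimit classic
        return tagList.indexOf('_sw:yes') !== -1 ? 'zimit-classic' : 'zimit2';
    }
    // Fallback to the current method: 'warc-headers' in the declared MIME type.
    // Treating this as zimit classic is an assumption based on the text above.
    if (String(mimeType || '').indexOf('warc-headers') !== -1) {
        return 'zimit-classic';
    }
    return 'unknown';
}
```

The tag check only runs once the Scraper string has identified the archive as coming from the zimit family, so a stray '_sw:yes' tag on an unrelated ZIM would not trigger a zimit classic result.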

Jaifroid commented 9 months ago

https://github.com/kiwix/kiwix-js/commit/c528c94924086f61a8def8a352bed0bde78943d4 addresses the first part of this issue (adds test for 'zimit' in the scraper name).

kelson42 commented 9 months ago

@Jaifroid The recommended way of doing it is to rely on the _sw ZIM tag. Zimit2 should not need anything special at reader level AFAIK. @benoit74 I wonder whether this is not explicit in the documentation of warc2zim.

Jaifroid commented 9 months ago

Thanks, @kelson42. I agree, I just can't use that method yet because all the zimit2 ZIMs produced so far have '_sw:yes'. Until that's fixed as requested by rgaudin, I have to use the current method.

There is a specific requirement in the reader to detect links and PDFs that cannot be opened in the webview or iframe due to sandboxing / CSP. Kiwix Serve has already been patched via libkiwix, and other readers that use libkiwix will have the patch. The issue is that Wombat aggressively rewrites such links, so they can't be detected without either temporarily disabling Wombat or using other workarounds. I've patched both KJS readers.

benoit74 commented 8 months ago

Both changes have been done:

https://dev.library.kiwix.org/raw/solar.lowtechmagazine.com_en_all_2024-02/meta/Scraper : warc2zim 2.0.0-dev2 + zimit 2.0.0-dev1 + Browsertrix crawler 0.12.4
https://dev.library.kiwix.org/raw/solar.lowtechmagazine.com_en_all_2024-02/meta/Tags : _ftindex:yes;_category:other;lowtech

Not all test ZIMs have been rebuilt with this latest code change yet, but at least you have a few to test.
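For a quick check against the Tags value quoted above, something like the following could be used. `hasServiceWorkerTag` is a hypothetical name, and the sketch assumes the Tags metadata has already been read as a semicolon-separated string:

```javascript
// Hypothetical helper: splits a semicolon-separated Tags string and reports
// whether the '_sw:yes' flag is among the tags.
function hasServiceWorkerTag(tags) {
    return String(tags || '')
        .split(';')
        .map(function (t) { return t.trim(); })
        .indexOf('_sw:yes') !== -1;
}

// Tags value reported above for the rebuilt test ZIM
var rebuiltTags = '_ftindex:yes;_category:other;lowtech';
console.log(hasServiceWorkerTag(rebuiltTags)); // false
```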

Jaifroid commented 8 months ago

@benoit74 Excellent, thanks!