kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0

kiwix-serve indicates that the served item is marked "is_front" #1026

Open benoit74 opened 7 months ago

benoit74 commented 7 months ago

I apologize in advance for my limited libzim / libkiwix expertise, you might have to rephrase everything below.

In python-scraperlib, when adding items to the ZIM we can mark them as "is_front" so that they are used for suggestions / searches while other items are ignored (and if not passed, this property is also computed dynamically based on the content type).
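To make the "explicit flag, else computed from the content type" behaviour concrete, here is a minimal sketch. `resolve_is_front` is a hypothetical helper, not part of python-scraperlib, and the fallback rule (HTML counts as front, everything else does not) is an assumption about how the default is computed; the real libraries may differ.

```python
from typing import Optional


def resolve_is_front(mimetype: str, is_front: Optional[bool] = None) -> bool:
    """Return the effective front-article flag for an item.

    Assumption: when the scraper passes no explicit value, HTML content is
    treated as a front article and any other content type is not.
    """
    if is_front is not None:
        # An explicit choice made by the scraper always wins.
        return is_front
    # Drop parameters such as "; charset=utf-8" before comparing.
    return mimetype.split(";")[0].strip().lower() == "text/html"
```

With this rule, an image is only counted as a front item when the scraper asks for it explicitly, and an HTML fragment can be excluded by passing `is_front=False`.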

For the offspot/metrics project, we might need to detect whether the HTTP response of kiwix-serve is for an "asset" (is_front is False) or for a "page" (is_front is True), because we would like, for instance, to count the number of pages visited per period.

This would typically be possible if the is_front property is stored in the ZIM (not sure it is the case) and returned by libkiwix / kiwix-serve as a response header.

Is this or would this be possible?

benoit74 commented 7 months ago

Oh and we would probably also need the item title. Is it possible as well?

rgaudin commented 7 months ago

It's not; this is creator-only info.

rgaudin commented 7 months ago

Close it right away to make it look dramatic 😂

rgaudin commented 7 months ago

You may want to look at https://libzim.readthedocs.io/en/latest/

rgaudin commented 7 months ago

Actually the front (not really but similar, see libzim doc) articles can be found in the Listings

kelson42 commented 7 months ago

Usually logs don't save HTTP response headers. Do you plan to tweak the reverse proxy logging feature?

As a workaround, can we just keep a log entry only if the URL ends with ".html" or has no extension at all?
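A minimal sketch of that extension-based filter, assuming the log gives us the request URI as a string. `looks_like_page` is a hypothetical name, and the example paths in the usage note are made up.

```python
import posixpath
from urllib.parse import urlsplit


def looks_like_page(uri: str) -> bool:
    """Heuristic: a request is a page if its path ends in .html or has no extension."""
    # Ignore the query string and fragment, keep only the path.
    path = urlsplit(uri).path
    # splitext only looks at the last path segment.
    extension = posixpath.splitext(path)[1].lower()
    return extension in ("", ".html")
```

This accepts `/wikipedia_en/A/Paris` and `/wikipedia_en/A/index.html?x=1`, and rejects `/wikipedia_en/I/logo.PNG`. It also rejects a page path such as `/wikipedia_en/A/St._Louis`, because everything after the last dot is read as an extension, which shows the limit of the heuristic.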

benoit74 commented 7 months ago

Caddy logs do contain request and response headers by default (when using the proper structured format): https://caddyserver.com/docs/logging

No tweaking needed, this is the default / recommended configuration.
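As an illustration of processing such a line on the fly, here is a small sketch. The sample entry is made up, and the field names (`request`, `status`, `resp_headers`, with header values stored as lists) reflect my understanding of Caddy's JSON access log format; they should be checked against real logs. `response_header` is a hypothetical helper.

```python
import json
from typing import Optional

# Made-up log line, shaped like a structured (JSON) access log entry.
SAMPLE_LINE = json.dumps(
    {
        "request": {"method": "GET", "uri": "/wikipedia_en/A/Paris"},
        "status": 200,
        "resp_headers": {"Content-Type": ["text/html; charset=utf-8"]},
    }
)


def response_header(entry: dict, name: str) -> Optional[str]:
    """Return the first value of a response header, matched case-insensitively."""
    headers = entry.get("resp_headers") or {}
    for key, values in headers.items():
        if key.lower() == name.lower() and values:
            return values[0]
    return None


entry = json.loads(SAMPLE_LINE)
content_type = response_header(entry, "content-type")
```

Any extra header returned by kiwix-serve would be readable the same way, without changing the logging configuration.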

We will mostly not store these logs, only process them on the fly. I say "mostly" because, for simplicity and resilience reasons, we will in fact store Caddy log files, but we will keep only 2 rotating files of 1 MB each, and delete the "not current" file after 48h if it has not already been rotated again (these numbers could still change, this has not been heavily discussed so far, but this is the "spirit").

kelson42 commented 7 months ago

I guess this feature request can be implemented... but how will you handle other kinds of content (not based on kiwix-serve)? I wonder if we could not have something different and more generic to handle this properly.

rgaudin commented 7 months ago

I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps

kelson42 commented 7 months ago

> I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps

But you cannot set this specific header for all the content (Edupi, for example). How will the front/resource distinction be made then?

rgaudin commented 7 months ago

I thought the discussion was to use CT instead of is_front. I must have missed some comments

kelson42 commented 7 months ago

> I thought the discussion was to use CT instead of is_front. I must have missed some comments

I don't know what you mean by "CT"

benoit74 commented 7 months ago

There are basically two approaches, from my understanding, to detect "Package page" views for offspot/metrics, based on reverse proxy logs:

My question was first to check that there was not something already feasible / implemented without code changes, and then to investigate whether it is meaningful to make a change (or whether we "live with what we have", at least for now).

I already have an answer to the first part of my question, which is great, thank you.

rgaudin commented 7 months ago

Being versatile doesn't prevent us from having better support for ZIM, where we have control.

Regarding URLs in scrapers, you are well aware that we do what we can to make them human-readable.

Sure, we could also embed the entry title in headers, but sending it both in the body and in the headers is not appealing

kelson42 commented 7 months ago

@rgaudin @benoit74 Thx, this seems very clear to me now. Your proposal of a header seems OK to me, I don't really have a better idea.

Considering:

What would be a very concrete proposal of header name/value(s)?

rgaudin commented 7 months ago

Being a dinosaur, I'd use the X- prefix. MDN docs say:

> this convention was deprecated in June 2012 because of the inconveniences it caused when nonstandard fields became standard

There is no chance for ours to ever become standard so I think we can use X- prefix or not. Whatever you prefer

X-ZIM-Title: xxx
X-ZIM-FrontArticle: true/false

openZIM-Title: xxx
openZIM-FrontArticle: true/false

mgautierfr commented 7 months ago

> Actually the front (not really but similar, see libzim doc) articles can be found in the Listings

While this is technically true, it may not be the best way to get the information. Having the articles in a list (and getting the information from there) would mean that we search this list for every resource to know whether it is front or not. It roughly doubles the work (and time) needed to locate a resource (not including decompression). If this information is used only for (our) metrics/stats, I'm not sure it is worth it.

If we go this way, it would be better to move toward supporting generic headers (which could be used by zimit2). Depending on how we implement it, we would still have to do a second entry lookup, but it would at least be generic and not only for us.

It could also be merged with the generic metadata feature (partly explained, but never implemented, in https://github.com/openzim/libzim/issues/325).

find another alternative

  • this is the topic of this ticket

Another idea (relevant or not): make metrics ask the "zim file" whether the URL is a front article or not. When metrics detects (by heuristics) that a URL may be a front article, it opens the ZIM file itself and looks up the information in it.

rgaudin commented 7 months ago

> Make metrics ask the "zim file" whether the URL is a front article or not. When metrics detects (by heuristics) that a URL may be a front article, it opens the ZIM file itself and looks up the information in it.

I initially thought that was what @benoit74 wanted to do. We could export a list of front articles for every ZIM in a sorted list or another fast-access format that metrics could query.
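A rough sketch of what querying such an export could look like, assuming the list is loaded in memory as sorted entry paths. `is_front_article` and the example paths are hypothetical; nothing here is an existing libkiwix or metrics API.

```python
from bisect import bisect_left
from typing import Sequence


def is_front_article(sorted_paths: Sequence[str], path: str) -> bool:
    """Binary search for an exact match in a sorted list of front-article paths."""
    index = bisect_left(sorted_paths, path)
    return index < len(sorted_paths) and sorted_paths[index] == path


# Made-up export for one ZIM; it must be sorted for the lookup to be valid.
front_paths = sorted(["A/Paris", "A/Berlin", "A/Main_Page"])
```

Each lookup is logarithmic in the number of front articles, so the cost stays small even for a large ZIM, at the price of keeping one list per ZIM available to metrics.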

benoit74 commented 7 months ago

Header naming

I second the idea of keeping the X- prefix; these headers will never make it into an international standard

Regarding naming the header(s), we might also consider that:

Then I would propose to add only one header X-Offspot-Page-Viewed-Name which:

Who does what

I don't mind if we decide that it is preferable to not implement this in libkiwix and only export the list of front articles. In any case, the computation will be done somewhere.

More global insight

As mentioned in https://github.com/offspot/metrics/issues/33#issuecomment-1827502477, the more we dive into this issue, the more doubts I have about the real user need for this.

kelson42 commented 7 months ago

@mgautierfr Can we not just efficiently return the article title in an HTTP response header from the dirent (as the dirent is read anyway when you return the content)?

At this stage, this is, I believe, the way forward: always return the article title as an HTTP header, which is cheap and straightforward. If there is no title in the header, then metrics can consider that this is a resource and not a front article.
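On the metrics side, the rule "title header present means page, absent means resource" could be sketched as below. The header name `X-ZIM-Title` is only one of the names floated earlier in this thread, not a decision, and `page_title` / `count_page_views` are hypothetical helpers working on response headers stored as name to list-of-values mappings.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional, Sequence

# Hypothetical header name (lowercased for comparison); not settled in this thread.
TITLE_HEADER = "x-zim-title"

Headers = Mapping[str, Sequence[str]]


def page_title(resp_headers: Headers) -> Optional[str]:
    """Return the title carried by the response, or None for a resource."""
    for name, values in resp_headers.items():
        if name.lower() == TITLE_HEADER and values and values[0].strip():
            return values[0].strip()
    return None


def count_page_views(responses: Iterable[Headers]) -> Counter:
    """Count views per title, ignoring responses without a title header."""
    views: Counter = Counter()
    for resp_headers in responses:
        title = page_title(resp_headers)
        if title is not None:
            views[title] += 1
    return views
```

An empty header value is treated like a missing header here, so a resource served with a blank title is not counted as a page.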

mgautierfr commented 7 months ago

I like the idea, and it is pretty straightforward. But there is a catch here (which may or may not be a problem. If you don't care, I don't care either):

Given the way we store titles in the ZIM file, we cannot know whether an entry has no title (image, video) or whether its title is the same as its URL. So we have two ways to proceed:

kelson42 commented 7 months ago

> Given the way we store titles in the ZIM file, we cannot know whether an entry has no title (image, video) or whether its title is the same as its URL. So we have two ways to proceed

This is an optimisation hack for the title index search. We should be able to know that directly from the dirent. If that is not possible, we should make it possible.

benoit74 commented 7 months ago

I don't get why we should care about the situation where the title is the same as the URL. If something (scraper, whatever) decided to use the URL as a title, then this is the title. And I don't mind if we blindly return this title in an HTTP response header.

The consumer of this information will hence be able to apply its own logic to decide whether a title identical to the URL is acceptable or not.

Typically, we could imagine using this information as a heuristic in offspot/metrics to detect which requests are most probably a "front matter" and which ones aren't (even if I'm still not convinced that we won't have a scraper that sets a title equal to the file name, for instance, on some assets ... but this is something we could have control over).

mgautierfr commented 7 months ago

This is a space optimization hack. If the title is the same as the URL, we only store the URL in the dirent. So, at reading time, if the dirent contains only a URL, we don't know whether the dirent was created with a title identical to the URL or without a title (an empty title counts as no title).
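The ambiguity can be shown with a toy model. This is not libzim code: `ToyDirent`, `write_dirent` and `read_title` are hypothetical names that only mimic the rule described above (a title equal to the URL, or an empty one, is not stored).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyDirent:
    url: str
    stored_title: str  # empty when the title was missing, empty, or equal to the url


def write_dirent(url: str, title: Optional[str]) -> ToyDirent:
    """Apply the space optimization: do not store a title that adds nothing."""
    if not title or title == url:
        return ToyDirent(url, "")
    return ToyDirent(url, title)


def read_title(dirent: ToyDirent) -> str:
    """Fall back to the url when no title was stored."""
    return dirent.stored_title or dirent.url


# Three different inputs end up as the same stored dirent.
no_title = write_dirent("logo.png", None)
empty_title = write_dirent("logo.png", "")
same_title = write_dirent("logo.png", "logo.png")
```

Since `no_title`, `empty_title` and `same_title` are identical once written, a reader has nothing left to distinguish them, which is why a "title present" signal cannot be derived from the dirent alone in this model.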

kelson42 commented 2 months ago

I have opened a ticket at libzim to get this feature: https://github.com/openzim/libzim/issues/885