kiwix-serve indicates that the served item is marked "is_front"

benoit74 commented 7 months ago

I apologize in advance for my limited libzim / libkiwix expertise, you might have to rephrase everything below.

In python-scraperlib, when adding items to the ZIM we can mark them as "is_front" so that they are used for suggestions / searches while other items are ignored (and if not passed, this property is also computed dynamically based on the content type).

For the offspot/metrics project, we might need to detect whether a kiwix-serve HTTP response is for an "asset" (is_front is False) or for a "page" (is_front is True), because we would like, for instance, to count the number of pages visited per period.

This would typically be possible if the is_front property is stored in the ZIM (not sure it is the case) and returned by libkiwix / kiwix-serve as a response header.

Is this or would this be possible?

benoit74 commented 7 months ago

Oh and we would probably also need the item title. Is it possible as well?

rgaudin commented 7 months ago

It's not; this is creator-only info.

rgaudin commented 7 months ago

Close it right away to make it look dramatic 😂

rgaudin commented 7 months ago

You may want to look at https://libzim.readthedocs.io/en/latest/

rgaudin commented 7 months ago

Actually the front (not really but similar, see libzim doc) articles can be found in the Listings

kelson42 commented 7 months ago

Usually logs don't save HTTP response headers. Do you plan to tweak the reverse proxy logging feature?

As a workaround, could we just save log entries only when the URL ends with ".html" or has no extension at all?
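
A minimal sketch of this extension-based workaround, assuming only the request URL is available. The function name `looks_like_page` is hypothetical and not part of offspot/metrics or kiwix-serve.

```python
import posixpath
from urllib.parse import urlsplit


def looks_like_page(url: str) -> bool:
    """Guess whether a URL points to a page, based only on its extension."""
    # Ignore the query string and fragment, keep only the path
    path = urlsplit(url).path
    # ext is "" when there is no extension, otherwise it includes the dot
    _, ext = posixpath.splitext(path)
    return ext.lower() in ("", ".html")
```

A heuristic like this would misclassify any page whose path contains a dot (for example an article path such as `A/St._Louis`), so it can only be a rough approximation.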

benoit74 commented 7 months ago

Caddy logs do contain request and response headers by default (when using the proper structured format): https://caddyserver.com/docs/logging

No tweaking needed, this is the default / recommended configuration.

We will mostly not store these logs, only process them on the fly. For simplicity and resilience reasons, we will in fact store Caddy log files, but we will keep only 2 rotating files of 1 MB each, and delete the "not current" file after 48h if it has not already been rotated again (numbers could still change; this has not been heavily discussed so far, but this is the "spirit").

kelson42 commented 7 months ago

I guess this feature request can be implemented... but how will you do this for other kinds of content (not based on kiwix-serve)? I wonder if we could have something different and more generic to handle this properly.

rgaudin commented 7 months ago

I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps.

kelson42 commented 7 months ago

> I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps.

But you cannot set this specific header for all content (EduPi for example). How will the front/resource distinction be made then?

rgaudin commented 7 months ago

I thought the discussion was to use CT instead of is_front. I must have missed some comments

kelson42 commented 7 months ago

> I thought the discussion was to use CT instead of is_front. I must have missed some comments

I don't know what you mean by "CT".

benoit74 commented 7 months ago

There are basically two approaches, from my understanding, to detect "Package page" views for offspot/metrics, based on reverse proxy logs:

- rely on content-type to decide that an HTTP request/response is for a "Package page", i.e. something that we want to track in metrics in terms of number of views (for now)
  - this is what we have planned / implemented so far
  - pro:
    - versatile, could be implemented for all apps
  - cons:
    - we only have the URL, which is not really nice, or even not useful at all
      - e.g. for EduPi the URL is only a technical ID of the document retrieved, starting at 1 for the first document uploaded, ...
      - ZIM URLs are not always very explicit, ...
    - it is hard to decide what is a "Package page" based only on content-type (e.g. should we include PDFs? ePubs? Videos?)
    - for ZIMs, we have somehow already decided what is a "Package page" at ZIM creation time with the is_front property, and we do not benefit from it
- find another alternative
  - this is the topic of this ticket
  - pros:
    - reuse something which is already decided in scrapers (is_front), and benefit from their logic (will always be more specific / fine-tuned than just a content-type)
    - probably possible to also access the real "Title" of the "Package page", instead of just a technical URL
      - even insensitive to scraper changes in terms of URL structure in the ZIM
    - no business logic in metrics to decide what is a "Package page"
    - business logic is tied to the ZIM, so probably easier to evolve if needed
  - cons:
    - only works for ZIMs
    - obviously a change is needed in libkiwix / kiwix-serve ^^

My question was first to check there was not something already feasible / implemented without code changes. And then to investigate if it is meaningful to make a change (or if we "live with what we have", at least for now).

I already have an answer to the first part of my question, which is great, thank you.
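
For illustration, the content-type approach described above could look roughly like the sketch below. It assumes a JSON access-log line with `status` and `resp_headers` fields (header values as lists), which is how I understand Caddy's structured format. The set of "page" content types and the function name `is_page_view` are assumptions, since which types to include is still an open question.

```python
import json

# Assumption: content types that count as a "Package page"; whether PDFs,
# ePubs or videos belong here has not been decided.
PAGE_CONTENT_TYPES = {"text/html"}


def is_page_view(log_line: str) -> bool:
    """Return True if a JSON access-log line looks like a page view."""
    entry = json.loads(log_line)
    headers = entry.get("resp_headers", {})
    # Assumption: header values are logged as lists of strings
    values = headers.get("Content-Type", [])
    if not values:
        return False
    # Drop parameters such as "; charset=utf-8"
    media_type = values[0].split(";", 1)[0].strip().lower()
    return entry.get("status") == 200 and media_type in PAGE_CONTENT_TYPES
```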

rgaudin commented 7 months ago

Being versatile doesn't prevent us from having better support for ZIMs, where we have control.

Regarding URLs in scrapers, you are well aware that we have ways to make them human-readable.

Sure, we could also embed the entry title in headers, but sending it both in the body and in the headers is not appealing.

kelson42 commented 7 months ago

@rgaudin @benoit74 Thanks, this seems very clear to me now. Your header proposal seems OK to me; I don't really have a better idea.

Considering:

- https://thecodersblog.com/custom-header-naming-convention-http-practices-conventions
- we might be interested in using this system to transmit other page metadata

What would be a very concrete proposal of header name/value(s)?

rgaudin commented 7 months ago

Being a dinosaur, I'd use the X- prefix. MDN docs say:

> this convention was deprecated in June 2012 because of the inconveniences it caused when nonstandard fields became standard

There is no chance for ours to ever become standard, so I think we can use the X- prefix or not. Whatever you prefer:

X-ZIM-Title: xxx
X-ZIM-FrontArticle: true/false

openZIM-Title: xxx
openZIM-FrontArticle: true/false
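
As a rough sketch of how a consumer could read the first variant: the header names below are only the proposal from this comment (nothing is implemented in kiwix-serve), and `ZimEntryInfo` / `parse_zim_headers` are hypothetical names.

```python
from typing import Mapping, NamedTuple, Optional


class ZimEntryInfo(NamedTuple):
    title: Optional[str]
    is_front: bool


def parse_zim_headers(headers: Mapping[str, str]) -> ZimEntryInfo:
    """Extract the proposed X-ZIM-* headers from response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    title = lowered.get("x-zim-title")
    # A missing header is treated as "not a front article"
    front = lowered.get("x-zim-frontarticle", "false").strip().lower() == "true"
    return ZimEntryInfo(title=title, is_front=front)
```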

mgautierfr commented 7 months ago

> Actually the front (not really but similar, see libzim doc) articles can be found in the Listings

While this is technically true, it may not be the best way to get the information. Having the articles in a list (and getting the information from there) would mean that we do a search in this list for every resource to know whether it is front or not. It mostly doubles the work (and time) to locate a resource (not including decompression). If this information is used only for (our) metrics/stats, I'm not sure it is worth it.

If we go this way, it would be better to move towards supporting generic headers (which could be used by zimit2). Depending on how we implement it, we would still have to do a second entry lookup, but it would at least be generic and not only for us.

It may also be merged with the generic metadata feature (partly explained, but never implemented, in https://github.com/openzim/libzim/issues/325).

> find another alternative
>
> this is the topic of this ticket

Another idea (relevant or not): make metrics ask the ZIM file whether the URL is a front article or not. When metrics detects (by heuristics) that a URL may be a front article, it opens the ZIM file itself and searches for the information in it.

rgaudin commented 7 months ago

> Make metrics ask the "zim file" if the url is a front or not. When metric detects (by heuristics) that a url may be front article, it opens the zim file itself and searches for information in it.

I initially thought that was what @benoit74 wanted to do. We could export a list of front articles for every ZIM in a sorted list or another fast-access format that metrics could query.
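
A minimal sketch of the "sorted list" idea, assuming the front-article paths of a ZIM have already been exported by some other tool. The helper names `build_front_index` and `is_front` are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List


def build_front_index(front_paths: Iterable[str]) -> List[str]:
    """Sort and de-duplicate front-article paths once, at export time."""
    return sorted(set(front_paths))


def is_front(index: List[str], path: str) -> bool:
    """Binary search in the sorted list: O(log n) per lookup."""
    pos = bisect_left(index, path)
    return pos < len(index) and index[pos] == path
```

Metrics could then load one such index per ZIM and check each requested path without opening the ZIM file itself.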

benoit74 commented 7 months ago

Header naming

I second the idea of keeping the X- prefix; these headers will never make it into an international standard.

Regarding naming the header(s), we might also consider that:

- the need comes from offspot/metrics
- we would like to be able to track page views of other applications than kiwix-serve

Then I would propose to add only one header X-Offspot-Page-Viewed-Name which:

- will contain a user-friendly label of the viewed page name
- will only be set when a page has to be tracked (i.e. when it is a front matter for ZIMs, not for JS/CSS, and usually not for PDF/ePub/...)
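
With such a header, counting views in metrics could be as simple as the sketch below. It assumes JSON log lines where response headers appear under `resp_headers` as lists of values; the header does not exist yet and `count_page_views` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable

# Header proposed in this comment; not implemented anywhere yet
HEADER = "X-Offspot-Page-Viewed-Name"


def count_page_views(log_lines: Iterable[str]) -> Counter:
    """Count views per page label, ignoring responses without the header."""
    views: Counter = Counter()
    for line in log_lines:
        entry = json.loads(line)
        # Assumption: response headers are logged as {name: [values]}
        names = entry.get("resp_headers", {}).get(HEADER, [])
        if names:
            views[names[0]] += 1
    return views
```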

Who does what

I don't mind if we decide that it is preferable to not implement this in libkiwix and only export the list of front articles. In any case, the computation will be done somewhere.

More global insight

As mentioned in https://github.com/offspot/metrics/issues/33#issuecomment-1827502477, the more we dive into this issue, the more doubts I have about the real user need for this.

kelson42 commented 7 months ago

@mgautierfr Can we not just efficiently return the article title in an HTTP response header from the dirent (as the dirent is read anyway when you return the content)?

At this stage, I believe this is the way forward: just always return the article title as an HTTP header, cheap and straightforward. If there is no title in the header, then metrics can consider that this is a resource and not a front article.

mgautierfr commented 7 months ago

I like the idea, and it is pretty straightforward. But there is a catch here (which may or may not be a problem; if you don't care, I don't care either):

The way we store the title in the ZIM file, we cannot know whether an entry has no title (image, video) or whether its title is the same as its URL. So we have two ways to do it:

- Always set a title header (and set it to the URL for entries without a title)
- Set a title only if it is different from the URL (and so lose track of entries with title == URL)

kelson42 commented 7 months ago

> The way we store title in zim file, we cannot know if entry has not title (image, video) or if title is same as url. So we have two way to do

This is an optimisation hack for the title index search. We should be able to know that directly from the dirent. If that is not possible, we should make it possible.

benoit74 commented 7 months ago

I don't get why we should care about the situation where the title is the same as the URL. If something (scraper, whatever) decided to use the URL as a title, then this is the title. And I don't mind if we blindly return this title in an HTTP response header.

The consumer of this information will then be able to apply its own logic to decide whether a title identical to the URL is acceptable or not.

Typically, we could imagine using this information as a heuristic in offspot/metrics to detect which requests are most probably a "front matter" and which ones aren't (even if I'm still not convinced that we won't have scrapers which set a title equal to the file name on some assets, for instance ... but this is something we could have control over).

mgautierfr commented 7 months ago

This is a space optimization hack. If the title is the same as the URL, we only store the URL in the dirent. So, at reading time, if the dirent contains only a URL, we don't know whether the dirent was created with a title identical to the URL or without a title (an empty title counts as no title).
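
To make the ambiguity concrete, here is a toy model of the behaviour described above. It is only an illustration and not libzim's actual dirent layout; `ToyDirent` is a hypothetical class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDirent:
    url: str
    stored_title: str  # empty when the title was omitted to save space

    @classmethod
    def create(cls, url: str, title: Optional[str]) -> "ToyDirent":
        # Store the title only when it adds information beyond the URL
        if not title or title == url:
            return cls(url, "")
        return cls(url, title)

    @property
    def title(self) -> str:
        # Readers fall back to the URL when no title was stored
        return self.stored_title or self.url
```

In this model, an entry created without a title and an entry created with a title equal to its URL end up as identical records, so a reader cannot tell them apart.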

kelson42 commented 2 months ago

I have opened a ticket at libzim to get this feature: https://github.com/openzim/libzim/issues/885
