kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0

Implement OPDS "search" sorting #702

Open kelson42 opened 2 years ago

kelson42 commented 2 years ago

Currently there is no sorting at all. The results should be sorted by popularity, descending. This means that if we only filter for ZIM files in French, the search will return all the French content in descending order of popularity. If we search for French content with the text pattern "Wikipedia", it will return only the matching entries, ranked as Xapian ranks them, with popularity taken into account as the last criterion (we should get all the ZIM files with "wikipedia" in the title or description, sorted by popularity).

This will become necessary once we have introduced the popularity attribute, see #489.
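The ordering described above could be sketched roughly as follows. This is only an illustration, not libkiwix code: `BookInfo`, its fields and `filterAndSort` are hypothetical names, the `popularity` attribute does not exist yet (see #489), and `relevance` stands in for whatever score the text search would provide.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record; this is NOT libkiwix's real Book class.
struct BookInfo {
    std::string title;
    std::string language;
    double relevance;     // assumed text-match score; 0 when no pattern is given
    unsigned popularity;  // assumed attribute, not available yet (see #489)
};

// Keep only the books in `lang`, then order them by relevance (descending),
// using popularity (descending) as the last criterion.
std::vector<BookInfo> filterAndSort(std::vector<BookInfo> books,
                                    const std::string& lang)
{
    books.erase(std::remove_if(books.begin(), books.end(),
                    [&](const BookInfo& b) { return b.language != lang; }),
                books.end());
    std::stable_sort(books.begin(), books.end(),
        [](const BookInfo& a, const BookInfo& b) {
            if (a.relevance != b.relevance) return a.relevance > b.relevance;
            return a.popularity > b.popularity;
        });
    return books;
}
```

With no text pattern every book has the same relevance, so the result is ordered purely by popularity, which matches the "filter for French only" case.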

But there is another need. I wanted to be informed about the latest ZIM files published via OPDS, and I noticed that there is no sorting by date (descending)... and there is not even a way to filter by a creation date range.
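A date filter plus "newest first" ordering might look like the sketch below. It assumes dates are stored as ISO 8601 `YYYY-MM-DD` strings, which order chronologically under plain string comparison; `DatedBook` and `newestInRange` are hypothetical names, not an existing API.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical record; dates are assumed to be "YYYY-MM-DD" strings.
struct DatedBook {
    std::string title;
    std::string date;
};

// Return the books whose date lies in [from, to], newest first.
std::vector<DatedBook> newestInRange(const std::vector<DatedBook>& books,
                                     const std::string& from,
                                     const std::string& to)
{
    std::vector<DatedBook> result;
    std::copy_if(books.begin(), books.end(), std::back_inserter(result),
        [&](const DatedBook& b) { return from <= b.date && b.date <= to; });
    std::stable_sort(result.begin(), result.end(),
        [](const DatedBook& a, const DatedBook& b) { return a.date > b.date; });
    return result;
}
```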

veloman-yunkan commented 2 years ago

Before implementing this enhancement we need to make sure that it plays well with the OPDS spec. The spec mentions three fields that have to do with dates:

OPDS Catalog Entries must include an atom:updated element indicating when the OPDS Catalog Entry was last updated. A dc:issued element should be used to indicate the first publication date of the Publication and must not represent any date related to the OPDS Catalog Entry.

OPDS Catalog Entries may use atom:published to indicate when the OPDS Catalog Entry was first accessible.

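Thus:

* `atom:updated` is the date corresponding to the OPDS Entry (rather than the publication associated with it). In our case this should be the time when the book was added to the library during library loading or when the entry was updated during library reloading.

* `atom:published` is the earliest value that `atom:updated` had for this OPDS feed entry. In our case this should be the time when the book was added to the library during library loading.

* `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized or a book that was first published online and printed on paper later should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).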

While doing this small research I found out that in our OPDS streams we populate the atom:updated field with the book creation date (which is against the spec):

https://github.com/kiwix/libkiwix/blob/dc4f9a4939eef6e227fae81cb5fb46e527157b9d/src/opds_dumper.cpp#L73-L97

Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?

kelson42 commented 2 years ago

@veloman-yunkan Thank you for this research work, this is enlightening!

> Before implementing this enhancement we need to make sure that it plays well with the OPDS spec. The spec mentions three fields that have to do with dates:
>
> OPDS Catalog Entries must include an atom:updated element indicating when the OPDS Catalog Entry was last updated. A dc:issued element should be used to indicate the first publication date of the Publication and must not represent any date related to the OPDS Catalog Entry. OPDS Catalog Entries may use atom:published to indicate when the OPDS Catalog Entry was first accessible.
>
> Thus
>
> * `atom:updated` is the date corresponding to the OPDS Entry (rather than the publication associated with it). In our case this should be the time when the book was added to the library during library loading or when the entry was updated during library reloading.

If I understand properly, all these values will be reset whenever kiwix-serve restarts. If it works like this, I hardly see how it could be useful at all; actually it would be pretty misleading IMO.

The only scenario I can imagine is that the file is the same but some of its metadata has changed. This does not happen now, but it will once the CMS is in production. In such a scenario, libkiwix/kiwix-serve cannot know that something has changed (it has no persistent memory across kiwix-serve restarts). This should be handled in library.xml.

> * `atom:published` is the earliest value that `atom:updated` had for this OPDS feed entry. In our case this should be the time when the book was added to the library during library loading.

OK, but IMO this value can only be set by the CMS and not automatically handled by libkiwix/kiwix-serve.

> * `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized or a book that was first published online and printed on paper later should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).

I think it should be the time when the ZIM file is created, but your question is really pertinent and concrete to me. IMO we should track it in openzim/libzim or openzim/overview (we would need to update the ZIM specification).

> While doing this small research I found out that in our OPDS streams we populate the atom:updated field with the book creation date (which is against the spec):
>
> https://github.com/kiwix/libkiwix/blob/dc4f9a4939eef6e227fae81cb5fb46e527157b9d/src/opds_dumper.cpp#L73-L97
>
> Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?

Yes, this is wrong in my opinion too. It should be fixed.
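To make the proposed fix concrete, here is a minimal sketch of the date nodes an entry could carry once the ZIM creation date moves to `dc:issued`. This is not the code in opds_dumper.cpp: `entryDateNodes` and both parameters are hypothetical, the midnight UTC suffix is an assumption for a date-only value, and the feed is assumed to declare the `dc` namespace prefix.

```cpp
#include <string>

// Hypothetical helper, NOT the implementation in opds_dumper.cpp.
// `zimCreationDate`: ZIM file creation date as "YYYY-MM-DD".
// `entryUpdated`:    full timestamp of the last change to the catalog entry.
std::string entryDateNodes(const std::string& zimCreationDate,
                           const std::string& entryUpdated)
{
    // atom:updated describes the catalog entry, dc:issued the publication.
    return "<updated>" + entryUpdated + "</updated>\n"
           "<dc:issued>" + zimCreationDate + "T00:00:00Z</dc:issued>\n";
}
```

Where the value of `entryUpdated` should come from is exactly the open question discussed above, so the sketch leaves it to the caller.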

veloman-yunkan commented 2 years ago

> > Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?
>
> Yes, this is wrong in my opinion too. It should be fixed.

Should we fix it both in /catalog and /catalog/v2 OPDS feeds or only in the latter?

veloman-yunkan commented 2 years ago

> Should we fix it both in /catalog and /catalog/v2 OPDS feeds or only in the latter?

In #715 I added <dc:issued> to both legacy (/catalog) and current (/catalog/v2) OPDS feeds.

kelson42 commented 2 years ago
> > * `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized or a book that was first published online and printed on paper later should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).
>
> I think it should be the time when the ZIM file is created, but your question is really pertinent and concrete to me. IMO we should track it in openzim/libzim or openzim/overview (we would need to update the ZIM specification).

I have created a ticket to track this idea at https://github.com/openzim/overview/issues/9

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.