Closed rgaudin closed 2 years ago
Here is what is done:
FRONT_ARTICLE and titleListingV1
Articles are put in titleListingV1 only if the FRONT_ARTICLE hint is set to 1 (true). The mimetype is not used at all.
Xapian indexing
Articles are "title indexed" if getIndexData returns something, it actually has index data (indexData->hasIndexData() returns true) and the title (indexData->getTitle()) is not empty.
As we use the default index data, we index the article if it is HTML and has a title. We don't care about FRONT_ARTICLE here.
Suggestion
If there is a Xapian title index (always true for new ZIMs created with a libzim compiled with Xapian), suggestion uses the Xapian index and not titleListingV1. (Other implementations may do it differently; kiwix-js never uses Xapian.)
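To make the three rules above concrete, here is a small pure-Python model of them. It is only a sketch of the behavior as described in this comment, not libzim code: Entry, in_title_listing_v1, in_xapian_title_index and suggestion_candidates are hypothetical names made up for the illustration, and an unset hint is treated as "not in the listing" because the comment says only a hint set to 1 counts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for a ZIM entry; not a libzim type."""
    path: str
    title: str
    mimetype: str
    front_article: Optional[bool] = None  # None: hint not set


def in_title_listing_v1(entry: Entry) -> bool:
    # Only the hint counts; the mimetype is not consulted.
    return entry.front_article is True


def in_xapian_title_index(entry: Entry) -> bool:
    # Default index data: HTML with a non-empty title, whatever the hint says.
    return entry.mimetype.startswith("text/html") and bool(entry.title)


def suggestion_candidates(entries: List[Entry], has_title_index: bool) -> List[str]:
    # Suggestions use the Xapian title index when there is one,
    # and titleListingV1 otherwise.
    keep = in_xapian_title_index if has_title_index else in_title_listing_v1
    return [e.path for e in entries if keep(e)]


entries = [
    Entry("front", "Front page", "text/html", front_article=True),
    Entry("hidden", "Hidden page", "text/html", front_article=False),
    Entry("style", "Style", "text/css", front_article=True),
]
print(suggestion_candidates(entries, has_title_index=True))   # ['front', 'hidden']
print(suggestion_candidates(entries, has_title_index=False))  # ['front', 'style']
```

In this model, an HTML entry explicitly marked FRONT_ARTICLE=False still shows up in suggestions as soon as a Xapian title index exists.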
So the behavior of your test is consistent with what is done. FRONT_ARTICLE is correctly used and we are indexing (and retrieving) all HTML content.
However, we may want to change the behavior:
OK, glad there is no bug but the doc is either incorrect or misleading.
FRONT_ARTICLE
mark entry (item or redirection) as main article for the reader (typically a html page in opposition to a resource file as css, js, …). Random and suggestion feature will search only in entries marked as FRONT_ARTICLE. If no entry are marked as FRONT_ARTICLE, all entries will be used.
I am worried about two things:
The problem is, FRONT_ARTICLE is pretty expressive, and while leaving it unset and defaulting to whatever implementation we want is OK, setting it should settle the behavior: True, it's a front article, I want it to show up in suggestions. False, it shouldn't.
Now I understand the problem with search. We are using the same FRONT_ARTICLE concept to separate the entries that are exposed from those that are not. So, following this assertion, suggestion and search should follow the same criteria (assuming data is indexed).
Sketch of expected behavior:
# at creation time
if FRONT_ARTICLE is None:
    FRONT_ARTICLE = guess_from_mimetype()
if FRONT_ARTICLE:
    add_to_titleListing()
    if with_xapian:
        add_to_xapian_title_index()
        if item.has_index_data:
            add_to_xapian_content_index()

# suggest()
if has_xapian_title_index():
    find_entries_in_index()
else:
    find_entries_in_listing()

# search()
if has_xapian_index():
    find_entries_in_index_matching()
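For discussion, the sketch can be turned into a runnable toy model. This is only an illustration under assumptions: Item and Archive are hypothetical classes (not the libzim or scraperlib API), the mimetype guess is reduced to "HTML means front article", both Xapian indexes are assumed to be fed only for front articles, and matching is a plain case-insensitive substring test on the title rather than real Xapian querying.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """Hypothetical item; fields mirror the names used in the sketch."""
    path: str
    title: str
    mimetype: str
    front_article: Optional[bool] = None  # None: hint not set
    has_index_data: bool = False


class Archive:
    """Toy archive: plain lists stand in for the listing and the indexes."""

    def __init__(self, with_xapian: bool = True) -> None:
        self.with_xapian = with_xapian
        self.title_listing: List[Item] = []
        self.title_index: List[Item] = []
        self.content_index: List[Item] = []

    def add(self, item: Item) -> None:
        front = item.front_article
        if front is None:
            # Assumed default: guess from the mimetype only when the hint is unset.
            front = item.mimetype.startswith("text/html")
        if not front:
            return  # an explicit False (or a non-HTML default) is never exposed
        self.title_listing.append(item)
        if self.with_xapian:
            self.title_index.append(item)
            if item.has_index_data:
                self.content_index.append(item)

    def suggest(self, text: str) -> List[str]:
        # Title index when available, listing otherwise; same entries either way.
        source = self.title_index if self.with_xapian else self.title_listing
        return [i.path for i in source if text.lower() in i.title.lower()]

    def search(self, text: str) -> List[str]:
        # Without Xapian the content index stays empty, so search finds nothing.
        return [i.path for i in self.content_index if text.lower() in i.title.lower()]


archive = Archive(with_xapian=True)
for item in [
    Item("explicit-true", "Page one", "text/css", front_article=True),
    Item("default-html", "Page two", "text/html", has_index_data=True),
    Item("explicit-false", "Page three", "text/html", front_article=False, has_index_data=True),
    Item("default-css", "Page four", "text/css"),
]:
    archive.add(item)

print(archive.suggest("page"))  # ['explicit-true', 'default-html']
print(archive.search("page"))   # ['default-html']
```

With this model, an explicit hint always wins over the mimetype, and suggestion and search only ever see entries that passed the FRONT_ARTICLE filter.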
What do you think?
Seems indeed a bit hard to understand and hard to know how to simplify :) Should we keep this for the hackathon?
Yes, that's a good idea. Adding it to the Wiki and adapting scraperlib tests to current behavior.
We need first to implement openzim/libzim#642 to then be able to check if we could close this ticket.
And #92 as well
https://github.com/openzim/libzim/issues/642 has been implemented. What should we do with this ticket?
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@mgautierfr @rgaudin I think (if I have understood correctly) that this ticket is super straightforward. Can we go forward? Who? Should we reassign it to @mgautierfr?
Tested above code with current codebase and got the result that past-me said was expected.
It seems that while get_hints() is properly called by libzim, its result is either malformed or not used and we always fall back to the mimetype-based default.
Output:
We should have three results but the expected ones should be:
f-art1 because it specifically sets FRONT_ARTICLE=True
f-art2 because it uses the default and has a text/html mimetype
f-art4 because it specifically sets FRONT_ARTICLE=True