openzim / devdocs

devdocs.io to ZIM scraper
GNU General Public License v3.0

Full-text search is not working #13

Closed: benoit74 closed 1 month ago

benoit74 commented 1 month ago

With a ZIM from https://github.com/openzim/devdocs/pull/10, the full-text search is not working in kiwix-serve and kiwix-apple.

kiwix-serve message: [image]

Sample ZIM (compressed as ZIP so that GitHub is happy with this attachment): devdocs_lua_5.4.zim.zip

@rgaudin can you help diagnose what is wrong in the codebase or the ZIM? I've tested with pylibzim 3.4 and 3.5 (just in case 3.5 introduced a problem).

josephlewis42 commented 1 month ago

I think this one is because I've set the following in that PR:

        # Disable indexing because it won't be available in the JS frontend
        # and causes significant performance issues with rendered sidebars.
        creator.config_indexing(False)

I disabled it because:

  1. It was extremely slow for some resources. I suspect that's because of embedded images or because it's indexing the navbar for every page. (Is there a way to feed just the text you want indexed with a resource along with the full resource to prevent libzim from trying to extract the text itself? I don't think I saw anything like that when I was looking at the library.)
  2. DevDocs also doesn't have full-text search, and I didn't want to add a feature only to take it away if we went the SPA route.

I think the right way to approach the UI for this is to make the sidebar JS but leave each documentation page as a separate page. That way the built-in search will work and we won't have to muck around with rewriting links in the HTML. But, that also means this won't be an SPA like some of the newer docs. What do we think about that approach?

benoit74 commented 1 month ago

There is now a convenient way to pass custom content for indexing. With recent (4.0) zimscraperlib, you now have a new index_data property on StaticItem, or an arg on add_item_for, which can be used exactly for this. You should pass an IndexData with at least a title (used for suggestions) and content (used for full-text search). You can see it in use in the YouTube scraper: https://github.com/openzim/youtube/blob/aaeaefe5134599f8ff0d7a341cebb464f1039b03/scraper/src/youtube2zim/scraper.py#L1278-L1283
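To make the "feed just the text you want indexed" idea concrete, here is a minimal sketch of how the title and content could be prepared so that the sidebar never reaches the index. It uses only the Python standard library. `IndexPayload`, `build_index_payload` and the `SKIPPED_TAGS` set are hypothetical names for illustration, not part of zimscraperlib, and the sample HTML is invented.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags whose text should stay out of the index (an assumption; adjust per docset).
SKIPPED_TAGS = {"nav", "aside", "script", "style"}


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped regions.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


@dataclass
class IndexPayload:
    """Hypothetical stand-in for the title/content pair described above."""

    title: str
    content: str


def build_index_payload(title: str, html: str) -> IndexPayload:
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return IndexPayload(title=title, content=extractor.text())


# Invented sample page: a sidebar in <nav> plus the real documentation body.
page = (
    "<nav><a href='a'>Sidebar entry</a></nav>"
    "<main><h1>string.format</h1><p>Returns a formatted string.</p></main>"
)
payload = build_index_payload("string.format", page)
print(payload.content)
```

In the scraper, the resulting title and content would presumably be wrapped in zimscraperlib's IndexData and handed over through index_data as described above; the exact call signature should be checked against the library rather than taken from this sketch.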

I think many ZIM users are used to this full-text search, so I think it is important to make it work properly. Especially since this search also works across ZIMs with some readers, which might be particularly useful if users have multiple DevDocs ZIMs to search across.

I think that doing the search inside JS would be a waste, because it would run on the JS side, where we know performance is not as good as with a Xapian index. And it would be a bit sad not to go the SPA route (but this is very much personal taste). We can have an SPA and proper indexing of individual pages; this is exactly what has been done very recently for YouTube, where we have a Vue.js SPA. See e.g. https://library.kiwix.org/viewer#project-fuel_en/index.html . There is a kind of trick to create "fake" ZIM entries containing only a redirect so that they can be found in suggestions and full-text search, but this will probably be supported in libzim and readers at some point, so that we will not even need to create HTML content: the readers will redirect directly to the proper entry/URL. We can discuss it live if needed; it is a bit complex tbh.
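For illustration, a redirect-only "fake" entry could look roughly like the sketch below. This is not taken from the YouTube scraper: the `redirect_stub` helper, the meta-refresh approach, and the `index.html#/...` route format are all assumptions made for the example.

```python
from html import escape


def redirect_stub(title: str, spa_url: str) -> str:
    """Build a tiny HTML page whose only job is to forward to an SPA route.

    Hypothetical helper: the entry would exist so that suggestions and
    full-text search have something to point at; the reader then lands
    on the SPA URL.
    """
    safe_title = escape(title)
    safe_url = escape(spa_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<title>{safe_title}</title>\n"
        f'<meta http-equiv="refresh" content="0; url={safe_url}">\n'
        "</head>\n"
        # Fallback link in case the reader does not follow the refresh.
        f'<body><a href="{safe_url}">{safe_title}</a></body></html>\n'
    )


# Assumed route format, for the example only.
stub = redirect_stub("string.format", "../index.html#/lua/string.format")
print(stub)
```

Each stub would be added as its own ZIM entry, with the page's real text supplied separately for indexing, so the search result resolves to the SPA view instead of a standalone page.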