openzim / libzim

Reference implementation of the ZIM specification

Remove HTML parsing from our source repository #377

Open · kelson42 opened this issue 4 years ago

kelson42 commented 4 years ago

Since the very beginning - almost 15 years ago - we have been parsing HTML documents in the writer based on Xapian's omindex code.

Unfortunately, because omindex was and still is a command-line tool, I had to copy the interesting part of omindex's code and put it in our repo. I had to patch it slightly, but I don't believe any of those changes are really necessary (anymore?).

This is a bit of a problem because:

At the core of the problem, we have Xapian, which does not provide this feature within a library. There is a ticket on their side about it. The ticket is old and, surprisingly, it does not seem to be a priority. We might have to consider helping Xapian with this.

mgautierfr commented 4 years ago

Ideally, we shouldn't parse the HTML at all in libzim. A good text to index is not just the HTML content without the tags; it is a curated text with all the chrome (menu, header, footer, ...) removed. For more specific content we may remove even more (the references of a Wikipedia article, for example). libzim cannot do that, and it would be up to the user code to do it, as it knows the context.

The new writer API will allow the user code to provide its own content to index instead of letting libzim use the main HTML content. I will keep the HTML parsing for compatibility only, but it will eventually be removed.
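
A minimal sketch of what such user-provided index content looks like with the writer API that eventually shipped in libzim 7; class and method names follow the public headers (zim/writer/item.h, zim/writer/contentProvider.h), but the exact signatures may differ between versions:

```cpp
// Sketch only; check the headers of your libzim version for exact signatures.
#include <zim/writer/item.h>
#include <zim/writer/contentProvider.h>

#include <memory>
#include <string>
#include <tuple>

// Index data curated by the scraper: chrome (menus, footers, references)
// is assumed to have been stripped before it reaches libzim.
class CuratedIndexData : public zim::writer::IndexData {
  std::string title_, text_;
public:
  CuratedIndexData(std::string title, std::string text)
    : title_(std::move(title)), text_(std::move(text)) {}
  bool hasIndexData() const override { return !text_.empty(); }
  std::string getTitle() const override { return title_; }
  std::string getContent() const override { return text_; }
  std::string getKeywords() const override { return ""; }
  uint32_t getWordCount() const override { return 0; }
  GeoPosition getGeoPosition() const override { return std::make_tuple(false, 0.0, 0.0); }
};

class ArticleItem : public zim::writer::Item {
  std::string path_, title_, html_, indexText_;
public:
  ArticleItem(std::string path, std::string title,
              std::string html, std::string indexText)
    : path_(std::move(path)), title_(std::move(title)),
      html_(std::move(html)), indexText_(std::move(indexText)) {}
  std::string getPath() const override { return path_; }
  std::string getTitle() const override { return title_; }
  std::string getMimeType() const override { return "text/html"; }
  std::unique_ptr<zim::writer::ContentProvider> getContentProvider() const override {
    return std::make_unique<zim::writer::StringProvider>(html_);
  }
  // Overriding getIndexData() bypasses libzim's built-in HTML parsing.
  std::shared_ptr<zim::writer::IndexData> getIndexData() const override {
    return std::make_shared<CuratedIndexData>(title_, indexText_);
  }
};
```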

kelson42 commented 4 years ago

@mgautierfr I'm in favour of keeping libzim as lean as possible. We are slowly realising (with Zimit) that giving libzim users a bit more freedom over full-text index creation is necessary. Therefore, your last comment goes IMO in the right direction. But things still need to be easy to use, so we need to think twice about the alternative if the document parsing is no longer in libzim at some point. What is sure is that this is a different topic/ticket from this one, and I would like to see a ticket describing the problem we want to fix and a discussion about it before we change this in libzim.

mgautierfr commented 4 years ago

What is sure is that this is a different topic/ticket from this one

This is related, as I'm not sure we should investigate HTML parsing (finding a new tool, updating our code) if we plan to remove it and keep it only for compatibility.

I would like to see a ticket describing the problem we want to fix

openzim/mwoffliner#1725 seems good for that.

and a discussion about it before we change this in libzim.

I was planning to do this in the libzim_next API change. See openzim/libzim#364 and the "Small Writer Changes" part of openzim/libzim#325.

kelson42 commented 4 years ago

@mgautierfr openzim/libzim#325 does indeed talk about that clearly:

getIndexingContent: For now, we are indexing the same content as the displayed content (minus the HTML tags).
This is not always desirable. (We may not want to index the sources/references/external links of a Wikipedia article, or the "related" questions on Stack Overflow...)
But we cannot do this on the libzim side. The "selection" of the text must be done by the scraper, which knows the context.
This would potentially also allow us to index content other than HTML (such as images or videos), since we dissociate the content from what is indexed.

But there is no clear explanation of either the problem or the solution. The same goes for openzim/libzim#364, where it is not clear at all whether getIndexData is optional or mandatory. I have no problem with having it as optional, that would be great. But I disagree with removing it without an easy-to-use alternative.

Therefore, I first want a clear description of the problem and the solution, and an agreement on this topic, if you want to follow that path.

mgautierfr commented 4 years ago

How is "For now, we are indexing the same content as the displayed content (minus the HTML tags). This is not always desirable. (We may not want to index the sources/references/external links of a Wikipedia article, or the "related" questions on Stack Overflow...) But we cannot do this on the libzim side." (or openzim/mwoffliner#1725) not a clear explanation of the problem?

But I disagree with removing it without an easy-to-use alternative.

As I said, I will keep the HTML parsing for compatibility. It will be the default implementation of getIndexData. Users will be free to re-implement it.

kelson42 commented 2 years ago

Now, with libzim 7, we can specify content to index which is different from the content of the article. But this does not address the primary purpose of this ticket.
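
A short usage sketch, assuming the hypothetical ArticleItem from the sketch above; the Creator method names (configIndexing, startZimCreation, addItem, finishZimCreation) follow the libzim 7 public API, while the paths, language code and content are placeholders:

```cpp
#include <zim/writer/creator.h>

#include <memory>

int main() {
  zim::writer::Creator creator;
  creator.configIndexing(true, "eng");        // enable full-text indexing
  creator.startZimCreation("example.zim");
  creator.addItem(std::make_shared<ArticleItem>(
      "A/Article",                            // path
      "Article",                              // title
      "<html><body>Full page</body></html>",  // displayed content
      "curated text to index"));              // what actually gets indexed
  creator.finishZimCreation();
}
```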

kelson42 commented 1 year ago

The Xapian team has moved forward on their side, and even if we still don't have a libomindex, the code structure and features of omindex have evolved into something modular.

The omindex indexer is not only about HTML; it actually has the ambition to support (and already does support) many different formats. To achieve that, the Xapian team wants to rely as much as possible on external libraries. To a large extent, this is a problem similar to the one we have in libzim, just at another level.

Besides resolving this ticket, relying on an external libomindex would give us native support for all the file formats supported by omindex, PDF first of all (see #289 and https://github.com/openzim/gutenberg/issues/95). The overall goal is not to provide everything here but, as explained by @mgautierfr, to have a baseline indexing feature so that libzim users don't have to (but still can) implement their own document parser to benefit from full-text search.

One important point is that we need to retrieve the tokens from libomindex, but the indexing itself we want to do ourselves. So it is only the parsing part we are interested in, primarily for HTML.

I would now like to assess whether we can move forward on this, which concretely means (1) sponsoring the creation of a libomindex and (2) once created, using it.

ojwb commented 1 year ago

The omindex indexer is not only about HTML; it actually has the ambition to support (and already does support) many different formats. To achieve that, the Xapian team wants to rely as much as possible on external libraries.

Incidentally, there's now support for EPUB (using libgepub which is maintained by GNOME: https://gitlab.gnome.org/GNOME/libgepub).

One important point is that we need to retrieve the tokens from libomindex, but the indexing itself we want to do ourselves. So it is only the parsing part we are interested in, primarily for HTML.

Not quite sure what you mean by "retrieve the tokens", but I think you're wanting to get the text that's extracted rather than having it indexed for you, right? So more of a "libomextracttext" rather than "libomindex".

The current internal interface for the worker subprocess extractors is that you pass in a filename (I understand you'd want to be able to pass a buffer instead) and a MIME content-type string, and get out separate text strings for the document body, title, keywords, author (or From for email), to, cc, bcc, message-id, an integer for the number of pages, and a time_t for the created timestamp (for most formats only a subset of these fields is available). You can just ignore some or all of the metadata if you aren't interested, though if so maybe we should consider telling the worker subprocess so it can optimise and avoid the overhead of extracting and sending data that's going to be ignored.
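
To make that interface concrete, a purely hypothetical sketch of what a "libomextracttext"-style call could return; none of these names exist in Xapian today, they simply mirror the fields listed above:

```cpp
// Hypothetical only: mirrors the fields of the current worker-subprocess
// interface described above; no such library API exists yet.
#include <ctime>
#include <string>

struct ExtractedDoc {
  std::string body;         // main document text, stripped of markup
  std::string title;
  std::string keywords;
  std::string author;       // "From" for email
  std::string to, cc, bcc;  // email-only fields
  std::string message_id;   // email-only field
  int pages = 0;            // page count, where the format has pages
  std::time_t created = 0;  // creation timestamp, where available
};

// Today a filename plus a MIME content-type string goes to a worker
// subprocess; a library form would hand the extracted fields back to the
// caller instead of indexing them.
ExtractedDoc extract_text(const std::string& filename,
                          const std::string& content_type);
```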

The HTML parser has probably evolved a bit since you took a copy, but it can be passed a buffer to give a subset of those metadata items. If you're using a custom subclass, it still has the same basic approach of calling a method for each opening tag, closing tag, and the content between tags.
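
An illustrative subclass in that spirit; the real class is Xapian's HTML parser as copied into libzim, and the method names below are assumptions for illustration, not the exact Xapian API:

```cpp
// Illustrative only: the real parser drives callbacks like these while
// walking the HTML; method names are assumed, not copied from Xapian.
#include <string>

class TextOnlyHandler /* : public HtmlParser in the real code */ {
  std::string dump_;       // accumulated indexable text
  bool skipping_ = false;  // inside <script>/<style>
public:
  // Called with the text found between tags.
  void process_text(const std::string& text) {
    if (skipping_) return;
    if (!dump_.empty()) dump_ += ' ';
    dump_ += text;
  }
  // Called for each opening tag.
  void opening_tag(const std::string& tag) {
    if (tag == "script" || tag == "style") skipping_ = true;
  }
  // Called for each closing tag.
  void closing_tag(const std::string& tag) {
    if (tag == "script" || tag == "style") skipping_ = false;
  }
  const std::string& text() const { return dump_; }
};
```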

kelson42 commented 1 year ago

One important point is that we need to retrieve the tokens from libomindex, but the indexing itself we want to do ourselves. So it is only the parsing part we are interested in, primarily for HTML.

Not quite sure what you mean by "retrieve the tokens", but I think you're wanting to get the text that's extracted rather than having it indexed for you, right? So more of a "libomextracttext" rather than "libomindex".

Yes, the text usable directly by the indexer (without HTML tags, for example).

ojwb commented 1 year ago

I've been looking at what needs doing, and have a few further questions.

I understand HTML is your primary format of interest, and both PDF and EPUB have been mentioned too. Are there other formats that are of enough interest to you that I should be aware of them at this point?

(The formats omindex already has workers for should generally just work, but there are some formats that are extracted by running an external program (e.g. catdvi for TeX DVI files) and some currently have hard-coded handling in omindex (e.g. PostScript). If one of these is important to you I need to look at moving that handling into a worker.)

You want to be able to pass the data in a buffer rather than in a file. Would it work for you if that buffer was required to be allocated by "libomindex"? If the library does the allocation it can be shared with the worker via mmap() or similar and we can avoid the need to copy the data at all, whereas a caller-allocated buffer would require copying. (To achieve zero-copy here we also need a way to pass that buffer to the library - for PDFs poppler can just be passed a buffer, but for EPUBs it's not so simple - libgepub can extract from a seekable fd and I have some ideas to try for that, but the fallback solution would be to internally write the buffer to a temporary file, which we'll need to do for some formats on some platforms anyway.)
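
Purely hypothetical as well, but the library-allocated buffer idea could look roughly like this from the caller's side (ExtractedDoc as in the earlier sketch; nothing here exists yet):

```cpp
// Hypothetical zero-copy variant: the library allocates the buffer (e.g. over
// a shared mapping) so the worker subprocess can read it without a copy.
#include <cstddef>
#include <string>

class ExtractBuffer {
public:
  static ExtractBuffer allocate(std::size_t size);  // library-owned allocation
  char* data();                   // caller writes the document bytes here
  std::size_t size() const;
};

// Falls back internally to a temporary file for formats (or platforms) that
// cannot consume an in-memory buffer directly.
ExtractedDoc extract_text(const ExtractBuffer& doc,
                          const std::string& content_type);
```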

Are you interested in metadata (title, author, etc.), and if so, which items? We're limited in some cases by what the library we're using supports, but knowing what's of interest means I can try to ensure those are extracted when possible.