openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0
62 stars 22 forks source link

Update pylibzim to latest libzim changes #109

Closed rgaudin closed 3 years ago

rgaudin commented 3 years ago

It's been almost a month since the last sync between libzim and pylibzim.

A number of changes have occurred in the mean time on libzim: clusterSize, search, suggestion, snippets, indexes, etc)

I don't know what's relevant but the current code base was updated last for https://github.com/openzim/libzim/commit/02d451f5ea04c1ae87180e98e67efba324d0401e on April 26th

tim-moody commented 3 years ago

I take it this explains my failure to compile python-libzim:

159 | class Search | ^~ libzim/wrapper.cpp: At global scope: libzim/wrapper.cpp:1008:8: error: ‘search_iterator’ in namespace ‘zim’ does not name a type; did you mean ‘SearchIterator’? 1008 | zim::search_iterator __pyx_v_it;

and others

rgaudin commented 3 years ago

Hi @tim-moody, that's right. I'll fix that sometime but in the mean time, you can use the nightly from the CI or just use the latest release (libzim6). If all compatible nightlies are gone (we rotate files), let me know and I'll upload one somewhere

tim-moody commented 3 years ago

@rgaudin are you saying there is a nightly of python-libzim based on libzim7? I don't know where that is.

Also, is it true that kiwix-build does not build python-libzim?

rgaudin commented 3 years ago

http://download.openzim.org/nightly is based on libzim master which is libzim7 and yes, build happens here: https://github.com/openzim/python-libzim/tree/master/.github/workflows

If you want to use pylibzim, the best option is just to pip install it. It would be libzim6 which is our latest release. If you want libzim 7 version, follow the instructions from the workflow but as it downloads libzim binary from that nighty folder ; and because we only keep 30 files max, you have to use another version.

Now this ticket exists because libzim7's API has changes this month and it's possible even the oldest available nightly won't work with pylibzim. We are fixing pylibzim's usage of the API now.

rgaudin commented 3 years ago

@mgautierfr @maneeshpm I'm stuck trying to update the wrapper for the new search API.

here's what I have on Cython side:

wrapper.pyx

def get_estimated_search_results_count(self, query: str) -> int:
    cdef wrapper.ZimSearcher searcher = wrapper.ZimSearcher(dereference(self.c_archive))
    cdef wrapper.ZimQuery zim_query = wrapper.ZimQuery();
    zim_query.setQuery(query.encode('UTF-8'), False);
    cdef wrapper.ZimSearch search0 = searcher.search(zim_query);
    return search0.getEstimatedMatches()

As you now, every type must be defined for Cython so we have wrapper types for zim ones.

lib.h:

class ZimSearch: public zim::Search
{
  public:
    ZimSearch(zim::Search&& s) : zim::Search(s) {}
    int getEstimatedMatches() const
    { return zim::Search::getEstimatedMatches(); }
};

wrapper.pxd

cdef extern from "lib.h":
    cdef cppclass ZimSearch:
        ZimSearch()
        int getEstimatedMatches() except +

Now this doesn't compile with

error: call to implicitly-deleted copy constructor of 'zim::Search'
        ZimSearch(zim::Search&& s) : zim::Search(s) {}
                                     ^           ~
    /Users/reg/src/pylibzim/include/zim/search.h:162:9: note: copy constructor is implicitly deleted because 'Search' has a user-declared move constructor
            Search(Search&& s);
            ^

libzim's search.h has

class Search
{
    public:
        Search(Search&& s);
        Search& operator=(Search&& s);
        ~Search();

        /** Get a set of results for this search.
         *
         * @param start The begining of the range to get
         *              (offset of the first result).
         * @param end   The end of the range to get
         *              (offset of the result past the end of the range).
         */
        const SearchResultSet getResults(int start, int end) const;

        /** Get the number of estimated results for this search.
         *
         * As the name suggest, it is a estimation of the number of results.
         */
        int getEstimatedMatches() const;

    private: // methods
        Search(std::shared_ptr<InternalDataBase> p_internalDb, const Query& query);
        Xapian::Enquire& getEnquire() const;

    private: // data
         std::shared_ptr<InternalDataBase> mp_internalDb;
         mutable std::unique_ptr<Xapian::Enquire> mp_enquire;
         Query m_query;

  friend class Searcher;
};

I am not familiar with this move-constructor thing and don't know how to move on from here.

maneeshpm commented 3 years ago

@rgaudin The explanation of this behavior is more related to the lvalue and rvalue and I believe @mgautierfr is more qualified to comment on that.

In order to actually move the parameter we have to call std::move on it. So please try to change your constructor as :

class ZimSearch: public zim::Search
{
  public:
    ZimSearch(zim::Search&& s) : zim::Search(std::move(s)) {}
    int getEstimatedMatches() const
    { return zim::Search::getEstimatedMatches(); }
};

I hope it will compile fine then.

rgaudin commented 3 years ago

Thanks, unfortunately, Cython's still not happy

 C++ class must have a nullary constructor to be stack allocated
mgautierfr commented 3 years ago

The problem is that we cannot create a "default" zim::Search. By definition a zim::Search is associated to a "database" (zim::Searcher) and a query (zim::Query).

The solution is to store a pointer to a zim::Search instead of a zim::Search itself. This is what is made for Entry as zim::Entry has not default constructor neither.

Btw, why are you creating a intermediate ZimSearch ? Why not directly wrap zim::Search ?

rgaudin commented 3 years ago

Thanks, I got it working with

def get_estimated_search_results_count(self, query: str) -> int:
    cdef wrapper.ZimSearcher searcher = wrapper.ZimSearcher(dereference(self.c_archive))
    cdef wrapper.ZimQuery zim_query = wrapper.ZimQuery();
    zim_query.setQuery(query.encode('UTF-8'), False);
    return searcher.search(zim_query).getEstimatedMatches();
cdef extern from "zim/search.h" namespace "zim":
    cdef cppclass ZimSearch "zim::Search":
        int getEstimatedMatches()

Gonna check the tests and continue with the search results.

mgautierfr commented 3 years ago

Be careful that create a new zim::Search maybe time consuming. If you will reuse the Search it is better to store it somewhere.

zim::Search* m_search = nullptr;
[...]

m_search = new zim::Search(ZimSearcher.search(query));
rgaudin commented 3 years ago

OK, in that previous case it was not to be reused but I do need a dedicated instance of the SearchResultSet to iterate over the results

def suggest(self, query: str, start: int = 0, end: int = 10) -> Generator[str, None, None]:
    cdef wrapper.ZimSearcher searcher = wrapper.ZimSearcher(dereference(self.c_archive))
    cdef wrapper.ZimQuery zim_query = wrapper.ZimQuery();
    zim_query.setQuery(query.encode('UTF-8'), True);
    results = searcher.search(zim_query).getResults(start, end);
    cdef wrapper.SearchIterator it = results.begin();
    while it != results.end():
        yield it.getPath().decode('UTF-8')
        preincrement(it)
cdef extern from "zim/search_iterator.h" namespace "zim":
    cdef cppclass SearchIterator:
        SearchIterator()
        SearchIterator operator++()
        bint operator==(SearchIterator)
        bint operator!=(SearchIterator)
        string getPath()
        string getTitle()

cdef extern from "zim/search.h" namespace "zim":
    cdef cppclass SearchResultSet:
        ctypedef SearchIterator iterator;
        iterator begin()
        iterator end()
        int size()

cdef extern from "lib.h":
    cdef cppclass ZimSearcher:
        ZimSearcher()
        ZimSearcher(const ZimArchive zimfile)
        ZimSearcher(vector[const ZimArchive] zimfiles)

        Search search(ZimQuery query)

cdef extern from "zim/search.h" namespace "zim":
    cdef cppclass Search "zim::Search":
        SearchResultSet getResults(int start, int end)
        int getEstimatedMatches()

and I'm back to the same issue.

   libzim/wrapper.cpp:13954:37: error: no matching constructor for initialization of 'zim::SearchResultSet'
      new((void*)&(p->__pyx_v_results)) zim::SearchResultSet();
                                        ^
    /Users/reg/src/pylibzim/include/zim/search.h:214:5: note: candidate constructor not viable: requires single argument 'p_internalDb', but no arguments were provided
        SearchResultSet(std::shared_ptr<InternalDataBase> p_internalDb);
        ^
    /Users/reg/src/pylibzim/include/zim/search.h:198:7: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided

When I try what you proposed above, it says the object can't be converted to a Python object…

mgautierfr commented 3 years ago

You need to use a child class of zim::Search to return a pointer to a zim::SearchResultSet (as for ZimEntry which returns a pointer to a ZimItem)

// In lib.h
class ZimSearch : public zim::Search
{
    SearchResultSet* getResults(int start, int end)
    { return to_ptr<SearchResultSet>(zim::Search::getResults(start, end)); }
}

(Wrap ZimSearch instead of zim::Search in wrapper.pxd)

def suggest(self, query: str, start: int = 0, end: int = 10) -> Generator[str, None, None]:
    cdef wrapper.ZimSearcher searcher = wrapper.ZimSearcher(dereference(self.c_archive))
    cdef wrapper.ZimQuery zim_query = wrapper.ZimQuery();
    zim_query.setQuery(query.encode('UTF-8'), True);
    cdef wrapper.SearchResultSet* results = searcher.search(zim_query).getResults(start, end);
    cdef wrapper.SearchIterator it = results.begin();
    while it != results.end():
        yield it.getPath().decode('UTF-8')
        preincrement(it)
    delete results # See comment below.

But you have to be careful about memory allocation, deleting the results at the end of the function is not enough as the user can simply stop interate on the generator and so never reach the end of the function.

The best current way to ensure the SearchResultSet is correctly deleted is to wrap it in a python class and delete the pointer in the destructor of the class. This is what is done for Entry : the from_entry(wrapper.ZimEntry* entry) construct a instance on top of the cpp entry pointer and the __dealoc__ delete the given pointer. Then, all methods returning a Entry create the pointer and wrap it in the Entry (as a lot of method of PyArchive do.)

Here, as you never return the SearchResultSet, we could use a std::unique_ptr instead.


The global picture is :

[1] Entry is a both ends :

So zim::Entry is wrap twice : by ZimEntry as a producer of wrapped object, and by Entry as a produced object.