papis / papis

Powerful and highly extensible command-line based document and bibliography manager.
http://papis.readthedocs.io/en/latest/
GNU General Public License v3.0

Update database model to allow for whoosh backend #31

Closed kskyten closed 6 years ago

kskyten commented 6 years ago

Using Whoosh would enable a more powerful query language and probably make the queries more performant. How is the querying done currently and what would need to be done to add whoosh?

alejandrogallo commented 6 years ago

Sorry for the delay. I think it is a very good idea. I'll probably give it a try over Christmas.

alejandrogallo commented 6 years ago

Hi everyone again,

some news about what I've been working on to address this issue.

All of this is in the database branch.

Database module

There is a new papis module called database. The previous way of caching documents was a purely ad hoc solution, since papis was not originally intended to be used with a database, so some work was needed to actually adopt a database model. The result might still be quite crude, which is why I'd like to discuss it with you guys.

From now on, any interaction between papis and documents should go through a database object that manages the documents.

There are now two databases in place, both exposing the same API to the rest of the papis code. Logically, the common API is:

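import papis.config

class Database:
    """Interface shared by the database backends (signatures only)."""
    def __init__(self, library=papis.config.get_lib()): ...  # library this database manages
    def get_lib(self): ...                      # library of this database
    def get_dir(self): ...                      # directory of that library
    def match(self, document, query_string): ...  # match a document against a query
    def clear(self): ...                        # clear the database
    def add(self, document): ...                # add a document
    def update(self, document): ...             # update an existing document
    def delete(self, document): ...             # delete a document
    def query(self, query_string): ...          # return the documents for a query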
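
To make the interface above a bit more concrete, here is a minimal sketch of what a backend could look like. It is not the code in the database branch: the class name InMemoryDatabase, the use of plain dicts as documents, the "ref" key used to identify a document, and the naive substring matching are all assumptions made for illustration, and get_dir is left out for brevity.

```python
# Hypothetical in-memory backend following the common API sketched above.
# Documents are plain dicts identified by a "ref" key; both choices are
# assumptions for this example, not how papis stores documents.
class InMemoryDatabase:
    def __init__(self, library="papers"):
        self.library = library
        self.documents = []

    def get_lib(self):
        return self.library

    def match(self, document, query_string):
        # Naive matching: case-insensitive substring search over all values.
        needle = query_string.lower()
        return any(needle in str(value).lower() for value in document.values())

    def clear(self):
        self.documents = []

    def add(self, document):
        self.documents.append(document)

    def update(self, document):
        # Replace the stored entry that shares the same "ref".
        self.delete(document)
        self.add(document)

    def delete(self, document):
        self.documents = [d for d in self.documents if d["ref"] != document["ref"]]

    def query(self, query_string):
        return [d for d in self.documents if self.match(d, query_string)]


db = InMemoryDatabase()
db.add({"ref": "einstein1905", "author": "Einstein", "year": 1905})
db.add({"ref": "heisenberg1927", "author": "Heisenberg", "year": 1927})
print([d["ref"] for d in db.query("einstein")])  # ['einstein1905']
```

The point of the common API is that the rest of the code only ever calls add, update, delete and query, so a backend like this one could be swapped for a faster one without touching the callers.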

I have gathered all the previous functions and put them inside the papis.database.cache.Database database.

There is also an implementation using whoosh (which is much, much faster) in papis.database.whoosh.Database, which can be selected (on a per-library basis too, of course) through the option database-backend = whoosh.
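
As a rough illustration of the per-library selection, the sketch below parses an ini-style configuration with Python's configparser and looks up the backend for each library. Only the option name database-backend and the value whoosh come from this thread; the library names, the directories, the helper backend_for and the fallback value "cache" are assumptions made for the example.

```python
import configparser

# Hypothetical configuration: library names and directories are made up.
CONFIG_TEXT = """
[papers]
dir = ~/Documents/papers
database-backend = whoosh

[books]
dir = ~/Documents/books
"""


def backend_for(config, library, default="cache"):
    # Assumption: a library that does not set the option falls back to
    # the cache-based backend, called "cache" here as a placeholder name.
    return config.get(library, "database-backend", fallback=default)


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

print(backend_for(config, "papers"))  # whoosh
print(backend_for(config, "books"))   # cache
```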

Only rudimentary usage of whoosh is in place right now, which is why I'm asking whether someone is interested in taking a look, trying it out, and maybe learning more about whoosh and improving it. It might be a good idea to include whoosh in the next version.

Open questions

  1. Whoosh has a far more powerful query language, which supports AND and OR operators, e.g.

    papis open author:einstein OR author:heisenberg

    etc. Whoosh really has a lot of features. The question is: should we use this query language everywhere a query language is used in papis?

  2. Should we kill the papis cache database? Or leave it in place for trivial and small libraries, or for when whoosh is not available? (Whoosh is pure Python, though, which is very nice, and installing it is painless.) Maybe using whoosh for everything and killing the rest is the way to go, which would also simplify the code.

  3. Right now there are two main stages in the querying. First there is the database query, which in the case of whoosh looks like

    papis open 'author:einstein OR year:1923 AND title:physik'

and then, once the database returns the documents through the query method, the papis picker lets the user pick among them. This means the picker takes an input of its own: with rofi, the picking is done by rofi; with papis.pick (curses), it is done through the config option match-format and fuzzy matching; with dmenu, through dmenu's fuzzy matching; and so on.

So we have two things, querying and picking. Is this confusing? Will users be able to tell the two apart? This is very relevant for the web application, for instance, @PatWie.
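
To make the distinction concrete, here is a toy sketch of the two stages. It uses neither whoosh nor any real picker: the functions query and pick, the restriction to field:value terms joined by OR, the subsequence-based fuzzy match and the sample documents are all assumptions for illustration. The match_format argument only mimics the idea behind the match-format option.

```python
# Made-up sample documents, for illustration only.
DOCUMENTS = [
    {"author": "Einstein", "year": "1905", "title": "Toy paper on light"},
    {"author": "Einstein", "year": "1923", "title": "Toy paper on gravity"},
    {"author": "Heisenberg", "year": "1927", "title": "Toy paper on uncertainty"},
    {"author": "Bohr", "year": "1913", "title": "Toy paper on atoms"},
]


def query(documents, query_string):
    """Stage 1 (database): keep documents matching any field:value term."""
    terms = [term.split(":", 1) for term in query_string.split(" OR ")]
    return [
        doc for doc in documents
        if any(value.lower() in doc.get(field, "").lower() for field, value in terms)
    ]


def is_subsequence(needle, haystack):
    # True if the characters of needle appear in haystack in the same order.
    remaining = iter(haystack)
    return all(char in remaining for char in needle)


def pick(documents, typed, match_format="{author} {title}"):
    """Stage 2 (picker): fuzzy-filter the lines shown to the user."""
    return [
        doc for doc in documents
        if is_subsequence(typed.lower(), match_format.format(**doc).lower())
    ]


found = query(DOCUMENTS, "author:einstein OR author:heisenberg")
picked = pick(found, "grav")
print(len(found), [doc["title"] for doc in picked])
```

The first string is what the database sees; the second is what the user types into the picker afterwards, and it is only matched against the formatted lines of the documents the query already returned.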

This is a lot of information to digest, so I'll let you guys digest it for a while and if you want we can discuss a roadmap.

Thank you all!

alejandrogallo commented 6 years ago

This is now solved in version v0.6.