Access to secondary databases

ederag commented 4 years ago

There is a need to access other hamster databases as well, not only the main hamster.db, e.g. for merging a foreign database into the main one (#340). As recently discussed, this is not possible using the org.gnome.Hamster D-Bus service, which is tied to the main database.

One possibility: 1) add an option (--no-dbus ?) to hamster-cli to bypass D-Bus and use db.Storage directly. 2) add an option (--db-file=<filename>) to hamster-cli. 3) Edit: Still not fully sorted out, but the following one was probably being over-cautious, as correctly indicated by @GeraldJansen below. Of course this access is risky (there is no central server to avoid simultaneous write access in this case), so if the server is running, it should be forbidden to access the server database. (that one will remain hard-coded)

Still thinking, open to discussion.

GeraldJansen commented 4 years ago

It seems to me that you underestimate sqlite3's capabilities for handling concurrent requests from multiple processes and overestimate the risk associated with accessing hamster.db directly, by-passing the hamster-service D-Bus interface (eg. here). From a bit of reading, for example

in my assessment, considering the small size our hamster.db files, the processing speed of even oldish laptops/desktops and the low likelihood of concurrent write requests, the "risk" associated with accessing hamster.db without using the D-Bus interface is extremely low.

ederag commented 4 years ago

Thanks for the links ! Lots of food for thought.

ederag commented 4 years ago

The links were spot-on, and reading carefully is fully convincing that sqlite is robust and perfect for hamster, sql writes from different processes just queue up. (timeout is possible, but the database integrity remains) It's not a matter of probabilities, but of how sqlite works. What a relief, thanks Gerald !

Next step is to check that the start_transaction / end_transaction blocs are enough to protect from database changes during python code execution, in particular this call in the middle of __add_fact__ should be checked: https://github.com/projecthamster/hamster/blob/6fa4c20266c6324bffd722a53e3f4c6b4b90af2f/src/hamster/storage/db.py#L665 Not saying there is an issue, just that it deserves to be checked twice.

The first point is almost done (PR #573). It will be used to test the second one, which should be simpler.

matthijskooijman commented 4 years ago

I actually wonder if a --no-dbus option is needed at all for this? I can imagine two flavours of "access to secondary databases":

Running hamster normally, with multiple databases active (e.g. loading data from two databases, writing data to one, or maybe other ways to decide where to write data).
Running a one-off "merge data" process.

I think the point of this issue is option 2. As for option 1, this would probably require invasive changes, which would be better suited to include in the rewrite rather than the current codebase, I believe. Also, if this would be supported usecase, I think it should actually access both databases through dbus (maybe even present them as a single db somehow), rather than bypassing dbus for this usecase.

As for option 2., I wonder if this is not something that can be implemented already? I would think you can create a script that creates two instances of hamster.storage.db.Storage and merges them into either one (or a new db) without accessing dbus at all? Or maybe even more elegant, a script that creates one extra storage and merges that into the main storage accessed through dbus. I can imagine that this needs some refactoring for code that currently assumes there is a dbus connection available and/or assumes there is only one storage available, but I expect that this would need a lot less than what #573 implements.

GeraldJansen commented 4 years ago

Yes, the point of this issue is option 2, with the proviso that "one-off" could include frequent (say daily) synchronization of databases (say work PC and home PC).

An external script (something like htool) is okay for the one-off case, but seriously risks obscurity and abandonment. Integrating that functionality into hamster-cli.py would be better. All that is needed is something like

hamster --no-dbus --db-file=<some/hamster.db> export tsv [date1 [time1]] [date2 [time2]] > some.tsv
hamster --no-dbus --db-file=<other/hamster.db> load tsv some.tsv

A lot of flexibility is gained if the whole CLI can use hamster.storage.db.Storage directly instead of the dbus interface, thereby avoiding having to be careful about stopping dbus services tied to a specific database file. This would also very convenient for mock databases for development and testing, so you don't have to continuously be worried about corrupting your precious "real" hamster.db.

matthijskooijman commented 4 years ago

Yes, the point of this issue is option 2, with the proviso that "one-off" could include frequent (say daily) synchronization of databases (say work PC and home PC).

Ah, good point. I was thinking about an "importing data from an old machine" kind of setup, but tracking on multiple machines indeed makes sense.

And indeed, just using separate export and load commands seems like an elegant way to support this usecase, while also allowing the user to do some preprocessing before import if needed.

Of course, you can still do a slightly less flexible version of this without --no-dbus, e.g.:

hamster importdb <some/hamster.db>

which would import the contents of the given db into the main db (accessed through dbus). This could be implemented by creating a secondary db.Storage next to the main client.Storage, as I previously suggested.

However, I like the elegancy and flexibility of importing using a load tsv (or similar) command.

Thanks for adding some examples :-)

projecthamster / hamster

Access to secondary databases #570