replikativ / datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
https://datahike.io
Eclipse Public License 1.0

feat: babashka pod #630

Closed. TimoKramer closed this 1 year ago.

TimoKramer commented 1 year ago

SUMMARY

This PR adds the functionality to run Datahike as a babashka pod.

Checks

Feature
TimoKramer commented 1 year ago
1. I think the `dbs` and `conns` vars used for caching in `pod.clj` can cause unbounded memory usage if you deref many different dbs. Datalevin tends to make such trade-offs for speed (in case this code is inspired by its pod code), but I think this is a bit problematic.

How would you approach it @whilo ?

whilo commented 1 year ago

@TimoKramer I read it in more detail now. You introduce a pointer system with strings to DB records and conns on this API surface in pod.clj. Since you cannot track externally when these pointers go out of scope, there is no way to free memory automatically; users would have to do so manually. You could support that by providing functions `free-db` and `free-conn` that remove the entries from the respective atoms.
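For illustration, such cleanup functions could look like the following. This is only a sketch: it assumes `dbs` and `conns` are the atoms mapping string pointers to DB records and connections in pod.clj, as mentioned above; the actual pod code may differ.

```clojure
;; Hypothetical sketch of manual cleanup for the pod's pointer caches.
;; Assumes `dbs` and `conns` are atoms of {pointer-string -> object}.
(defonce dbs (atom {}))
(defonce conns (atom {}))

(defn free-db
  "Drop a cached db record so it can be garbage collected.
  The pointer must not be used by the caller afterwards."
  [db-id]
  (swap! dbs dissoc db-id)
  nil)

(defn free-conn
  "Drop a cached connection from the pointer cache."
  [conn-id]
  (swap! conns dissoc conn-id)
  nil)
```

The downside, as noted above, is that freeing is the user's responsibility: forgetting to call these leaks exactly as before.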

Alternatively you could avoid introducing separate pointers: pass db config maps (both for db and conn) to this API as well and use clojure.core.cache to save yourself from having to create conns or db records on every call. (This is how the dhi client works: it reads in a config file for each element and understands scoping with the conn: or db: prefix, e.g. db:/home/timo/bb_dh.edn reads the edn config file and derefs it as a db.) This has the downside that you need to carry the literal maps for the connection around and pass them in, which could be slow if they are big and you need to do a lot of string serialization (I think it is fine though). Note also that this is similar to naming dbs on the server, where you introduce an additional pointer system in the form of names instead of just using self-describing literals in the form of config maps over REST.
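A minimal sketch of this alternative, keyed by the literal config map and bounded by an LRU policy. This is an assumption about how it could be wired up with clojure.core.cache, not the actual pod or dhi implementation:

```clojure
;; Sketch: cache connections keyed by the config map itself, so no separate
;; pointer namespace is needed and memory use is bounded by LRU eviction.
(require '[clojure.core.cache.wrapped :as cache]
         '[datahike.api :as d])

;; Bounded cache: at most 32 live connections; least-recently-used
;; entries are evicted automatically. The threshold is illustrative.
(def conn-cache (cache/lru-cache-factory {} :threshold 32))

(defn conn-for
  "Return the connection for a given config map, connecting on a cache miss."
  [config]
  (cache/lookup-or-miss conn-cache config d/connect))
```

Callers then always pass the config map, and eviction replaces manual freeing, at the cost of serializing the map on every call.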

Does that make sense? We can go with the current approach and ignore the leak, but changing it later would break the API for users, which is bad.

TimoKramer commented 1 year ago

I thought about the approach you took in the cli and I wonder how the user can continue using a db object. Let's say I deref a conn for a query and want to keep my view of the world consistent for a future query, so I pass the db object around. How would you do that with the approach you are using in the cli ns? I thought about a `with-db` macro, like `with-open` in clojure.java.io, to make the db ref ephemeral, and manually garbage collecting the dbs is also a good idea. Thanks for your explanation; I was aware of the problem but was hoping for this kind of discussion.
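Such a `with-db` macro could be sketched like this. It is hypothetical: `deref-db` and `free-db` stand in for pod helpers that look up and release a pointer, in the spirit of the cleanup functions discussed above.

```clojure
;; Sketch of a with-open-style macro: the db pointer is only valid inside
;; the body and is freed afterwards, even if the body throws.
(defmacro with-db
  [[sym db-id] & body]
  `(let [~sym (deref-db ~db-id)]
     (try
       ~@body
       (finally
         (free-db ~db-id)))))

;; Usage: db is a consistent snapshot for all queries in the body.
(comment
  (with-db [db "db-1"]
    (q '[:find ?n :where [?e :name ?n]] db)))
```

This bounds the lifetime of the pointer lexically, but as noted below, it only works in hosts that have such scoping constructs.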

whilo commented 1 year ago

There is a way to refer to the same DB: by using as-of (when history is active), or by pointing to a snapshot by commit-id (not implemented yet, part of the experimental https://github.com/replikativ/datahike/blob/main/src/datahike/experimental/versioning.cljc). with-open would work in babashka, but it does not exist in other languages in general (e.g. shells), so for the cli it would not work, I think. You could just call free-db on .close in this case, I think.
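The as-of route can be sketched as follows (assuming the store was created with history enabled; the transaction id here is illustrative):

```clojure
;; Sketch: pin a point-in-time view with as-of, so the same database value
;; can be reconstructed later from the conn plus a transaction id, without
;; keeping the db object alive across calls.
(require '[datahike.api :as d])

(def tx-id 536870918)               ; a tx id captured from an earlier transact
(def db-then (d/as-of @conn tx-id)) ; stable view as of that transaction

(d/q '[:find ?n :where [?e :name ?n]] db-then)
```

Because the view is named by `(conn, tx-id)` rather than by a cached object, no pointer cache entry needs to outlive the call.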

TimoKramer commented 1 year ago

@whilo do you like the solution?