tonsky / persistent-sorted-set

Fast B-tree based persistent sorted set for Clojure/Script
MIT License
81 stars 16 forks source link

Adding durability for ClojureScript #14

Open logseq-cldwalker opened 11 months ago

logseq-cldwalker commented 11 months ago

Hi @tonsky. Thanks for this handy library! This is a conversational PR to see if you'd be interested in our ClojureScript durability that @tiensonqin added as part of our datascript.storage cljs implementation. We are using this forked datascript in our product's feature branch and it is working well for our needs. The existing cljs tests on this repo are passing as our the relevant upstream datascript.storage cljs tests. I'm aware there's some minor whitespace and commenting that needs to be cleaned up here as well as cljs storage tests that need to be ported. I could help with those but for any design questions I'd defer to @tiensonqin. Would you be interested in a contribution like this?

tonsky commented 11 months ago

Yes! I initially skipped CLJS implementation because I thought localStorage is too small to be useful, and IndexedDB is async. What are you guys using for storage?

logseq-cldwalker commented 11 months ago

Cool. We're using SQLite via OPFS. From the OPFS approaches that SQLite offers we chose the OPFS pool approach because it offers a sync handle. The async approach proved too difficult and we're unsure if it's doable

tonsky commented 10 months ago

Please let me know when it’s ready for review

logseq-cldwalker commented 10 months ago

Sure. This is ready for review

logseq-cldwalker commented 10 months ago

@tonsky Sorry. There was actually still another bug with deletion. Since I'm going to be away on holiday vacation, I'm putting this back until draft until I get back the first week of next year. Any low-effort feedback would be welcome. Cheers

tonsky commented 10 months ago

Sure. Will give it a look

whilo commented 10 months ago

@logseq-cldwalker What are your main insights into asynchronous support? @pkpkpk and I have also started working on async version of the persistent-sorted-set. I think both async and sync execution models have benefits and drawbacks.

tiensonqin commented 10 months ago

@tonsky Thanks for looking into this PR and all the beautiful work for the Clojure community!

tiensonqin commented 10 months ago

@whilo Hey! I think the main reason for Logseq is that it'll require tons of effort to migrate the existing code base to be asynchronous, we still face the challenge that OPFS with SQLite can only be used in a web worker, which is great not to block the main UI thread, but it means both queries and transact will be async, we'll experiment soon with the idea to have an in-memory datascript db in the main UI thread for caching and sync the data with the full db from the worker.

It'll be nice to have async support so that people can choose to store the data in IndexedDB.

tonsky commented 10 months ago

@tiensonqin thanks to you for your consistent support and for such a significant contribution in such a tricky part of the system!

whilo commented 10 months ago

@tiensonqin Thank you for laying that out, that makes sense. A solution I originally developed for replikativ with the hitchhiker-tree and we are currently revisiting (https://github.com/replikativ/datahike/discussions/429) is to stream tree fragment deltas incrementally and then update the db root after everything is in store/storage. That way you can realize your synchronous DataScript scenario. You just need to transact first into a storage system somewhere and then react to its confirmation/updates. I think this would be a simple and nice model to synchronize logseq, but I don't know whether you want to treat the markdown files or the DataScript as the primary source of truth.

logseq-cldwalker commented 10 months ago

@tonsky Are there things you're waiting for from us? I think Tienson addressed most of the feedback

tonsky commented 9 months ago

Oops, sorry, no. I’ll take a look

tonsky commented 9 months ago

@tiensonqin

Okay, I think I finally understand what are you doing. Sorry it took so long to catch up.

I like the idea! It can’t be the default mode though, default should be normal persistent data structure where you can keep references to old copies.

But as an option, I’d like to have it too. I can imagine an app that can only keep one reference to the latest DB at all times. If that eliminates GC, I see how it is beneficial.

So let’s say we want to get rid of GC at all. Right now you have two behaviours: some addressed get erased on next store (the ones that are getteing reused, marked with _dirty), some are erased immediately as database is changed (delete storage [unused-addresses])

I propose we move it all the next store.

  1. Add some sort of queue of freed up addresses to the top level of the tree (through dynamic var?)

  2. When node gets changed/split/merged, its address is added to that queue. Node itself does not remember last address, it just gets set to nil, same as in clj implementation. This will let us get rid of _dirty flag (has .-address === stored).

  3. When the time comes to store new version of the set, we first do (delete storage unused-addresses) and then let the storage allocate new addresses. Upside: it can happen in a batch call.

So it’s not exactly address reuse, more like freeing addresses as we go and allocating new ones.

If a storage doesn’t want to clean up freed addresses immediately, it can make the implementation noop.

Would that work for you?

logseq-cldwalker commented 9 months ago

Tienson is out right now for Chinese New Year. Hopefully he can respond soon after he gets back. In the meantime, we've been able to use this PR successfully on databases up to 9.3M datoms which translates to ~1.4G on disk

tonsky commented 9 months ago

Awesome!