out-of-cheese-error / gooseberry

A command line utility to generate a knowledge base from Hypothesis annotations
Apache License 2.0

Displaying annotations is too long #75

Closed: ngirard closed this issue 2 years ago

ngirard commented 3 years ago

Using a home-made binary built from trunk on Ubuntu 20.04, gooseberry search takes more than 7 seconds to display its output for 693 annotations. That's very long. Is there any reason why? I'm assuming no network operations are involved here, is that so?

Ninjani commented 3 years ago

I'm assuming no network operations are involved here, is that so?

I'm afraid gooseberry uses network operations for everything: annotations are always queried from Hypothesis, and the local database just stores their IDs and tags for filtering. I had a version in the past that also stored local copies and only changed the ones with an update date later than the last check, but at that point the bincode serializer didn't support tagged enums, meaning the only way to store annotations was as raw JSON, which seemed like it would be pretty memory hungry. Maybe it's time to revisit this now, I've been using it on ~400 annotations so far but I've always used search with filters, so the delay was not so noticeable.

Edit: I assume you're building the binary with --release?
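For readers following along, the incremental approach mentioned above (keep local copies and refresh only the annotations whose update date is later than the last check) could look roughly like the sketch below. This is not gooseberry's actual code: the `Annotation` and `LocalCache` types, their fields, and the use of plain integer timestamps are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified annotation record (not gooseberry's real type).
#[derive(Debug, Clone, PartialEq)]
struct Annotation {
    id: String,
    /// Last-update time in seconds since the Unix epoch (an assumption made
    /// to keep the comparison simple).
    updated: u64,
    text: String,
}

/// Hypothetical local cache keyed by annotation ID.
#[derive(Debug, Default)]
struct LocalCache {
    annotations: HashMap<String, Annotation>,
    /// Time of the previous sync; 0 means "never synced".
    last_checked: u64,
}

impl LocalCache {
    /// Merge a batch fetched from the server, writing only the entries that
    /// were updated after the previous check. `now` becomes the new
    /// `last_checked`. Returns how many entries were written.
    fn merge(&mut self, fetched: Vec<Annotation>, now: u64) -> usize {
        let mut written = 0;
        for annotation in fetched {
            if annotation.updated > self.last_checked {
                self.annotations.insert(annotation.id.clone(), annotation);
                written += 1;
            }
        }
        self.last_checked = now;
        written
    }
}
```

With a cache like this, a second sync over unchanged annotations writes nothing, so most of the cost moves to the first sync. A real implementation would also have to handle annotations deleted on the server, which this sketch ignores.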

Ninjani commented 3 years ago

Another point here is that storing annotations locally needs the entire hypothesis filtering functionality to be re-implemented and kept up to date with whatever new filters are added to the official API.
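To give a sense of what re-implementing filtering locally involves, here is a minimal sketch that covers only two filters (an exact tag and a URI substring). The `Annotation` and `Filter` types and the `search` function are hypothetical and are not taken from gooseberry or from the Hypothesis API; the official search supports more parameters, which is exactly the maintenance burden described above.

```rust
/// Hypothetical, simplified annotation record (not gooseberry's real type).
#[derive(Debug, Clone, PartialEq)]
struct Annotation {
    id: String,
    uri: String,
    tags: Vec<String>,
    text: String,
}

/// Hypothetical filter covering a small subset of what a remote search offers.
/// A `None` field means "do not filter on this".
#[derive(Debug, Default)]
struct Filter {
    tag: Option<String>,
    uri_contains: Option<String>,
}

impl Filter {
    /// True when the annotation satisfies every filter that is set.
    fn matches(&self, annotation: &Annotation) -> bool {
        let tag_ok = self
            .tag
            .as_ref()
            .map_or(true, |tag| annotation.tags.iter().any(|t| t == tag));
        let uri_ok = self
            .uri_contains
            .as_ref()
            .map_or(true, |part| annotation.uri.contains(part.as_str()));
        tag_ok && uri_ok
    }
}

/// Return references to the locally stored annotations that match `filter`.
fn search<'a>(annotations: &'a [Annotation], filter: &Filter) -> Vec<&'a Annotation> {
    annotations.iter().filter(|a| filter.matches(a)).collect()
}
```

Each additional filter the official API gains would need a matching field and check here, so the local version only stays equivalent as long as someone keeps it in step.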

ngirard commented 3 years ago

I assume you're building the binary with --release?

No, I was too lazy to do so, and will report back with the precompiled binary that is baking as I'm writing. I guess since we're I/O-bound, there shouldn't be much difference, but hey.

Ninjani commented 3 years ago

Okay, let's see! The entire sync aspect right now is pretty much only there in case I decide to add back the local annotations thing, otherwise I can get rid of it.

ngirard commented 3 years ago

Maybe it's time to revisit this now, I've been using it on ~400 annotations so far but I've always used search with filters, so the delay was not so noticeable.

Well.

On one hand, I reported my user experience based on the expectations I had back then. I think that, if the docs explained that gooseberry uses network operations for everything, and that the preferred workflow is to limit the search with query operators such as --from, then my expectation (and, I assume, others') would have been different, and I'd find a 2-second delay perfectly acceptable.

On the other hand, I'm convinced that storing the annotations locally is the way to go. Heck, maybe the reason Hypothes.is hasn't become very popular is that both a freaking browser and a freaking web app are imposed as intermediaries between you and your data?

Since you seem to be familiar with Obsidian, one of the biggest reasons it has gotten so popular is that it lets you manipulate your data locally, whereas the competitors (e.g. Roam) don't.

Also, perhaps you'll think I'm too opinionated, but I'm also convinced that the best way to store the annotations is a SQLite database. There are many reasons I could elaborate on, but in a nutshell: it's a de facto standard, it removes all barriers to connecting your data to any other tool or need, and data access could be made very fast for any other application using views and triggers.

What do you think?

ngirard commented 3 years ago

Reporting back: using the precompiled binary, gooseberry search now takes 5 seconds, as opposed to 7+ with the debug build. That's not negligible!

Ninjani commented 3 years ago

Since you seem to be familiar with Obsidian, one of the biggest reasons it has gotten so popular is that it lets you manipulate your data locally, whereas the competitors (e.g. Roam) don't.

This is definitely a pretty strong point and one of the main reasons I wanted to do it that way in the first place, but then I realized how much work it'd be to recreate the whole filtering system and decided to get a quick working tool first :p

I agree that 5 seconds is not negligible, and I guess it'll get worse as the number of annotations increases, which is frustrating, especially if you don't expect the old annotations to change much. I'll make a new issue for this and indeed look into SQLite - will take some refactoring, but this has to be done anyway since I'm planning to make an Obsidian plugin version that uses gooseberry as a library.

ngirard commented 3 years ago

I'll make a new issue for this and indeed look into SQLite - will take some refactoring, but this has to be done anyway since I'm planning to make an Obsidian plugin version that uses gooseberry as a library.

That's so nice to hear! I'm willing to help with the SQLite stuff, I'm just unlikely to be able to devote much time to this in the next 2 months. But in any case, let's keep in touch, and don't hesitate to ask me for anything!

ngirard commented 3 years ago

storing annotations locally needs the entire hypothesis filtering functionality to be re-implemented and kept up to date with whatever new filters are added to the official API.

I'm pretty sure it won't be a problem, given the slow pace the Hypothes.is project is advancing at...

Ninjani commented 3 years ago

I'm willing to help with the SQLite stuff, I'm just unlikely to be able to devote much time to this in the next 2 months. But in any case, let's keep in touch, and don't hesitate to ask me for anything!

Great, will do, thanks!

Ninjani commented 2 years ago

@ngirard it's been a while but I have a PR (#99) with the local database functionality (using CBOR which actually needed pretty minimal refactoring of the codebase).

I'd like to test it out for a couple of common workflows before merging, to make sure everything stays in sync - would you be up for seeing how much difference it makes to your search times?

ngirard commented 2 years ago

Hey @Ninjani, good to hear from you! Congrats on your PhD and your postdoc position! Hope you're doing well at your new place.

I'm actually about to start a new assignment, so it's the perfect time for me to rethink my habits & workflows. I have to admit that my old habits kicked in last year and I ended up putting Gooseberry on the back burner, especially since my colleagues didn't care to annotate their information sources, hence the lack of feedback from me. I apologize for that.

I wish to adopt Gooseberry... again, so I'll take a stab at this new PR today. May I ask why you chose CBOR serialization instead of SQLite? In any case, it's your project and the choice is yours to make.

Also, I'm afraid I'll be reporting a few issues over the weekend — nothing big, fortunately.

Cheers!

Ninjani commented 2 years ago

Thanks @ngirard!

I went with CBOR because it has serde support via ciborium and serializes into a binary format that I could then store directly in the existing sled database, so it needed quite minimal refactoring of the codebase.

Good to hear that you'll test out the PR and I welcome the issues as well.

Cheers!