textileio / js-threaddb

This project has been moved to https://github.com/textileio/js-textile
MIT License

Full text search #36

Open cyphercider opened 3 years ago

cyphercider commented 3 years ago

Hi @carsonfarmer / Textile team,

I was just playing with this database in a browser locally and trying to figure out how I would do text search. Unless I'm missing it in the docs, I don't see any operators in the dexie-mongoify library that support text search (i.e. the $text operator).

I ask partly because if this is a feature that is lacking, I might try my hand at contributing. I have a one-off document search mechanism in my project (using Dexie) that I could use as reference. I plan to ditch my existing solution for js-threaddb when it's ready, so I thought I might try to contribute if there is somewhere I can be helpful.

EDIT: here's the full-text search example from the Dexie repo that I used. It searches all words in a string, but only using "starts with". So you can't search for a string in the middle of a word, and it definitely doesn't have any "fuzzy search" capabilities. I'm going to keep looking around for a good full-text search solution. I'll probably pursue making a PR to add the $text operator to dexie-mongoify.
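For illustration, the "starts with" behavior described above can be sketched roughly as follows. This is a minimal in-memory sketch, not the Dexie example itself: `PrefixIndex`, `tokenize`, and the sample documents are hypothetical names, and a plain `Map` stands in for whatever index the real database would provide.

```typescript
// In-memory stand-in for a word index (an assumption, not Dexie's API).
interface Doc {
  id: number;
  text: string;
}

// Split text into unique lowercase words.
function tokenize(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return Array.from(new Set(words));
}

class PrefixIndex {
  // word -> ids of the documents that contain that word
  private words = new Map<string, Set<number>>();

  add(doc: Doc): void {
    for (const word of tokenize(doc.text)) {
      if (!this.words.has(word)) this.words.set(word, new Set<number>());
      this.words.get(word)!.add(doc.id);
    }
  }

  // Ids of documents with at least one word that starts with `prefix`.
  search(prefix: string): number[] {
    const needle = prefix.toLowerCase();
    const hits = new Set<number>();
    if (needle.length === 0) return [];
    this.words.forEach((ids, word) => {
      if (word.startsWith(needle)) ids.forEach((id) => hits.add(id));
    });
    return Array.from(hits).sort((a, b) => a - b);
  }
}

const index = new PrefixIndex();
index.add({ id: 1, text: "Decentralized storage for the web" });
index.add({ id: 2, text: "Local-first database sync" });

// "dec" matches the start of "decentralized"; "central" sits mid-word and finds nothing.
const startMatch = index.search("dec");
const midWordMatch = index.search("central");
```

The mid-word miss is the limitation mentioned above: only word prefixes are indexed, so substring and fuzzy matches would need a different structure.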

EDIT 2: I also found ydn-db-fulltext from searching around. Please let me know if you have any other leads on existing solutions for the fulltext search in IndexedDB problem :-).

carsonfarmer commented 3 years ago

This is awesome @jeffkhull! I don't currently have plans to implement full text search (though of course, it would be great), so any work you are interested in doing here is greatly appreciated. PRs definitely accepted.

cyphercider commented 3 years ago

@carsonfarmer, cross-posting from dexie-mongoify so we can continue the discussion, thanks!


I have a question about the strategy for extending functionality in js-threaddb. Do you intend to implement a plugin model similar to PouchDB? I'm wondering how fancy we should get with full-text search. Do we take a super simple approach (as with the example provided by the Dexie author linked in my original post), or do we get more ambitious, with natural language parsing as in ydn-db-fulltext? I'm inclined to start with the simpler approach, particularly if the approach to extensibility is uncertain.

There's quite a bit of "prior art" on this topic (as with PouchDB), but js-threaddb is unique AFAIK in that it syncs to the distributed web. I have been meaning to research whether there are any in-progress initiatives to build a more robust native storage technology in the browser. IndexedDB itself seems a bit tired, and not commensurate with the massive importance (in my opinion) of a robust, full-featured local database for apps that rely on decentralized storage and that must keep working with full functionality on unstable networks or offline.

I'm getting ahead of myself a bit, but let me know what you think about proceeding with the simplest option for now and continuing to evaluate whether to bite off something more ambitious - like natural language parsing, term ranking, typo detection, etc. - in a future release.

carsonfarmer commented 3 years ago

At the moment, there are no plans for a plugin model per se... in fact, I'd like threaddb to be less a database itself, and more just the "remote" syncing component of multiple database offerings. Building on Dexie et al. right now is a means to an end. So the more functionality that can be pushed to those projects, and/or that can be done using a plugin mechanism within those projects (but bundled with threaddb), the better.

Initially, my vote would be

proceeding with the simplest option for now and continuing to evaluate whether to bite off something more ambitious

cyphercider commented 3 years ago

in fact, I'd like threaddb to be less a database itself, and more just the "remote" syncing component of multiple database offerings.

This makes a ton of sense to me. Speaking of this, do you think it would make sense to open up a conversation with the folks behind PouchDB (and maybe others - I haven't done a comprehensive analysis of existing solutions) to see what it would take to sync to Textile rather than CouchDB? I'm sure if it were possible to use these existing solutions, it would be enormously beneficial to Textile's focus on creating a robust syncing system.

By the way, we exchanged some messages a couple of months back about client-side encryption, and about the current Hub implementation doing encryption server-side. You mentioned that you'll be looking at client-side encryption in the future. Do you see that as being something that is possible with the initial release of the threaddb sync component? In my mind, client-side encryption is hugely important for advancing web3's vision of data privacy and e2e encryption. I believe you said the reason server-side encryption is used with Hub is to resolve conflicts. I wasn't sure how that requirement may change with the local-first approach taken by threaddb.

Initially, my vote would be

proceeding with the simplest option for now and continuing to evaluate whether to bite off something more ambitious

Sounds good! I will pursue working on a PR to dexie-mongoify to implement the simplest option for the $text operator. There is time contention with a pretty big project I'm working on right now, but I should be able to have something done before the end of the year at the latest.
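As a rough idea of what "the simplest option" might look like, here is a sketch of evaluating a $text-style filter against plain objects. Everything in it is an assumption for illustration: the query shape is only modeled loosely on MongoDB's `{ $text: { $search } }`, and `matchesText`, `splitWords`, and the sample notes are hypothetical names, not part of dexie-mongoify.

```typescript
// Hypothetical query shape, loosely modeled on MongoDB's $text operator.
interface TextQuery {
  $text: { $search: string };
}

// Lowercase the text and split it into words.
function splitWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
}

// True when any search term is a prefix of any word in the listed string fields.
function matchesText(
  doc: Record<string, unknown>,
  fields: string[],
  query: TextQuery
): boolean {
  const terms = splitWords(query.$text.$search);
  const docWords: string[] = [];
  for (const field of fields) {
    const value = doc[field];
    if (typeof value === "string") docWords.push(...splitWords(value));
  }
  return terms.some((term) => docWords.some((word) => word.startsWith(term)));
}

// Assumed sample data.
const notes = [
  { _id: "a", title: "Grocery list", body: "Buy coffee and milk" },
  { _id: "b", title: "Meeting", body: "Discuss sync roadmap" },
];

// Helper that returns the ids of notes matching a search string.
function findIds(search: string): string[] {
  return notes
    .filter((n) => matchesText(n, ["title", "body"], { $text: { $search: search } }))
    .map((n) => n._id);
}

const bothNotes = findIds("coff road");
```

Terms are OR'd together here, so a note matches if any one term is a word prefix. This scans every document in memory; a real operator would presumably lean on an indexed word list instead.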