pubkey / rxdb

A fast, local first, reactive Database for JavaScript Applications https://rxdb.info/
Apache License 2.0

Several collections in one PouchDB #29

Closed ssured closed 7 years ago

ssured commented 7 years ago

As I understand it, currently each collection resides in a separate PouchDB on the client. Does this imply that when syncing data, each collection needs its own replication connection to the central server?

If so, would it be possible to merge collections into one db? Prefixing _id's is how I currently do it, and it works like a charm.

pubkey commented 7 years ago

@ssured There is currently no plan to add sub-collections to RxDB. I will play around a bit at the weekend and check how difficult this would be and what the advantages could be.

ssured commented 7 years ago

Ok cool, thanks for answering. Maybe I can better rephrase my question as 'how do you sync RxDB to CouchDB?'. Did you do this in production? Is the only way to sync each collection separately, or is there a way to combine these into one replication stream?

My experience is that CouchDB does not handle too many sync streams well (at 1000 open _changes feed listeners, things get unstable). When you have an RxDB with 10 entities, you'll have 10 collections which use 10 connections to the DB for syncing. Thus you are limited to around 100 users.

One option would be to implement a custom sync stream, using https://github.com/nolanlawson/pouchdb-replication-stream as a starting point, which tags each line with the collection it came from. Using the regular API is my preference though, which is why I'm asking this in the first place.

Besides the connections to the DB there is another problem. Authorization of DBs in CouchDB is a tedious task, and it is easy to make mistakes when managing all the permissions. Having only one DB whose _security needs managing is far easier than handling _security per collection.

What would happen if you ditch support for custom values in the _id field? My _id's are always generated from code and usually have this form:

<entitytype>-dddddddd-????????

  - <entitytype> is a string with the name of the collection
  - dddddddd encodes a (user) timestamp in base36: (new Date()).valueOf().toString(36)
  - ???????? encodes a random string generated by the end user: Math.random().toString(36).substr(2,8)

For me this structure has several benefits.
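For illustration, a minimal sketch of a generator for _ids of this shape (the function name generateId and the example entity type are made up, not part of RxDB):

```js
// minimal sketch of the _id scheme described above
function generateId(entityType) {
  const timestamp = new Date().valueOf().toString(36);    // base36 (user) timestamp
  const random = Math.random().toString(36).substr(2, 8); // 8 random base36 characters
  return entityType + '-' + timestamp + '-' + random;
}

generateId('hero'); // e.g. "hero-iy3gbb9k-f3k2p9qa"
```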

pubkey commented 7 years ago

Hi @ssured I thought long about if and how to implement 'sub-collections' which share the same pouchdb. I'm sorry, but I don't think it should be implemented in RxDB.

Yes, when you sync many collections/pouchdbs with a server, you have many open data-streams which heavily affect the server's performance. pouchdb-replication-stream is an acceptable way to handle this problem and it can currently be used with RxDB.
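For reference, a rough sketch of how pouchdb-replication-stream can be wired up with an RxDB collection, following that project's README (the MemoryStream dependency and database names are only for illustration, and myCollection.pouch assumes RxDB exposes the collection's underlying PouchDB instance):

```js
const PouchDB = require('pouchdb');
const replicationStream = require('pouchdb-replication-stream');
const MemoryStream = require('memorystream');

PouchDB.plugin(replicationStream.plugin);
PouchDB.adapter('writableStream', replicationStream.adapters.writableStream);

// dump the collection's underlying pouchdb into an in-memory stream
// and load it into another database in one pass
const stream = new MemoryStream();
const source = myCollection.pouch;             // underlying PouchDB of an RxDB collection
const target = new PouchDB('replicated_copy');

Promise.all([
  source.dump(stream),
  target.load(stream)
]).then(() => console.log('stream replication done'));
```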

The authorization issue is the next problem. Here, maybe someone should create a pouchdb-plugin which applies the authorization across many pouchdb-instances.

Using _id: The _id is, like in most other noSQL-databases, a primary index which comes free of performance-costs. By abusing it via RxDB, we would "waste" this free index and users would mostly have to create another one. If this were the default in RxDB, RxDB's performance would be at a disadvantage compared to pouchdb, just because of this single feature. It would also make every other piece of code more complex, because every query would have to guarantee that only documents of the current subcollection are valid results.
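To illustrate that last point: with subcollections sharing one _id space, every lookup would have to be constrained to the subcollection's key range, for example with PouchDB's allDocs (assuming db is the shared PouchDB instance and 'heroes' is just an example collection name):

```js
// fetch only the documents of the "heroes" subcollection by restricting
// the primary index to the _id prefix range
db.allDocs({
  include_docs: true,
  startkey: 'heroes-',
  endkey: 'heroes-\ufff0' // high unicode sentinel: everything that starts with "heroes-"
}).then(result => {
  const heroes = result.rows.map(row => row.doc);
});
```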

ssured commented 7 years ago

Thanks for your thoughts on the subject;

1> From a one-collection point of view I fully agree; for multiple collections I believe some extra (plugin?) functionality is needed to enable replication at production scale. Thanks to the nice PouchDB ecosystem, that should not be too hard.

2> A plugin would need to be developed. I use https://github.com/colinskow/superlogin in production and it works quite well; maybe extending superlogin is a quick way to a proper solution here. Another item to be developed.

3> This is a hard one. I kind of disagree here. It is quite common to have meaningless IDs in DB land, especially in the SQL world. From my point of view it's not a waste to encode entity information in the ID to make it possible to sync in a performant way. Once https://github.com/pouchdb/pouchdb/issues/5207 lands, the way PouchDB indexes data will change and custom indexes will be much, much cheaper. IDB next is quite close to delivery; there's a beta adapter already shipped in the official PouchDB release (pouchdb-next.js).

I spent some time thinking about your points above too. Another viable _id format might be to just prefix the _id field with the name of the collection. As the _id must always be a string, it's easy to prepend it with the name and a delimiter. It does indeed make processing the _id field a little bit more complex, but that should not be too hard.

Querying and sorting the data works the same, only a little overhead is added. What will be (almost) impossible is to change the name of an entity.
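A tiny sketch of what that extra _id processing could look like (the delimiter and helper names are arbitrary choices for illustration):

```js
const DELIMITER = '::';

// build a prefixed _id for a document of a given collection
function toPrefixedId(collectionName, id) {
  return collectionName + DELIMITER + id;
}

// split the collection name off again when reading a document back
function fromPrefixedId(prefixedId) {
  const index = prefixedId.indexOf(DELIMITER);
  return {
    collection: prefixedId.slice(0, index),
    id: prefixedId.slice(index + DELIMITER.length)
  };
}
```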

For my application I'll need collision protection, i.e. it should not be possible for two offline apps to create the same document ID. Being able to not specify a primary ID (already possible) and being able to pass a custom _id generator to a collection (not possible yet?) would allow me to use the _id format described in the initial post.

Please don't get me wrong; I love your project and really want to use it as it has quite some overlap with what I'm using for my products now. What attracts me so much is that I was planning to move to RxJS to connect with PouchDB, as streams are a natural fit for syncing DBs. The above points are the gaps I see for moving to RxDB. My guess is that these gaps are relevant to other people too, but I might be very very wrong... I'll play around in my head and see if I have time to work on the plugins/extras as explained above under 1> and 2>.

ssured commented 7 years ago

Can you maybe share why you closed this issue?

pubkey commented 7 years ago

Hi @ssured I'm sorry for closing this silently; that was wrong.

I'm currently working on v3.0.0 and still had in mind how we could efficiently share the same pouchdb-instance between different collections. In v3 I added the migrationStrategies and other things, and I feel safe to say that there is no easy implementation of 'sub-collections' which does not create big technical debt.

The best solution is still to use pouchdb-replication-stream 'by hand', and it would maybe be a good idea to create an rxdb-plugin out of it to make it fit better. But since I don't need this myself, someone else will have to do it.

@ssured please reopen this issue if you plan to solve this problem.

ssured commented 7 years ago

Thanks for your answer. Big 👍 on v3, migrations are essential. I'm contemplating moving to RxDB; if I do, then I think the conclusion of this issue is that we'll need to start an ecosystem of addons, just like PouchDB has. Thanks!

jefbarn commented 7 years ago

This issue would also hold me up from migrating my DB to rxdb. Per CouchDB recommendations, I currently use the 'everything in one database and prefix the _id' strategy for couchdb collections, as I believe many people do (otherwise you lose the ability to do joins in map/reduce). The pouchdb-replication-stream suggestion is interesting; I'm curious how this would work. Create a pouchdb instance that replicates with the main couchdb database, then push that to an in-memory stream and use pouchdb-replication-stream to replicate with that in-memory stream? It could work... maybe a little fragile. Does rxdb currently have a plugin system? How would that work?
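A hedged sketch of the pipeline described above, under the assumption that the single remote database is first mirrored into one local pouchdb and then streamed onwards (the URL and database names are made up; splitting the documents back into per-collection RxDB collections is the part that would still need a plugin):

```js
const PouchDB = require('pouchdb');
const replicationStream = require('pouchdb-replication-stream');
const MemoryStream = require('memorystream');

PouchDB.plugin(replicationStream.plugin);
PouchDB.adapter('writableStream', replicationStream.adapters.writableStream);

// 1. mirror the single remote couchdb database into one local pouchdb
const mirror = new PouchDB('local_mirror');
const remote = new PouchDB('https://couch.example.com/mydb'); // hypothetical remote

mirror.replicate.from(remote)
  .then(() => {
    // 2. push the mirror through an in-memory stream into another local database
    const stream = new MemoryStream();
    const local = new PouchDB('local_copy');
    return Promise.all([mirror.dump(stream), local.load(stream)]);
  })
  .then(() => console.log('mirrored and streamed'));
```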

pubkey commented 7 years ago

@jefbarn RxDB does not have a plugin-system at the moment. We have RxDB.plugin(), which is currently just a tunnel to PouchDB.plugin(), but the plan is to extend it into a real plugin-system at a later time.

A main benefit of pouchdb, couchdb and rxdb is that the performance-cost per db-instance is next to zero. Therefore creating many, many collections is often the way to go, and we should not micro-optimize this by merging different rxdb-collections into one instance. I think we need an rxdb-plugin which extends pouchdb-replication-stream to make it easy to use the same stream for every collection-replication of a single RxDatabase.
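A very rough sketch of the shape such a plugin could take, picking up the earlier idea of tagging each dumped line with the collection it came from. Nothing here is an existing RxDB API: database.collections and collection.pouch are assumed to expose the collections and their underlying PouchDB instances, and dump() is assumed to end the stream it writes to, as pouchdb-replication-stream's README suggests.

```js
const { Transform, finished } = require('stream');
const { promisify } = require('util');
const finishedAsync = promisify(finished);

// prefix every line of a dump with the name of the collection it came from
function tagWith(collectionName) {
  let buffered = '';
  return new Transform({
    transform(chunk, _encoding, callback) {
      buffered += chunk.toString();
      const lines = buffered.split('\n');
      buffered = lines.pop(); // keep a possibly incomplete last line for the next chunk
      const tagged = lines.map(line => collectionName + '\t' + line + '\n').join('');
      if (tagged) this.push(tagged);
      callback();
    },
    flush(callback) {
      if (buffered) this.push(collectionName + '\t' + buffered + '\n');
      callback();
    }
  });
}

// dump every collection of an RxDatabase into one shared writable stream
async function dumpDatabaseToStream(database, outStream) {
  for (const [name, collection] of Object.entries(database.collections)) {
    const tagger = tagWith(name);
    tagger.pipe(outStream, { end: false }); // keep the shared stream open between collections
    await collection.pouch.dump(tagger);    // pouchdb-replication-stream writes and ends the tagger
    await finishedAsync(tagger);            // wait until the tagged lines have reached outStream
  }
  outStream.end();
}
```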

I do not recommend hacking joins into map/reduce by putting every doc in the same collection. If you need joins, don't use noSQL. (Just my opinion, I could be wrong.)

motin commented 6 years ago

@pubkey: If I understand this correctly, this means that:

  1. an app that uses N different collections requires an N-databases-per-user CouchDB setup - either literally, by creating CouchDB databases like johnsmith_heroes, johnsmith_games, johnsmith_todos (see the sketch after this list), or possibly by some clever node-proxy which exposes virtual databases and rewrites _id-values on the fly (a good starting point for such a proxy could be https://github.com/cloudant-labs/envoy, which exposes virtual databases - one per user - and could be extended or modified to allow for several virtual databases per user).

  2. https://github.com/pouchdb-community/pouchdb-replication-stream would not remove the requirement of N databases per user, but it could potentially remove the need for N replication streams per client connecting simultaneously to the CouchDB server.
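For what it's worth, a small sketch of what option 1 looks like from the client side, syncing every RxDB collection to its own per-user remote database (db stands for the RxDatabase instance, the server URL and username are made up, and collection.sync({ remote, options }) assumes the signature of the RxDB version in use):

```js
const COUCH_URL = 'https://couch.example.com'; // hypothetical CouchDB server
const username = 'johnsmith';

// one remote database per user *and* per collection,
// e.g. johnsmith_heroes, johnsmith_games, johnsmith_todos
Object.keys(db.collections).forEach(name => {
  db.collections[name].sync({
    remote: COUCH_URL + '/' + username + '_' + name,
    options: { live: true, retry: true }
  });
});
```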

PS @ssured: Did you end up moving to RxDB? :)

motin commented 6 years ago

I just found this: https://www.npmjs.com/package/rxdb-utils#replication:

replication: Will allow for filtered replication of collections to a single remote instance. This would allow you to use a single remote pouchdb/couchdb database (per user, if applicable) to save all collections, instead of using one remote instance per user and collection.

In order to achieve so, all schemas will be modified by adding an rx_model property to all collections, which will be populated for all documents with the name of the collection. The key for this property will not change even if you activate key compression.

Seems like the right approach to solve the n-databases-per-user issue.
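To make the quoted approach concrete, a hedged sketch of what such filtered replication could look like at the PouchDB level (the remote URL is made up, heroes.pouch assumes access to the collection's underlying PouchDB instance, and the selector option needs a Mango-capable CouchDB 2.x remote):

```js
const PouchDB = require('pouchdb');

// one shared remote database per user holds the documents of *all* collections
const remote = new PouchDB('https://couch.example.com/johnsmith');

// pull only the documents that belong to the "heroes" collection,
// using the rx_model field that rxdb-utils adds to every document
heroes.pouch.replicate.from(remote, {
  live: true,
  retry: true,
  selector: { rx_model: 'heroes' }
});
```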