jdmintz commented 5 years ago

Both Cloudant and CouchDB support partitioned databases which make querying less expensive - computationally in CouchDB, monetarily on Cloudant.

https://cloud.ibm.com/docs/services/Cloudant/guides?topic=cloudant-database-partitioning

Please consider adding this support for Loopback users.

(Updated by JannyHou):

Design thought see https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-536586133 and reference in https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-537576508

Proposal and follow up stories see https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-537603423, we can break down the implementation into 6 stories accordingly.

Acceptance Criteria

[ ] Investigate the work required to support the partitioned database in Cloudant. It could be, but not limited to, one or more of the following:
- update to @cloudant/cloudant driver (mentioned above)
- user can utilize the partitioned database feature by changing some configurations
- some changes need to be done within the loopback-connector-cloudant connector to support that.
[ ] Create follow up tasks to implement this if changes are required on the LB side
- If applies, make sure we include documentation to let our users know how to enable the connector to support partitioned database.

Reference

dhmlau commented 5 years ago

cc @raymondfeng @jannyHou

raymondfeng commented 5 years ago

See https://github.com/cloudant/nodejs-cloudant/commit/d4003759b006fecb4f361527a5495ce347285ed1

@jannyHou I think we need to switch to @cloudant/cloudant as the driver now.

dhmlau commented 5 years ago

Gathering information from @jannyHou and @raymondfeng, we'll need to have further investigation to see what's the work involved. It might involve one or more of the following:

update to @cloudant/cloudantdriver (mentioned above)
user can utilize the partitioned database feature by changing some configurations
some changes need to be done within the loopback-connector-cloudant connector to support that.

I'd like to make this task as a spike first.

agnes512 commented 5 years ago

Discussion from the estimation meeting:

check partition key settings
investigate how to pass partition key to drivers
notice that couchDB connector has the most implementations but CloudantDB connector only has a few

jannyHou commented 5 years ago

Did a quick research, we can do the following to support partitioned database:

Upgrade the driver to be cloudant/nodejs-cloudant
Expose partition related APIs from cloudant/nodejs-cloudant so that users can leverage native APIs to create db/query records in a particular partition/etc...
Make sure the database is created to support partition. (Either manually or by calling LB APIs.)
Modify the index to distinguish between partitioned and global search
- A comparison between partitioned search vs global search see comment
- The index for 'loopbackmodelname' should still be global.
- When people define the index, they need to configure the partitioned as true/false, default to false to support global search
- If an index is partitioned:
- Do we have to specify the index name in the query? Please note the mongo query system will automatically search for the best matched global index for a global query.
  - Yes
- Does partitioned query support the same syntax of the global mongo query system?
  - Yes
Insert proper id field in pattern <partition_name>: id
- Problem: For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.
- Proposal: If _id is provided then we honor the entire string. If _id is missing but partition is provided, we generate a uuid and append it after the partition. If neither of them provided the request will be rejected by the cloudant service.

Will update more finding and PoC asap.

Some notes:

I was thinking of having a config to enable modelName as the default partition key, but after reading topic "a good partition key", I realized it won't be a good practice. Like new orders are better to have the value of userId as the their partition key instead of order. Therefore we should only honor the partition key from the request itself.

jannyHou commented 5 years ago

A summary of partitioned index and query:

Partitioned Search

use case 1

partitioned index defined partitionedFind()

✔️ correct result is returned ✔️ index is used

use case 2

partitioned index defined partitionedFind() sort with a field in index

✔️ correct result is returned ✔️ index is used

use case 3

partitioned index defined partitionedFind() sort with a field NOT in index

✖️ error - index not found

use case 4

partitioned index defined partitionedFind() sort with a field NOT in index

✖️ error - index not found

use case 5

partitioned index defined partitionedFind()

✔️ correct result is returned ✔️ index is used

use case 6

global index defined partitionedFind()

✔️ correct result is returned ✖️ no matched index to optimize the performance

Global Search

use case 1

partitioned index defined find()

✔️ correct result is returned ✖️ no matched index to optimize the performance

use case 2

global index defined find()

✔️ correct result is returned ✔️ index is used

Advanced Query

use case 1

partitioned index defined partitionedFind() advanced query

✔️ $regex ✔️ nested property (e.g. address.city) ✔️ array search (e.g. $elemMatch)

jannyHou commented 5 years ago

Proposal for Partitioned Support

Design thought see https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-536586133 and https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-537576508

Example: Order model attached to Cloudant datasource.

Update driver to https://github.com/cloudant/nodejs-cloudant
Find a new db (support partition)for test
- The docker one is retired and don't support partition
- Check the couchdb2 image as well, and replace it with a new one if necessary. (since 95% of the code are written in loopback-connector-couchdb2)
The global search and index stay the same
- By default Order.find() still invokes db.find() as it is.
- By default an index is created with {partitioned: false} as global.
- When people define the index in the model definition, they can configure the partitioned as true/false.
(UX)Optimize query with partition
- We need to think of a signature to distinguish a partitioned query from a global one
- Proposal 1: passes in an option in the find method, like Order.find({<query_object>}, {partitionKey: '<name_of_key>'})
- In Cloudant.prototype.find(), if options.partitionKey is provided, then invoke db.partitionedFind('<name_of_key>', query)
- pro: it won't be mixed with other query properties.
- con: request won't have options, you can only call function with options from code.
- Proposal 2: provide the partitionKey in query, like Order.find({partitionKey: 'akey', somefield: 'somevalue'})
- In Cloudant.prototype.find(), if partitionKey is detected in the query, then invoke db.partitionedFind('<name_of_key>', queryWithPartitionKeyExcluded)
- pro: you can provide the partitionKey in the request
- con: it's mixed with other properties, if the document just have a field called partitionKey, then the query will be broken...Maybe we can name it as lb_partition_key as a preserved field name as a solution?
- Proposal 3: We just define a new connector method that wraps db.partitionedFind()
- Just like how we wrap viewDocs and support the view search.
- pro: the simplest solution
- con: no model api to use, you have to call it from connector(personally I wouldn't recommand this proposal)
(UX)Insert when _id is missing
- Problem: For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.
- Proposal:
  - If _id is provided then we honor the entire string.
  - If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.
  - If neither of them provided the request will be rejected by the cloudant service.
Document how to call other partition APIs from connector
- There are a brunch of other parition APIs like db.partitionInfo(), db.partitioned.List(), db.partitionedSearch(). People can just get the driver instance from the connector instance and execute these native driver APIs.

jdmintz commented 5 years ago

Awesome progress here. CouchDB 2.x doesn't support Partition Querying yet. It will be available in 3.0 (being wrapped up soon-ish according to the email group)

Can confirm that the Cloudant Developer edition was retired in favor of CouchDB containers.

bajtos commented 5 years ago

Re proposal 2:

con: it's mixed with other properties, if the document just have a field called partitionKey, then the query will be broken...Maybe we can name it as lb_partition_key as a preserved field name as a solution?

Please note the filter argument of find method has properties where, include, skip, limit, etc. Model properties are nested under the where field. There is no need to worry about partitionKey clashing with model properties.

I don't know anything about Partition Queries in Cloudant/CouchDB. Purely from the LoopBack server & client perspective, I like the proposal 2 most.

Are there any security implications to be aware of? Can the partitionKey property be exploited by a malicious client?

For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.

In LB3 days, we have offline-sync feature, where data is created on the client first (including an autogenerated uuid/v4 value, and then synced with the server. Can we use this feature for Cloudant/CouchDB too?

If _id is provided then we honor the entire string.

If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.

If neither of them provided the request will be rejected by the cloudant service.

Sounds good to me. Personally, I'd ask developers to configure the id property as follows:

{ 
  type: 'string',
  id: true, 
  defaultFn: 'uuidv4'
}

In my limited understanding of the problem domain, this may be all that's needed to make things work with Cloudant:

If the client provides id, then we honor the value
If the client did not provide it, then we generate a unique one (irrespectively of partitioning setup)

jannyHou commented 5 years ago

@bajtos Thank you for the detailed review and giving feedback!

Please note the filter argument of find method has properties where, include, skip, limit, etc. Model properties are nested under the where field. There is no need to worry about partitionKey clashing with model properties.

Good point! If unknown filter properties are not removed from the request(IIRC they aren't) this will definitely be a decisive reason to choose proposal 2.

Are there any security implications to be aware of? Can the partitionKey property be exploited by a malicious client?

My understanding is, partitionKey is similar to the path parameter in an url, so it's ok to make it public.

E.g. "GET /users/{id}/orders" VS "get all the orders with partitionKey equals to someUserId" I can double check this.

I'd ask developers to configure the id property as follows:
{ 
type: 'string',
id: true, 
defaultFn: 'uuidv4'
}

Sound good 👍

If the client did not provide it, then we generate a unique one (irrespectively of partitioning setup)

The pattern for _id in a partitioned db is partitionKey: id, uuidv4 can only generate id. And partitioned database does NOT allow inserting a document without the partitionKey as prefix, that's why user will have to provide at least the partition key part, or the full _id.

jannyHou commented 5 years ago

Follow up stories created:

emonddr commented 5 years ago

Design thought see #214 (comment) and #214 (comment)

same link both times.

emonddr commented 5 years ago

In Cloudant.prototype.find(), if partitionKey is detected in the query, then invoke db.partitionedFind('', queryWithPartitionKeyExcluded)

if partition key is provided, shouldn't it be queryWithPartitionKeyIncluded ?

emonddr commented 5 years ago

While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo

For a partitioned database, isn't the partition key used as the id value?

If so, I am confused about this statement:

If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.

jannyHou commented 5 years ago

@emonddr Thank you for the reiview!

Design thought see #214 (comment) and #214 (comment) same link both times.

If you click on the links they have different anchors, I typed the full address, the names are auto generated(converted) by github.

if partition key is provided, shouldn't it be queryWithPartitionKeyIncluded ?

Ah let me explain, db.partitionedFind() takes in two arguments, the first is the partitionKey as a string, the second is the rest of the query(that why I put it as query with partition key EXCLUDED)

For a partitioned database, isn't the partition key used as the id value?

Not really, the pattern of an _id in the partitioned db is <partition_name>: id, it consists of two parts:

partition name, or say partition key
the unique id

Does the id related proposal make more sense now?

jannyHou commented 5 years ago

Had a chat with @raymondfeng , here is a summary:

model definition

Ideally, a partition key would map to a model property, like model User has countryCode as its partition key. This would be consistent with the behavior in loopback-connector-cassandra. E.g.

customers = db.define('customers', {
  userId: {type: Number, id: true},
  countryCode: {type: String, isPartitionKey: true},
  name: String,
  zipCode: Number,
});

CRUD APIs

create(user)

compose _id as <countryCode>: userId then create document by calling db.insert()

find(user, options)

first try get the partition key from options like {partitonKey: 'US'}
if not found, try get it from query like {where: {name: 'somename', countryCode: 'US'}}
then invoke db.partitionedFind('US', {selector: {name: 'somename'}})

findById

connector parses the provided id, if it's in pattern partitonKey: id, then invoke db.partitionedFind()

use case 1 findById('fdb2ff86-78c1-47bb-bc63-f239db06c578')

invokes db.find()
- partitioned db will reject the request
- still a valid case for non-partitioned db

use case 2 findById('USA: fdb2ff86-78c1-47bb-bc63-f239db06c578')

invokes partitionedFind()

Follow-up stories

Stage one

(WIP) #215
216

Stay the same

Stage two

Developers still need to provide a full _id with partition key and uuid when create a model instance. We support read partition key from query options And optimize findById

220 read partition key from options
217 read partitionKey from model property

Stage three

221 optimize findById
218 compose id

jannyHou commented 5 years ago

See epic https://github.com/strongloop/loopback-connector-cloudant/issues/219 Closing this spike

dhmlau commented 5 years ago

MVP - epic #219 Post MVP (enhancement) - epic #222

loopbackio / loopback-connector-cloudant

[Spike] Add Support for Partitioned Databases #214

Acceptance Criteria

Reference

A summary of partitioned index and query:

Partitioned Search

use case 1

use case 2

use case 3

use case 4

use case 5

use case 6

Global Search

use case 1

use case 2

Advanced Query

use case 1

Proposal for Partitioned Support

model definition

CRUD APIs

create(user)

find(user, options)

findById

Follow-up stories

Stage one

216

Stage two

220 read partition key from `options`

217 read partitionKey from model property

Stage three

221 optimize `findById`

218 compose id

loopbackio / loopback-connector-cloudant

[Spike] Add Support for Partitioned Databases #214

Acceptance Criteria

Reference

A summary of partitioned index and query:

Partitioned Search

use case 1

use case 2

use case 3

use case 4

use case 5

use case 6

Global Search

use case 1

use case 2

Advanced Query

use case 1

Proposal for Partitioned Support

model definition

CRUD APIs

create(user)

find(user, options)

findById

Follow-up stories

Stage one

216

Stage two

220 read partition key from options

217 read partitionKey from model property

Stage three

221 optimize findById

218 compose id

220 read partition key from `options`

221 optimize `findById`