loopbackio / loopback-connector-cloudant

LoopBack Connector for IBM Cloudant
Other
19 stars 21 forks source link

[Spike] Add Support for Partitioned Databases #214

Closed jdmintz closed 5 years ago

jdmintz commented 5 years ago

Both Cloudant and CouchDB support partitioned databases which make querying less expensive - computationally in CouchDB, monetarily on Cloudant.

https://cloud.ibm.com/docs/services/Cloudant/guides?topic=cloudant-database-partitioning

Please consider adding this support for Loopback users.

(Updated by JannyHou):

Design thought see https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-536586133 and reference in https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-537576508

Proposal and follow up stories see https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-537603423, we can break down the implementation into 6 stories accordingly.

Acceptance Criteria

Reference

dhmlau commented 5 years ago

cc @raymondfeng @jannyHou

raymondfeng commented 5 years ago

See https://github.com/cloudant/nodejs-cloudant/commit/d4003759b006fecb4f361527a5495ce347285ed1

@jannyHou I think we need to switch to @cloudant/cloudant as the driver now.

dhmlau commented 5 years ago

Gathering information from @jannyHou and @raymondfeng, we'll need to have further investigation to see what's the work involved. It might involve one or more of the following:

I'd like to make this task as a spike first.

agnes512 commented 5 years ago

Discussion from the estimation meeting:

  1. check partition key settings
  2. investigate how to pass partition key to drivers
  3. notice that couchDB connector has the most implementations but CloudantDB connector only has a few
jannyHou commented 5 years ago

Did a quick research, we can do the following to support partitioned database:

Will update more finding and PoC asap.

Some notes:

jannyHou commented 5 years ago

A summary of partitioned index and query:

Partitioned Search

use case 1

partitioned index defined partitionedFind()

✔️ correct result is returned ✔️ index is used

use case 2

partitioned index defined partitionedFind() sort with a field in index

✔️ correct result is returned ✔️ index is used

use case 3

partitioned index defined partitionedFind() sort with a field NOT in index

✖️ error - index not found

use case 4

partitioned index defined partitionedFind() sort with a field NOT in index

✖️ error - index not found

use case 5

partitioned index defined partitionedFind()

✔️ correct result is returned ✔️ index is used

use case 6

global index defined partitionedFind()

✔️ correct result is returned ✖️ no matched index to optimize the performance

Global Search

use case 1

partitioned index defined find()

✔️ correct result is returned ✖️ no matched index to optimize the performance

use case 2

global index defined find()

✔️ correct result is returned ✔️ index is used

Advanced Query

use case 1

partitioned index defined partitionedFind() advanced query

✔️ $regex ✔️ nested property (e.g. address.city) ✔️ array search (e.g. $elemMatch)

jannyHou commented 5 years ago

Proposal for Partitioned Support

Design thought see https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-536586133 and https://github.com/strongloop/loopback-connector-cloudant/issues/214#issuecomment-537576508

Example: Order model attached to Cloudant datasource.

jdmintz commented 5 years ago

Awesome progress here. CouchDB 2.x doesn't support Partition Querying yet. It will be available in 3.0 (being wrapped up soon-ish according to the email group)

Can confirm that the Cloudant Developer edition was retired in favor of CouchDB containers.

bajtos commented 5 years ago

Re proposal 2:

con: it's mixed with other properties, if the document just have a field called partitionKey, then the query will be broken...Maybe we can name it as lb_partition_key as a preserved field name as a solution?

Please note the filter argument of find method has properties where, include, skip, limit, etc. Model properties are nested under the where field. There is no need to worry about partitionKey clashing with model properties.

I don't know anything about Partition Queries in Cloudant/CouchDB. Purely from the LoopBack server & client perspective, I like the proposal 2 most.

Are there any security implications to be aware of? Can the partitionKey property be exploited by a malicious client?

For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.

In LB3 days, we have offline-sync feature, where data is created on the client first (including an autogenerated uuid/v4 value, and then synced with the server. Can we use this feature for Cloudant/CouchDB too?

  • If _id is provided then we honor the entire string.
  • If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.
  • If neither of them provided the request will be rejected by the cloudant service.

Sounds good to me. Personally, I'd ask developers to configure the id property as follows:

{ 
  type: 'string',
  id: true, 
  defaultFn: 'uuidv4'
}

In my limited understanding of the problem domain, this may be all that's needed to make things work with Cloudant:

jannyHou commented 5 years ago

@bajtos Thank you for the detailed review and giving feedback!

Please note the filter argument of find method has properties where, include, skip, limit, etc. Model properties are nested under the where field. There is no need to worry about partitionKey clashing with model properties.

Good point! If unknown filter properties are not removed from the request(IIRC they aren't) this will definitely be a decisive reason to choose proposal 2.

Are there any security implications to be aware of? Can the partitionKey property be exploited by a malicious client?

My understanding is, partitionKey is similar to the path parameter in an url, so it's ok to make it public.

E.g. "GET /users/{id}/orders" VS "get all the orders with partitionKey equals to someUserId" I can double check this.

I'd ask developers to configure the id property as follows:

{ 
type: 'string',
id: true, 
defaultFn: 'uuidv4'
}

Sound good 👍

If the client did not provide it, then we generate a unique one (irrespectively of partitioning setup)

The pattern for _id in a partitioned db is partitionKey: id, uuidv4 can only generate id. And partitioned database does NOT allow inserting a document without the partitionKey as prefix, that's why user will have to provide at least the partition key part, or the full _id.

jannyHou commented 5 years ago

Follow up stories created:

emonddr commented 5 years ago

Design thought see #214 (comment) and #214 (comment)

same link both times.

emonddr commented 5 years ago

In Cloudant.prototype.find(), if partitionKey is detected in the query, then invoke db.partitionedFind('', queryWithPartitionKeyExcluded)

if partition key is provided, shouldn't it be queryWithPartitionKeyIncluded ?

emonddr commented 5 years ago

While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo

For a partitioned database, isn't the partition key used as the id value?

If so, I am confused about this statement:

If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.

jannyHou commented 5 years ago

@emonddr Thank you for the reiview!

Design thought see #214 (comment) and #214 (comment) same link both times.

If you click on the links they have different anchors, I typed the full address, the names are auto generated(converted) by github.

if partition key is provided, shouldn't it be queryWithPartitionKeyIncluded ?

Ah let me explain, db.partitionedFind() takes in two arguments, the first is the partitionKey as a string, the second is the rest of the query(that why I put it as query with partition key EXCLUDED)

For a partitioned database, isn't the partition key used as the id value?

Not really, the pattern of an _id in the partitioned db is <partition_name>: id, it consists of two parts:

Does the id related proposal make more sense now?

jannyHou commented 5 years ago

Had a chat with @raymondfeng , here is a summary:

model definition

Ideally, a partition key would map to a model property, like model User has countryCode as its partition key. This would be consistent with the behavior in loopback-connector-cassandra. E.g.

customers = db.define('customers', {
  userId: {type: Number, id: true},
  countryCode: {type: String, isPartitionKey: true},
  name: String,
  zipCode: Number,
});

CRUD APIs

create(user)

compose _id as <countryCode>: userId then create document by calling db.insert()

find(user, options)

findById

connector parses the provided id, if it's in pattern partitonKey: id, then invoke db.partitionedFind()

use case 1 findById('fdb2ff86-78c1-47bb-bc63-f239db06c578')

use case 2 findById('USA: fdb2ff86-78c1-47bb-bc63-f239db06c578')

Follow-up stories

Stage one

Stay the same

Stage two

Developers still need to provide a full _id with partition key and uuid when create a model instance. We support read partition key from query options And optimize findById

Stage three

jannyHou commented 5 years ago

See epic https://github.com/strongloop/loopback-connector-cloudant/issues/219 Closing this spike

dhmlau commented 5 years ago

MVP - epic #219 Post MVP (enhancement) - epic #222