rokwire / core-building-block

Building block which handles core functions for the Rokwire platform - users, accounts, profiles, organizations, authentication and authorization.
Apache License 2.0

Storage interface standardization #74

Open shurwit opened 3 years ago

shurwit commented 3 years ago

Hi @petyos @stefanvit,

I have noticed that in the past we have had many issues with standardizing our storage interfaces. Below are a few issues that I would like to resolve for the Core BB.

Object IDs

Edit: After further discussion, we have decided to continue using UUID string _ids for now.

One issue is that we have used the _id field in MongoDB inconsistently. I think our policy should be to use the automatically assigned MongoDB ObjectIds to take advantage of the optimizations they provide. I also think it would make sense to use this as the globally unique random ID for objects that need one. To accomplish this without tightly coupling our data models to Mongo, we can define our own ObjectId type with a custom BSON marshaller. Here is an example of what that could look like:

package model

import (
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/bson/bsontype"
    "go.mongodb.org/mongo-driver/bson/primitive"
)

type Item struct {
  ID   ObjectId `json:"id" bson:"_id,omitempty" validate:"required"`
  Name string   `json:"name" bson:"name"`
  Type string   `json:"type" bson:"type"`
  // ...
}

// ObjectId should be used as the primary ID for all models
type ObjectId string

// MarshalBSONValue converts the ObjectId to a Mongo ObjectID when writing to the database
func (i ObjectId) MarshalBSONValue() (bsontype.Type, []byte, error) {
    p, err := i.ToBSON()
    if err != nil {
        return bsontype.Null, nil, err
    }

    return bson.MarshalValue(p)
}

// ToBSON parses the hex string into a primitive.ObjectID
func (i ObjectId) ToBSON() (primitive.ObjectID, error) {
    return primitive.ObjectIDFromHex(string(i))
}

// ObjectIdFromBSON converts a primitive.ObjectID back to the model type
// (Hex() is used rather than String() so the value remains a plain hex string)
func ObjectIdFromBSON(id primitive.ObjectID) ObjectId {
    return ObjectId(id.Hex())
}

For objects that don't need a single globally unique random ID, this field will still be generated in MongoDB but can be ignored in the data model.

Storage Operations

Another issue is that there have been many different functions in the storage adapter that accomplish the same thing with a different name and/or implementation. I would like to try to keep these functions consistent going forward so it is easier to understand what each one is doing at a glance and keep an efficient standard implementation.

It seems to me that we really have 6 standard storage operations:

Load: no params, returns list of all items, error
Find: search params, returns the matching item(s), error
Insert: new item, returns error
Update: item (or id plus the fields to update), returns error (or the updated item, error)
Save: item, returns error (inserts the item or replaces it if it already exists)
Delete: id, returns error

Clearly in some cases the parameter/return value template I outlined above for each operation will need to be modified, but I think this should be the standard unless there is a reason to do something different.

The biggest issues come up around the Save operation since there are many ways this can be accomplished. In some cases, one collection may just need Find, Save, and Delete, while in others it makes more sense to enforce separate Inserts and Updates with no Save. Sometimes we may want to provide Insert, Update, and Save for the same collection for use in different situations.

I think it is important that we keep the differences between these operations in mind and adopt the naming convention in this list.

To provide a more concrete explanation, here is a sample implementation for each of these operations:

func (sa *Adapter) LoadItems() ([]model.Item, error) {
    filter := bson.M{}
    var items []model.Item
    err := sa.db.items.Find(filter, &items, nil)
    if err != nil {
        return nil, err
    }

    return items, nil
}

func (sa *Adapter) FindItem(id string) (*model.Item, error) {
    filter := bson.M{"_id": id}
    var item *model.Item
    err := sa.db.items.FindOne(filter, &item, nil)
    if err != nil {
        return nil, err
    }

    return item, nil
}

func (sa *Adapter) InsertItem(item *model.Item) error {
    _, err := sa.db.items.InsertOne(item)
    if err != nil {
        return fmt.Errorf("error inserting item: %v", err)
    }

    return nil
}

// Full update example
func (sa *Adapter) UpdateItem(item *model.Item) error {
    filter := bson.M{"_id": item.ID}
    err := sa.db.items.ReplaceOne(filter, item, nil)
    if err != nil {
        return fmt.Errorf("error updating item with id %v: %v", item.ID, err)
    }

    return nil
}
// OR partial update example
// Note: We should expose a collection interface for "FindOneAndUpdate"
func (sa *Adapter) UpdateItem(id string, name *string, itemType *string) (*model.Item, error) {
    filter := bson.M{"_id": id}

    updates := bson.M{}
    if name != nil {
        updates["name"] = *name
    }
    if itemType != nil {
        updates["type"] = *itemType
    }

    // The "ReturnDocument" option must be set to "after" to get the updated document
    opts := options.FindOneAndUpdate().SetReturnDocument(options.After)

    var item model.Item
    err := sa.db.items.FindOneAndUpdate(filter, bson.M{"$set": updates}, &item, opts)
    if err != nil {
        return nil, fmt.Errorf("error updating item with id %v: %v", id, err)
    }

    return &item, nil
}

func (sa *Adapter) SaveItem(item *model.Item) error {
    filter := bson.M{"_id": item.ID}
    opts := options.Replace().SetUpsert(true)
    err := sa.db.items.ReplaceOne(filter, item, opts)
    if err != nil {
        return fmt.Errorf("error saving item with id %s: %v", item.ID, err)
    }

    return nil
}

func (sa *Adapter) DeleteItem(id string) error {
    filter := bson.M{"_id": id}
    result, err := sa.db.items.DeleteOne(filter, nil)
    if err != nil {
        return fmt.Errorf("error deleting item with id %v: %v", id, err)
    }
    if result == nil {
        return fmt.Errorf("nil result deleting item with id %v", id)
    }
    deletedCount := result.DeletedCount
    if deletedCount == 0 {
        return fmt.Errorf("no item found to delete with id %v", id)
    }

    return nil
}
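
To summarize the convention at the interface level, the storage interface for a collection like items could then look roughly like this (just a sketch; which subset of these operations a given collection actually exposes will vary):

// illustrative only - the exact parameters/return values depend on the collection
type ItemStorage interface {
    LoadItems() ([]model.Item, error)        // Load: all items in the collection
    FindItem(id string) (*model.Item, error) // Find: item(s) matching search params
    InsertItem(item *model.Item) error       // Insert: create a new item only
    UpdateItem(item *model.Item) error       // Update: modify an existing item only
    SaveItem(item *model.Item) error         // Save: insert or replace (upsert)
    DeleteItem(id string) error              // Delete: remove by id
}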

Data Models

Edit: After further discussion, we have decided to simply reuse the core models in the storage adapter, but keep separate API models for all data and separate storage models for complex data (ie. data with relations across database collections).

Finally, I have also noticed some inconsistencies around the use of data models. It seems like in some cases we have created up to 3 duplicate data models for the same data: one for requests, one in the model package, and one for storage. While I understand it is sometimes beneficial to have additional parameters that are not exposed throughout the entire service, I believe this complicates the flow of data through the system quite a bit, and adds a lot of unnecessary casting overhead.

In general, I believe we should create a single data model stored in the model package with the appropriate JSON and BSON annotations so that it can be used to marshal and unmarshal correctly from a single model. This should eliminate the need for any manual casting between MongoDB, internal service models, and request/response JSON. If a case arises that we absolutely need a different model to handle request data or storage data then we can have exceptions to this rule, but I think we should try to stick to this principle. Please let me know if you see any issues with this, or if there is a reason that we would need separate models that I am overlooking.

Please let me know if you have any questions or suggestions about any of these points. Thanks!

petyos commented 3 years ago

Hi @shurwit ,

Thanks for bringing it up.

Object IDs

As far as I know, in Mongo there should not be any difference in optimization between these two representations:

Using Mongo ObjectID type

{
    "_id": {
        "$oid": "60ee5765c93e7d485f670231"
    },
    ....
}

and using string

{
    "_id": "0a2eff20-e2cd-11eb-af68-60f81db5ecc0",
   ...
}

In Mongo, if you set the "_id" field, then it is automatically used as the identifier.

The reason we started using a simple string instead of the Mongo type was that we had to convert it to a non-Mongo-specific type which can be used in the model.

If we add our own model.ObjectID type in the model package as you suggest, then it will add the mongo driver library as a dependency of the model package, since it uses primitive.ObjectID. The model package should not know anything about Mongo. It is Mongo today; tomorrow it could be MySQL, Oracle, etc. Ok, we could add an ObjectID type in our model and do the conversion in the Mongo adapter, keeping the model unaware of Mongo, but I think here the question would be: is it worth it?

So in my opinion, if there is no change in performance between

{
    "_id": {
        "$oid": "60ee5765c93e7d485f670231"
    },
    ....
}

and

{
    "_id": "0a2eff20-e2cd-11eb-af68-60f81db5ecc0",
   ...
}

then it is not worth it, but if there is, then it is. What is your opinion of this approach?

Another issue we should check is how this Mongo ObjectID could be referenced from an inner document. In general we have a data model with many relations, which would match a SQL database very well but requires some effort to implement in the NoSQL world. This is why I am not sure whether this Mongo ObjectID type would give us value or just more noise. Not sure yet! So maybe the answer will come from whether the performance is the same for the Mongo ObjectID type and a string type in the "_id" field.

petyos commented 3 years ago

Storage Operations

I do like your suggestion for defining conventions for the operations we apply to the storage.

I think the defined operations are good: Find, Insert, Update, Save and Delete.

I would just add that things are a bit more complex. Here is what I mean: in your suggestion you rely on the assumption that every model entity is stored in one collection. For some cases this is true, but the model we have can include more complex cases because of the relations.

So let's take the Device entity as an example. This is just an example; I am not saying that this entity has to be stored like this, but I want to explain what I mean.

Let's take the UpdateDevice operation.

In the simple case we could have a devices collection where all entities are stored:

[
   {
      "_id":"1234",
      "name":"device name"
   },
   {
      "_id":"5678",
      "name":"device name2"
   }
]

Ok, but in the Mongo world we could store a copy of the device entity in the users collection, for example (every user has devices). I cannot find the exact tutorial I read about it, but you could look here too - https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1. So in the users collection we could have:

[
   {
      "_id":"1010",
      "devices":[
         {
            "_id":"1234",
            "name":"device name"
         }
      ]
   }
]

In this way, when we read the User entity from the storage, we need to make just one query, which gives us the devices for the user too; otherwise we would need to make two queries. So when we want to update the device entity, we need to do it in two collections (users and devices) within one transaction. Basically we need to decide which operations in the system should be quick and called often, and which are acceptable to be slow and called rarely. In this case FindUser would be called many times in the system but UpdateDevice much more rarely. I hope I was able to explain it with this example.

So we can have the conventions you suggest; just the implementations for some of the functions would be more complex (as with the UpdateDevice function sketched below), but as an interface the Storage could still provide what you suggest for the model entities.
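
Here is a very rough sketch of what such an UpdateDevice could look like, just to illustrate the idea. It assumes direct access to the mongo driver client and to users/devices collection handles (sa.db.dbClient, sa.db.devices, sa.db.users are illustrative names), and that "context", "go.mongodb.org/mongo-driver/bson" and "go.mongodb.org/mongo-driver/mongo" are imported:

func (sa *Adapter) UpdateDevice(device *model.Device) error {
    ctx := context.Background()

    // run both writes in one transaction so the two copies stay consistent
    return sa.db.dbClient.UseSession(ctx, func(sctx mongo.SessionContext) error {
        if err := sctx.StartTransaction(); err != nil {
            return err
        }

        // 1. update the canonical copy in the devices collection
        if _, err := sa.db.devices.ReplaceOne(sctx, bson.M{"_id": device.ID}, device); err != nil {
            sctx.AbortTransaction(sctx)
            return err
        }

        // 2. update the embedded copy inside every user which has this device
        usersFilter := bson.M{"devices._id": device.ID}
        update := bson.M{"$set": bson.M{"devices.$": device}}
        if _, err := sa.db.users.UpdateMany(sctx, usersFilter, update); err != nil {
            sctx.AbortTransaction(sctx)
            return err
        }

        return sctx.CommitTransaction(sctx)
    })
}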

@shurwit , let me know you you have any comments/questions/considerations.

I will comment the Data Models section soon.

petyos commented 3 years ago

Hi again @shurwit ,

Data Models

Yes, our data model should be in one place, and this is in the core module.

The core module communicates with the Storage interface to store, read and update the data entities. The implementation of the Storage is hidden from the core module. This means that one day we could take out Mongo and use another storage mechanism - MySQL, Oracle, etc. If the data model is simple (without many relations), then we could use the core model struct directly in the Mongo storage adapter, but we have a data model with many relations, which means that in some (most) cases the structs in the core module will not work for Mongo - remember the example I gave for the Device and User entities.

I am okay with setting bson annotations on the simple entities and using them directly in the Mongo storage adapter, and creating a new struct in the storage adapter where we have a complex case. What do you think about it?

On the other hand, we have driver adapters (the web adapter for example) which call the core module and pass prepared data in a format suitable for core. Keep in mind that we could open APIs for many different teams (not just APIs developed by us). This means that we need to sync APIs with them. I have had cases in my experience where agreeing on an API between the client and the backend (me) has been a challenge. For example, the client team insists on a specific format, a different field name, or not giving the whole information for the entity (the user for example). This should not trigger changes in our core model. In this way, the web adapter prepares the data before calling core and prepares the format of the data when returning the response.

For example, in the health BB we have a providers API which gives the providers only these fields of the user entity in the response:

{
  "consent": true,
  "consent_vaccine": true,
  "public_key": "string"
}

We give them only a part of the user entity.
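
In the web adapter this is just a separate response struct exposing only those fields, something like (the struct and field names here are illustrative):

// response model in the web adapter - only part of the core user entity is exposed
type providerUserResponse struct {
    Consent        bool   `json:"consent"`
    ConsentVaccine bool   `json:"consent_vaccine"`
    PublicKey      string `json:"public_key"`
}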

Also keep in mind that one day we could take out the RESTful web adapter (where all the OpenAPI 3 documentation is and where we have the generated structs for the RESTful services) and use a websocket adapter, for example.

Please let me know if you have any comments/questions/considerations!

shurwit commented 3 years ago

Hi @shurwit ,

Thanks for bringing it up.

Object IDs

As I know in Mongo It should not have any differences in the optimizations for these two entities:

...

Hi @petyos, thanks for the feedback. The ObjectId type in MongoDB does have optimization properties that normal strings lack (faster sorting/indexing), and ObjectIds are smaller (12 bytes). Another benefit of using them would be to keep everything consistent within the database, since these ids are generated by default.

On the other hand, you are correct that using these will introduce additional complexity in the process of storing/retrieving documents (and embedded/referenced documents), as these IDs will have to be retrieved from the database and converted to strings when necessary. I also agree that we should be keeping all MongoDB specific code out of the model package.

With all of this in mind, I think for now you are right that it would be simpler to stick with the UUID string approach. At some point we may want to reevaluate the performance tradeoffs, but I do think that at this point we should be prioritizing simplicity and portability.
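
Just to make the decision concrete, the UUID string approach simply means we assign the _id ourselves before inserting, for example (a sketch assuming the github.com/google/uuid package is used):

// assign a random UUID string as the _id before inserting
func (sa *Adapter) InsertItem(item *model.Item) error {
    if item.ID == "" {
        item.ID = uuid.NewString() // from github.com/google/uuid
    }

    _, err := sa.db.items.InsertOne(item)
    if err != nil {
        return fmt.Errorf("error inserting item: %v", err)
    }

    return nil
}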

shurwit commented 3 years ago

Storage Operations

I do like your suggestion for defining conventions of the operations which we could apply to the storage.

I think the defined operations are good: Find, Insert, Update, Save and Delete.

I would just add that the things are bit more complex. Here is what I mean: ...

Thanks @petyos, yes, my intention was to define a standard naming convention for the storage operation types. The function definitions I provided were intended to give an example implementation of each operation for the simplest (and maybe most common) models, to illustrate the purpose of each named operation. Some situations will definitely require more complex or different implementations than the examples I provided (e.g. subdocuments, batch operations, etc.), however I believe we should still use the same naming convention whenever possible.

shurwit commented 3 years ago

Hi again @shurwit ,

Data Models

Yes, our data model should be at one place and this is in the core module.

...

Thanks @petyos, these are all very good points. My hope was that we could find a way to simplify the process and reduce the number of models/conversions.

I am ok for the simple entities to set bson annotation and to use them directly in the Mongo storage adapter and to create a new one in the storage adapter where we have a complex case.

This is what I had in mind. In many cases, the data models are fairly simple and will be stored in a single collection exactly (or almost exactly) as they are defined in the model package. In these cases, I do not think it makes sense to define a separate model in the storage package simply to avoid adding the BSON annotations in the model package.

On the other hand, when we have complex models with many components stored across multiple collections, it does not make sense to expose this information outside of the storage package, so we can define the combined core model in the model package for convenient use throughout the application, and have separate definitions for the components of this model in the storage package as you described.

Keep in mind that we could open APIs for many different teams (not just APIs developed by us). This means that we need to sync APIs with them. I have had cases in my experience where agreeing on an API between the client and the backend (me) has been a challenge. For example, the client team insists on a specific format, a different field name, or not giving the whole information for the entity (the user for example). This should not trigger changes in our core model. In this way, the web adapter prepares the data before calling core and prepares the format of the data when returning the response.

This is also a very good point. I was thinking that we could do something similar to what we described above for the Mongo adapter (ie. include JSON annotations in the core models to unmarshal API requests when possible and define a different model when necessary). However, since I wrote this issue we have added the complexity of the OpenAPI documentation tools. After reviewing #79, I believe this approach would introduce additional complexity while reducing flexibility in the APIs, as you mentioned. With this in mind, I believe we should simply use the OpenAPI models to unmarshal API requests, then convert them to internal models as you suggested in that PR.

My only concern with both of these decisions (separate models for complex storage data and all API data) is that converting large models has gotten very messy in the past, and it is also easy to miss fields when doing these conversions. Adding new elements to existing models also became quite difficult, as we had to search for every instance where these conversions occurred and add the new element. It was very easy to miss one which would cause issues downstream without any error or warning since the models are all defined in place and missing fields are filled with default values.

I think we need to find a good standard way to handle conversion between the core models and the API and storage models. An example of this could be a function in the storage package which takes a set of related storage models and constructs a core model from them. This will allow us to avoid having to repeatedly redefine very large core models across multiple storage functions. It will also make changing these models much simpler, as we would only need to update both models and the conversion in this single function. Let me know if you have any suggestions on how to further simplify this conversion process, or any thoughts about the issues I mentioned. Thanks!
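
As a rough sketch of what I mean (reusing the User/Device entities from your earlier example; the structs and field names are purely illustrative):

// storage package: separate structs matching how the data is actually stored
type user struct {
    ID      string   `bson:"_id"`
    Name    string   `bson:"name"`
    Devices []device `bson:"devices"`
}

type device struct {
    ID   string `bson:"_id"`
    Name string `bson:"name"`
}

// userFromStorage is the single place where the combined core model is built
// from its storage components, so every Find*/Load* function reuses it
func userFromStorage(u user) model.User {
    devices := make([]model.Device, len(u.Devices))
    for i, d := range u.Devices {
        devices[i] = model.Device{ID: d.ID, Name: d.Name}
    }

    return model.User{ID: u.ID, Name: u.Name, Devices: devices}
}

Then when a field is added to the Device entity, we only have to update the two models and this one conversion function.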

shurwit commented 3 years ago

I have updated the issue description to better reflect these decisions and improve the example storage implementations.

petyos commented 3 years ago

As a conclusion, I am moving this to "In progress": we keep the UUID string _ids for now, we adopt the standard naming convention for the storage operations, and we reuse the core models in the storage adapter for simple data, with separate storage models for complex data and separate API models.

shurwit commented 3 years ago

Hi @petyos @stefanvit @roberlander2 @siqiz7 @jsafoan, I just wanted to mention to all of you that I recently noticed the need for an additional storage operation. Specifically, there are cases where we want to load all items in a collection. The conventions outlined above only account for a Find operation where you provide search parameters and retrieve the associated value(s).

We have also been using the convention that when searching for multiple items in a collection, you can define a Find operation which returns a list (eg. for the items collection, you can have FindItem(id string) (*model.Item, error) and FindItems(ids []string) ([]model.Item, error)).
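
For reference, a FindItems implementation following this convention might look something like this (a sketch only):

func (sa *Adapter) FindItems(ids []string) ([]model.Item, error) {
    filter := bson.M{"_id": bson.M{"$in": ids}}
    var items []model.Item
    err := sa.db.items.Find(filter, &items, nil)
    if err != nil {
        return nil, err
    }

    return items, nil
}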

Neither of these conventions covers a third operation which simply retrieves every item in the collection. For this purpose, I propose we add the Load operation (Load: no params, returns list of all items, error) with the following sample function:

func (sa *Adapter) LoadItems() ([]model.Item, error) {
    filter := bson.M{}
    var items []model.Item
    err := sa.db.items.Find(filter, &items, nil)
    if err != nil {
        return nil, err
    }

    return items, nil
}

I have updated my initial comment to add this new operation, but please feel free to suggest any changes that you think are needed. Thanks!

petyos commented 3 years ago

Hi @petyos @stefanvit @roberlander2 @siqiz7 @jsafoan, I just wanted to mention to all of you that I recently noticed the need for an additional storage operation. Specifically, there are cases where we want to load all items in a collection. The conventions outlined above only account for a Find operation where you provide search parameters and retrieve the associated value(s).

...

Hi @shurwit , it makes sense.