Open jersiovic opened 8 years ago
I think eager loading can be done manually in most cases. For instance, when you do your query and get the fist level of content item ids, you can just create another request that fetches all content items from the second level (the include), then assign them to their parents. It would not help you avoid SELECT N+1, if you just make it easier to not make them. If someone uses Include() it's because they know they would otherwise do a SN+1, so they can just apply what I explained otherwise. If they don't know they should avoid SN+1 then they would not even think about using Include().
If yessql had to handle references, it would need to support identity maps, and also handle the lifetime of the collection, probably replace all objects by their ids meaning custom serialization. Lots of work and potential bugs IMO.
Eager loading can be done manually in every case, but it means Orchard won't be optimal by default unless you program it. I think that the final purpose of having those methods is having the possibility of offering to an user the posibility of setting a field for eagerly loading in a projection query through the admin UI. This will make a big difference in ease of use and performance for regular users.
Related to implementation I don't think we need a complex implementation for us at least a simple one will be enough. Identity map is only needed while the query is solved, we don't need to maintain it at request level or sth like that, so I don't see complexity on Identity map or lifetime of the collection specially because YesSql doesn't track changes so we don't need to be worried of lifetime. Lifetime of the collection should be the same lifetime currently you are using for first level content items.
Last part is the one I've less knowledge: "probably replace all objects by their ids meaning custom serialization.". The key point will be to find an elegant way of doing that leaving serialization as much automatic as we can. Maybe marking property with Navigation object within a field as non serializable using [JsonIgnore] attribute and also adding a custom attribute to this property defining the relation and involved keys. This custom attribute could be provided by YesSql and won't affect to serialization. YesSql can detect it in the type definition of the field in C# to determine how to fulfill the navigation property.
In fact, if we follow the proposal I said of design the fields in a way that they return an exception when you attempt to access them without loading them previously then you are doing Orchard optimal by default. Because if a user defines a projection query and in the view there is an access to a referenced content item through a field that is not eager loaded an exception will be thrown. We will catch that exception and will show a message saying he needs to mark that field to be eager loaded. So, Orchard will be optimal or doesn't work in all the cases. IMO this is a big improvement. Well always we can find someone that after that message adds code for loading explicitly the field in the view forcing the N+1 problem, but I think this is sth that can be avoided with an enough explanatory error message with the recommended solution.
Let's use an example to see all the impact:
[Reference]
Product
and Category
, a product can have many categories, categories are shared among products.Product.Categories
is a list and has the Reference
attribute, so that categories are not serialized in the product document, but as a list of ids: int[]
var p1 = new Product {
Name = "P1",
Categories = new [] { new Category("C1"), new Category("C12") }
}
var p2 = new Product {
Name = "P2",
Categories = new [] { new Category("C2"), new Category("C12") }
}
session.Save(p1);
session.Save(p2);
Serialized like this
{
Id: 1,
Name: "P1",
Categories: [ 2, 3 ]
}
{
Id: 4,
Name: "P2",
Categories: [ 4, 3 ]
}
Categories
is a set of references and thus we also save them. This assume the Json serialize supports this. It's ok with JSON.NET, probably not with others. It also means that the user needs to have called Save
on each reference as the serializer can't do that.Save
individually (another reason to call it, c.f. 1.1).Product
will be serialized anyway.
Category
from the list doesn't delete it. It needs to be deleted explicitly using Delete
. However if another list contains the object, there will be an issue as the item won't be found aymore. I assume this is ok as this is true even without this feature.Product
means that first the product is loaded and deserialized. We also get the ids of the references, and do a second query with the ids. It can't be done in a single query as the int[]
need to be deserialized. So the Include
helper would get all ids from different root aggregates, then create a query, then assign all the reference objects. NB: There would be no way to filter which referenced documents are included, like FirstOrDefault()
, it's all or nothing.
Maybe you just need to reify your the relationship, and save an document which would contain the information between the two, and index it. This way with one query you can get all references of multiple root aggregates, and assign them back. This could be used to filter also, this could also handle aggregation vs. composition. And with dapper we could also load the products and the categories with a single query, then ask the relationship object the referenced elements. The property would just be a marker.
It would also handle when a referenced document is deleted, all its references are automatically delete. It's funny as this is a topic I had an intern work for me on it 10 years ago.
public class ToManyAggregationIndex<ParentT, ChildT> {
public void Add(ParentT, ChildT);
public void AddRange(ParentT, IEnumerable<ChildT>);
public void Remove(ParentT, ChildT);
public void RemoveAll(ParentT);
public void GetChildren(ParentT);
}
// This represents a table named ProductCategories with two fields: ProductId and CategoryId, and
// ProductId is not unique in the table. It's an index so it can also be used for querying/filtering.
public class ProductCategories : ReferenceIndex<Product, Category> {
public Product Product {get; set; }
public IEnumerable<Category> Categories {get; set; }
}
public class ProductCategoriesIndexProvider : ToManyIndexProvider<ProductCategories> {
// ... method describing how to add and remove referenced categories for a product, like
// other indexes do for map/reduce.
}
var myProducts = session.Query<Product>().Include<ProductCategories>(c => c.Where(x => x.Name.StartsWith("C")).FirstOrDefault());
// at that point Categories have been assigned
Hahaha I love this aspect of our profession: We as devs is common we have an obsession on solving a challenging problem. It follows us for many years giving us opportunities along the time of refining our solution in unexpected ways. Sounds good, in fact alternate solution is brilliant!!
If a feature for supporting collections is developed https://github.com/sebastienros/yessql/issues/22 then this feature would be interesting that supports to define relations between documents in different collections. We would need the collection key and the document id stored in the relation to allow that scenario.
Related with the alternate solution you provided, this is how you showed an index for include should be declared:
public class ProductCategories : ReferenceIndex<Product, Category> {
public Product Product {get; set; }
public IEnumerable<Category> Categories {get; set; }
}
What if there are more references to more entities and you need all of them at some point. The optimal way of doing that is to put all the references the same index all, isn't it?
public class ProductAllReferences : ReferenceIndex<Product, Category, PromotionProductLine> {
public Product Product {get; set; }
public IEnumerable<Category> Categories {get; set; }
public IEnumerable<PromotionProductLine> PromotionProductLine{get; set; }
}
Problem is declaring ReferenceIndex using a generic with a variable number of parameters as far as I know it is not posible. So, should we provide an index per each relation?
Another question is how this will work when you want to include nested entities? For example PromotionCategoryLine with its nested Promotion (suppossing they pertain to different aggregates). We would need an extra index like this?
public class PromotionProductLineReferences : ReferenceIndex<PromotionProductLine, Promotion> {
public PromotionProductLine PromotionProductLine {get; set; }
public Promotion Promotion {get; set; }
}
And we will query in this way?
var myProducts = session.Query<Product>().Include<ProductAllReferences>().Include<PromotionProductLineReferences>();
What are other document databases doing in this case?
Siaqodb that is an hibrid of ObjectDb(Graphdb/DocumentDb provides an include method that accepts an array of strings that indicates the complete path of the entities to load.
.Include("Categories","PromotionProductLine.Promotion")
this also is valid .Include("Categories").Include("PromotionProductLine.Promotion")
https://siaqodb.com/docs/Partial-object-loading-and-Eager-loading/#eager-loading
It doesn't allow to filter entities included like you proposed .Include<ProductCategories>(c => c.Where(x => x.Name.StartsWith("C"))
The point is Siaqodb stores together with the entity the position in disk where its nested entities are stored. How it uses that I can try to figure it out but I cannot say sth for sure cause it is not open source.
A pure document db like MongoDb provides database-references https://docs.mongodb.com/manual/reference/database-references/ which rely completly on the client, so the client needs to use an extra query per nested entity. So, for complex graphs where to denormalize is not aceptable I guess performance will be bad in those scenarios.
The case is YesSQL is using a relational db so it would be great it provides similar performance/functionality as a pure document db but with better performance retrieving related entities.
What do you think of going through your initial proposal instead of the alternate one based on ReferenceIndexes? Maybe it could use an include method similar to the one provided by Siaqodb
Well initial proposal doesn't takes profit of relational nature of underlying db, so it is equivalent to document db solution for these scenarios cause it executes extra queries for retrieving nested entities.
I am on skype, can I call you ?
I've just started my holidays and I'm thinking another time about that.
What do you think of using this other approach?
We could have a new kind of indexes called RelationIndexes. We could declare one per Document Type. For example for our sample with Products, Categories, PromotionProductLine, Promotion , we could have ProductRelationIndex, PromotionProductLineRelationIndex, PromotionRelationIndex
Each RelationIndex will have following fields: int Id {get; set;} string Path {get; set;} int RelatedId {get; set;}
When a document is saved neither related objects NOR FOREIGN KEYS won't need to be stored in the document. So, not only properties [Reference] won't be serialized as you pointed previoulsly in this thread, also we won't need to store foreign keys so they won't be serialized within the document and we don't need to worry about them.
For storing/updating/delete relations data in RelationIndexes YesSQL could provide explicit methods for adding/removing relation index records.
For storing data on relation indexes YesSql could offer methods like this:
session.AddRelation<Product>(productId, c=>c.Provider, providerId);
session.AddRelation<Product>(productId, c=>c.Categories, categoryId);
session.AddRelation<PromotionProductLine>(promotionProductLineId, c=>c.Promotion, promotionId);
session.RemoveRelation<Product>(productId, c=>c.Provider, providerId);
YesSQL would provide an Include method based on info stored on relation indexes and on reflection for the type used for the include to get all the nested documents no matter how deep are them in the hierarchy with only two extra sqls queries. One for getting all the Ids and the other query for getting the contents related to those ids.:
var product=session.Query<Product>().Include<Product>(c => c.Provider).List();
var product=session.Query<Product>().Include<Product>(c => c.Categories).List();
var product=session.Query<Product>().Include<Product>(c => c.PromotionProductLines.Select(pl=>pl.Promotion)).List();
var product=session.Query<Product>().Include<Product>(c => c.Categories,c => c.PromotionProductLines.Select(pl=>pl.Promotion)).ToList();
Include would use reflection to determine how to assign values retrieved from database to a property marked as [Reference]. I mean it would behave differently to load a value for Provider which is a property of type Provider than to load a value for Categories property which is an IEnumerable
For a query like this var product=session.Query<Product>().Include<Product>(c => c.Provider).List();
Include will produce a sql like this:
SELECT TOP 1 provider.Path as level1Path, provider.RelatedId as level1Id
FROM ProductRelationIndex provider
WHERE provider.Path == 'Provider'
For a query like this var product=session.Query<Product>().Include<Product>(c => c.Categories).List();
Include will produce a sql like this:
SELECT DISTINCT categories.Path as level1Path , categories.RelatedId as level1Id
FROM ProductRelationIndex categories
WHERE categories.Path == 'Categories'
For a query like this var product=session.Query<Product>().Include<Product>(c => c.PromotionProductLines.Select(pl=>pl.Promotion)).List();
Include will produce a sql like this:
SELECT DISTINCT promotionProductLines.Path as level1Path, promotionProductLines.RelatedId level1Id
, promotion.RelatedId as level2Path, promotion.RelatedId level2Id
FROM ProductRelationIndex promotionProductLines
INNER JOIN PromotionProductLinesRelationIndex promotion ON promotion .Path = 'Promotion' and promotionProductLines.RelatedId = promotion.Id
WHERE promotionProductLines.Path == 'PromotionProductLines'
For a query like this var product=session.Query<Product>().Include<Product>(c => c.Categories,c => c.PromotionProductLines.Select(pl=>pl.Promotion)).ToList();
Most simple solution is Include produce a sql like this:
SELECT DISTINCT promotionProductLines.Path as level1Path, promotionProductLines.RelatedId level1Id
, promotion.RelatedId as level2Path, promotion.RelatedId level2Id
FROM ProductRelationIndex promotionProductLines
INNER JOIN PromotionProductLinesRelationIndex promotion ON promotion .Path = 'Promotion' and promotionProductLines.RelatedId = promotion.Id
WHERE promotionProductLines.Path == 'PromotionProductLines'
UNION ALL
SELECT DISTINCT categories.Path as level1Path , categories.RelatedId as level1Id, 0 as level2Path, 0 as level2Id
FROM ProductRelationIndex categories
WHERE categories.Path == 'Categories'
What do you think? Can this be the start of sth?
I forgot to show a sample of a relation between a property of a nested object withing a document and another document. Following with the same types we are using. Let's think that Promotion and PromotionProductLines are stored in the same document.
If we want to add a relation between a PromotionProductLine and a Product in PromotionRelationIndex we will add a record with Path='PromotionProductLines.Product'
Then an include like this var promotion=session.Query<Promotion>().Include<Promotion>(c => c.PromotionProductLines.Select(p=>p.Product)).List();
will produce an sql like this:
SELECT DISTINCT promotionProductLines_product.Path as level1Path ,promotionProductLines_product .RelatedId as level1Id
FROM PromotionRelationIndex promotionProductLines_product
WHERE promotionProductLines_product.Path == 'PromotionProductLines.Product'
An edge case is when you want to reference from a document an object within another document or even more difficult you want to reference from an object within a document anoter object within another document. For example if we have an Order document that contains OrderLines that have a Product which has a relation with PromotionProductLines like we did before but PromotionProductLines is stored within Promotion document. So, we have two types of documents: orders and promotions with a relation between their nested objects. For this sensible scenario we should try to extend our solution to cover that case.
A solution is we redefine the fields of a RelationIndex
int SourceId {get; set;} //Source document id
string SourcePath {get; set;} //Path to nested object in source document
string SourcePathIds {get; set;} //Path of ids to reach nested objects in source document
int TargeId {get; set;}
string TargetPath {get; set;}
string TargetPathIds {get; set;}
Then maybe we will need to extend the [Reference] annottation for indicating the target root type containing the nested object and the path to the target nested class, but not sure if we need it.
public class Product
{
...
[Reference<Promotion>(p=>p.PromotionProducLines)]
public IEnumerable[PromotionProductLine] PromotionProductLine
...
}
We could have new overloads for methods for adding/removing RelationIndexes of nested objects:
session.AddRelation<Order>(orderId, o=>o.OrderLines.Select(o=>o.Product), "1/2", productId, promotionId, p=>p.PromotionProductLines, "3", promotionProductLineId);
session.RemoveRelation<Order>(orderId, o=>o.OrderLines.Select(o=>o.Product), "1/2", productId, promotionId, p=>p.PromotionProductLines, "3", promotionProductLineId);
We have to keep in mind that session.AddRelation() and session.RemoveRelation() methods will change always two index tables one per entity participating in the relation.
Then an include like this var product=session.Query<Order>().Include<Order>(c => c.OrderLines.Select(ol=>ol.Product.PromotionProductLines)).List();
what sql will produce?.
SELECT DISTINCT promotion .SourceId as level1SourceId, promotion .SourcePath as level1SourcePath, promotion.SourcePathIds as level1SourcePathIds, promotion .TargetId level1TargetId, promotion .TargetPath as level1TargetPath, promotion.TargetPathIds as level1TargetPathIds,
FROM OrderRelationIndex promotion
WHERE promotion .Path == 'OrderLine.Product.Promotion'
This sql would provide the ids of the documents to load and the ids and path to get related entities within both documents. The include would do the work of assigning related values in the documents tree using reflection.
Two corrections to previous comment.
When I said:
We have to keep in mind that session.AddRelation() and session.RemoveRelation() methods will change always two index tables one per entity participating in the relation.
Instead of that we should have one RelationIndex table per relation not per documentType. At the end of the day what I'm proposing is to use a many to many table to represent also many to one or one to one relations. We can name this index ordering lexycografically the two document type names involved in the relation. Following this convention no matter which of the two entitites is the source or the target in an include we will read always same table if involved entities are the same.
So, another refinement for RelationIndex. Index name will be DocumentType1_DocumentType2_RelationIndex and its fields should be renamed to this:
int Document1Id {get; set;}
string Document1Path {get; set;}
string Document1PathPosition {get; set;}
int Document2Id {get; set;}
string Document2Path {get; set;}
string Document2PathPosition {get; set;}
Then an include like this var product=session.Query<Order>().Include<Order>(c => c.OrderLines.Select(ol=>ol.Product.PromotionProductLines)).List();
will produce this sql:
SELECT DISTINCT promotion.Document1Id as SourceId, promotion.Document1Path as SourcePath, promotion.Document1PathIds as SourcePathPosition,
promotion.Document2Id level1TargetId, promotion.Document2Path as level1TargetPath, promotion.Document2PathIds as level1TargetPathPosition
FROM Order_PromotionRelationIndex promotion
WHERE promotion.Document1Path == 'OrderLines.Product.Promotion'
Then if we define a property for the other side of the relation with [Reference] annottation like this:
public class PromotionProducLine {
...
[Reference<Order>(o=>o.OrderLines)]
public IEnumerable<OrderLine> OrderLines
...
}
We will be able of using an include navigating in the opposite direction of the relation, using an include like this var promotion=session.Query<Promotion>().Include<Promotion>(promo => promo.PromotionProductLines.Select(ppl => ppl.OrderLines)).List(); Which will produce an sql based on the same table as previous one:
SELECT DISTINCT orderLines.Document2Id as SourceId, promotion.Document2Path as SourcePath, promotion.Document2PathIds as SourcePathIds,
promotion.Document1Id level1TargetId, promotion.Document1Path as level1TargetPath, promotion.Document1PathIds as level1TargetPathIds
FROM Order_PromotionRelationIndex orderLines
WHERE orderLines.Document2Path == 'PromotionProductLines.OrderLines'
In this example we want to relate Product on position 0 within OrderLine on position 2 withint root object on Document 1 of type order with PromotionProductLines on position 3 within root object of type Promotion on Document 2.
session.AddRelation<Order,Promotion>(orderId, o=>o.OrderLines.Select(o=>o.Product), "2/0", productId, promotionId, p=>p.PromotionProductLines, "3", promotionProductLineId);
IMO the weak part of this proposal is that the user code needs to provide the position of the objects in the path to a nested object involved in a relation when it wants to add a RelationIndexRecord. And also it needs to keep those positions in sync if those are changed.
Another limitation is that we only can create relations between nested objects in separated documents when in the hierarchy to reach the objects are present propertys of types serialized in the document or IEnumeables. However IMO this is not a big limitation, in fact it looks enough for most common scenarios.
The advantage will be that YesSQL will allow us to use document database power of loading very fast big hierarchies in a document but also will provide a fast way of loading branches in that hierarchy stored in other documents. Avoiding the only alternative we have currently on document dbs that is to denormalize. So, with YesSql we would have two options: denormalize causing other issues related to synchronization or creating relations between documents with low impact on performance but a bit of extra work for syncing relations on changes.
Well, I think I lied, because without include you always can use specific code for each scenario where you want to combine data loaded from different documents. At the end what include provides is a convenient and flexible way of doing it without writing very similar code again and again for each scenario.
@sebastienros I've started to work on this issue to see which limitations I face. By the moment my initial design has changed:
Here it is my first commit, it is WIP and in fact only a first step. https://github.com/jersiovic/yessql/commit/c2f0ece7602277bbbe6b9c28df8d62e54124b147
I point it to you just in case you want to give me your opinion.
By the moment RelationIndexes are MapIndexes with predefined set of fields and with hepers that allow developer to focus only on set its relations.
In order to avoid to have to load always relations on an entity and to update on each commit all its relations I decided that when a property with a Refrence attributte is null it means that relation was not loaded, so, no update of records of that index for thtat property are needed.
When our reference property is an IEnumerable like Order.OrderLines
this is fine, cause when we want to remove relations we can do OrderLines.Clear()
, but what if it is only an object like it coud happen with Order.Customer
? We only can set it to null or to a Customer.
What's the best solution in your opinión? To use a generic Object that works similar to Nullable for reference properties, allowing us to differentiate between uninitialized state and empty state. Another alternative I can't see?
A constraint of current implementation is that subentities within a document tree that contain a property with the Reference tag have to contain an Id property, and the same happen for the referenced entity in the target entity of the relation. Those ids are part of the info stored in relation indexes and will help Include methor to build the resulting Object tree. By the moment if a new subentity with a reference property is added the Session.Save methods assigns an unique Id to each of those new entities.
To-Do:
Any progress on this? :)
Mentiond in #355, I have a use case that seems it would require something similar to this.
OrchardCore ContentItems with a ContentPart containing a field with a JObject. All of the parts' JObjects' keys are aggregated into a Keys table (with a unique row for each unique key). Then a Structures table exists that maps the Keys Id columns to a StructureId so that one StructureId has multiple rows matching it to various KeyId values. Example:
Keys Id, Name 1, Apple 2, Banana 3, Orange
Structures Id, KeyId 1, 1 1, 2 2, 1 2, 2 2, 3
Edit:
Got my above use case working with the following 2 ReduceIndexes.
public class RecordKeyIndex : ReduceIndex
{
public string KeyName { get; set; }
public int Count { get; set; }
}
public class RecordStructureIndex : ReduceIndex
{
public int KeysHashId { get; set; }
public int Count { get; set; }
}
In each case, make the non-Count value the key. For the structure index, calculate an integer hash for a given combination of keys in the dynamic data. The Count
properties exist purely so that you know when it is safe to delete the row from the table. Since each ReduceIndex also generates a *_DocumentId
table that matches its row Ids to DocumentIds, that means you automatically know...
If a hash-to-keys lookup is needed, you can just add a JArray field that stores a copy of the keys in the structures table.
To do this with ReduceIndexes though, I see a few issues.
~~1. You need the StructureId to NOT be an auto-incremented ID since multiple rows reuse the same ID. Is this configurable in YesSql per Index table somehow?
Map
/Group
/Reduce
functions, etc..Map()
logic for the StructuresIndex (a type of ReduceIndex
).~~Ideally you could just "include" the keys table into the evaluations of the structures index and have direct access to an Id column if present. But the examples I've seen so far don't have the Index types actually storing the int
Id from their table schema. They do seem to have references to the Document IDs they are related to, but we would need Id columns from a separate index table.
It seems to me like I'd need to, for now, just stick to independently creating and processing these tables in an IContentHandler
in OrchardCore that just reacts to the saving/loading of the data rather than trying to create YesSql indexes for those tables.
Wouldn’t the Apple ID just be a guid, and used as the discriminator?
Pure relational dbs have its pros and cons, pure document dbs also have its pros and cons. To have a nosql db based on an sql db is an smart movement to have best of both worlds easily available to solve the different problems we find developing an app.
Following this philosophy it would be interesting that Yessql also support relations between documents. Usually we see those relations in ORMs for lazy loading. Personally I'm against the use of Lazy Loading because in the long term devs tend to overuse it hurting performance of the system as have seen many times in Orchard 1 when a content type has content fields pointing to other content items.
What it is interesting about relations is that YesSql can offer a kind of Include() method for queries, helping us to avoid easily N+1 queries in a transparent way. Because we will delegate on YesSql the work of optimizing the operation for getting the related documents of resulting documents of a query using only one extra query for retrieving documents on the other side of a relation.
In Orchard 2 this will be very useful to deal with fields pointing to other content items in an optimal way. I've added a similar solution based on an extension for IContentQuery applicable for content fields and other content items like content fields, taxonomy fields, MediaPickerFields that have boosted performance in Orchard 1 sites, problem it still is not enough clean and elegant as I would like to submit a PR ... :)
To declare a relation we will need to mark "navigation fields" with metadata stablished by YesSql that of course will be independent of Orchard Concepts. Those metadata will stablish keys involved on the relation and the type of the relation: Many-To-One, One-To-One and Many-To-Many. Orchard fields depending of its type will fulfill this metadata with its serialization according to its needs. In the orchard side Fields should be developed in a way that they cannot be accessed if its navigation property have not been loaded through include extension. But YesSql should offer a way of explicitly load one field based on a relation. IMO the important thing is to don't offer lazy loading cause it is the root cause of performance issues done by devs, but give them an explicit load alternative to help them to be conscious of what they are doing.
To keep things simple for storing documents and not be worried on tracking changes on related documents, we could only store documents explicitly asked to store. But YesSql won't be responsible of storing related documents of documents explicitly stored.