Switch ManyToMany fields to using JOINS instead of WHERE...IN

jonathangeiger commented 14 years ago

This needs to be performance tested, but ccazette brought up the fact that the former is most likely faster for large result sets.

kolanos commented 14 years ago

It should alternate between the two based on the size of the result set. Performance testing would determine the tipping point where WHERE IN() becomes slower than a JOIN. Not sure what that point is, but if performance is a priority, it should do it both ways.

ccazette commented 14 years ago

I am testing right now. Will let you know.

banks commented 14 years ago

It should alternate between the two based on the size of the result set. Performance testing would determine the tipping point where WHERE IN() becomes slower than a JOIN

I think the performance gain for such a complexity will be negligible and will depnd on a lot more variables than just size of result set - row length, mysql caching, memory allocation, processor speed, server load, networking limitations etc. Also, how do you know size of result set beforehand without another query?

I think Join is the best solution overall and we should stick to just that.

if performance is a priority

Only up to a point. If performance was a real priority, you probably shouldn't be using an ORM at all. ORM is about DRY and increased developmenty speed. There is always a performance compromise there.

Use joins definitely as they will scale and will be have negligible performance difference on small result sets but I see no argument for complex code that tries to optimize queries. If performance is that big a deal, optimize your app/don't use ORM for the critical queries...

ccazette commented 14 years ago

@banks : the point is Jelly is not using JOINs at the minute, but WHERE ... IN statements instead !

I agree with you, and this is the whole point of this request. If using joins is more efficient than where ... is, why not make Jelly more performant overall, since there is no complexity in changing this ? But that would definitely be using one or the other, and not a combination of the two.

banks commented 14 years ago

the point is Jelly is not using JOINs at the minute, but WHERE ... IN statements instead !

I got that! I agree that should change, just not that we should try to intelligently optimize which would be a waste of effort.

ccazette commented 14 years ago

OK, sorry, I got confused by your comment. :-). I totally agree then.

Here are my findings if you wonder : suprisingly, WHERE IN and JOINS performs pretty much the same on large result sets in my tests. However, queries are more affected by the pivot table row length. On a classic pivot table (2 columns of IDs), JOINs is 10% faster than WHERE IN. WHERE IN proves to be slightly faster on longer rows. Anyway, to give more arguments in favor of JOINs:

The first argument that occurs to me is that apparently Jelly could never use intermediate data (columns in the pivot table) by using WHERE IN. That would be a nice feature to have, really. ORM supports this.
Logically speaking, and may be in terms of readability, JOINS are more readable when you look at the overall query. WHERE IN is more like a RULE, so it makes less sense IMHO.
Most of the pivot tables have 2/3 columns, in which case JOINs prove to be slightly (shall I insist on the slightly ?) faster.

So yeah, may be not a priority, but I think like Banks that Jelly should definitely use JOINs instead.

banks commented 14 years ago

WHERE IN also limits you on the size of the result set since mysql query length is limited to 1 MB in standard installs... If you have 10,000 rows to 'join' your query could get longer than that limit easily.

This may be a bit of an edge case but I think on balance 'JOIN' is the 'right' way to do this.

jonathangeiger commented 14 years ago

Closed by ac19026efc4fce4bbede4817924a444e63a4d551. Thanks for discussion everyone and thanks for the benchmarks ccazette!

kolanos commented 14 years ago

I wouldn't have guessed that JOINs are as fast or faster than WHERE IN()s in most cases, good to know. JOINs get a bad rap, I guess.

jonathangeiger commented 14 years ago

Sorry I meant to say thanks to ccazette, not raeldc. I edited the post but I can't edit the commit message.

10,000 thanks to you ccazette!

dotcreative commented 14 years ago

I tested both where in and join and on my project it is quite a big difference between join and where in, in terms of performance. Join is faster.

But...there is a big change in how manytomany behaves now. Let me explain with an example.

I have around 1000 products, that belong both to a category and to a brand. For each category I want to show a list of the brands. My 'through' table in this case is products. Using where in i could see the brands by simply showing $category->brands. Which is somehow logical.

Now, it actually selects also the products fields (for example product title overwrites the brand title) and is selecting the brand every time there is a product. Which is not exactly desired. I can go around this, but the default behavior shouldn't be the more obvious one?

Join is faster and I changed the manytomany get behavior to reflect the old one. Can I get some feedback on the implications of this changes:

    return Jelly::select($this->foreign['model'])
                ->distinct(TRUE)
                ->select($this->foreign['model'].'.*')
                ->join($this->through['model'])
                ->on($join_col1, '=', $join_col2)
                ->where($where_col, '=', $model->id());

ccazette commented 14 years ago

Yes, as I underlined in my first point a few days ago, using JOINs also makes Jelly able to deal with 'through' tables and models. This fix has been implemented to replace the WHERE IN which could not allow to get their columns, but the reflexion on how to handle this information has not been made AFAIK, or at least was not discussed here... So you are right, behind this replacement lies the question : how should Jelly handle 'through' table additional informations ? I suppose it should be handled like a with...

dotcreative commented 14 years ago

If by "should be handled like a with" you mean that $category->brands should return the distinct brands and $brand->products returns the products that belong to both category and brand (supposing that $brand is the result of foreach on $category->brands), then I would totally agree. It would be the desired way to handle this.

ccazette commented 14 years ago

Not sure this is working... You're first attempt was working as a workaround, since you had to use distinct() to make it work. This is not a 'real' manytomany relationship.

In a proper manytomany, in you know model1 and model2, you should have only one 'through' result. That means in your case you should have only one product per category and per brand if you want to be in a real manytomany use case. What I meant would work like you suggested in this case. You'd have $category->brands returning all brands. If you loop through these brands and then call $brand->product, you should be able to return the through table used, but that should return only one product. That would be IMO the correct way to handle through tables.

In your case, in would think either about creating a new field type that would iron out problems of having many products for 1 brand and 1 category, for instance using distinct on something largely inspired from Jelly ManyToMany, either extend the query builder for category and brands to do that neatly using an homemade query(this is an awesome feature of Jelly). Now if you really don't feel like doing this, you could have proper 'through' tables such as brand_categories, and so on...

dotcreative commented 14 years ago

Thanks, first of all I'm kind of a beginner with databases and SQL, so sometimes I can say stupid things about this matter :).

Having brand_categories, is excluded, as from my point of view is completely redundant, I already have products table for that.

Regarding having only one through result, I agree, but sometimes, like this case, it a logical to have many through results. Why woudn't be possible to have all the results from the through table? It's just as simple as having one.

dotcreative commented 14 years ago

I looked extending Jelly Builder. It's a very nice solution. Can I use this to extend the default result, without having to add an extra function? If yes, how?

ccazette commented 14 years ago

It is possible, just that you're not using the correct relationships to do it. This is not a manytomany. What if someone doesn't want to use distinct() ? I would have :

products belongs_to categories, belongs_to brands
categories has_many products
brand has_many products

And you would make either a special field or a special builder function as specified here (http://jelly.jonathan-geiger.com/docs/jelly.extending-builder) for brands and categories to be able to get one from the other joining with products in a homemade query (probably using distinct as you used to). But let's not forget this thread should be re-opened (don't know how to do it) to handle through tables. Your problem would totally apply here if you had only one product per category and table. I've put my thoughts in the comment above, I think when you use manytomany, you should have the trough table info into the result models, just as you have when using with(). Cale, you would then probably use that when it's fixed, change the field name to whatever you fancy, add distinct() in the field class queries, and that should probably get you sorted.

ccazette commented 14 years ago

Just saw your last reply. If you want the trough table info embedded in your results, then you should probably copy the manytomany field when it'll support it, tweak it a little and use it in categories and brands. The builder probably won't allow you to do this. Let's not pollute this thread as it is getting off-topic. If you have more questions, use the main forum thread.

dotcreative commented 14 years ago

Thanks, sorry for the offtopic question.

jonathangeiger commented 14 years ago

But let's not forget this thread should be re-opened (don't know how to do it) to handle through tables.

This is a separate issue, so I suggest you create it instead of re-opening this.

jonathangeiger / kohana-jelly

Switch ManyToMany fields to using JOINS instead of WHERE...IN #42