False Positive for Rails/PluckInWhere

natematykiewicz commented 4 years ago

Rails/PluckInWhere assumes that the pluck is being called on an ActiveRecord relation.

data = Rails.cache.fetch('iterations') { Api.iteration_overview }
indexed_iterations = Iteration.where(number: data.pluck('number')).index_by(&:number)

data is not an ActiveRecord relation, it's an array of hashes. The pluck here is using ActiveSupport's Enumerable#pluck.

Expected behavior

Call pluck on an array of hashes, and not have Rails/PluckInWhere report an issue.

Actual behavior

Call pluck on an array of hashes, and have Rails/PluckInWhere report "Use select instead of pluck within where query method."

Steps to reproduce the problem

arr = [{foo: 1}, {foo: 2}]
Foo.where(val: arr.pluck(:foo))

RuboCop version

Rubocop 0.88.0, Rubocop-Rails 2.6.0

mobilutz commented 4 years ago

I just stumbled over this false positive as well.

Maybe a check needs to be added if the class that pluck is being called on is an ActiveRecord::Relation or something else.

tilo commented 2 years ago

we had a situation where this Rubocop rule caused the MySQL query planner to not optimize the query correctly, and overwhelmed the DB.

.pluck(:id) returns an array of numeric IDs .select(:id) returns the instances with an 'id' attribute - which is incorrect in the context we observed

the correct "select" statement in our case would have been .select(:id).map(&:id) which is not preferable to .pluck(:id)

@koic I'm not convinced that this is a good and general Rubocop rule

natematykiewicz commented 2 years ago

@tilo can you provide more details?

.select(:id) returns the instances with an 'id' attribute

This is only true if you've called load on the relation, which can happen if you directly call load or any enumerable method such as each or map.

If you're just nesting relations, it does a subquery.

Foo.where(bar_id: Bar.active.select(:id))

This does 1 query:

SELECT *
FROM foos
WHERE bar_id IN (
  SELECT id FROM bars WHERE active
);

If you use pluck, it does 2 queries:

Foo.where(bar_id: Bar.active.pluck(:id))

1 to grab the ids of all the relevant Bars.

SELECT id FROM bars WHERE active;

And 1 to get the Foo records using those Bar ids.

SELECT *
FROM foos
WHERE bar_id IN (1, 2, 3);

natematykiewicz commented 2 years ago

Also you said:

the correct "select" statement in our case would have been .select(:id).map(&:id) which is not preferable to .pluck(:id)

But .pluck(:id) would result in exactly the same query plan and SQL as .select(:id).map(&:id), just the latter will instantiate a bunch of ActiveRecord objects to get the IDs, where the former will just have an array of IDs.

Both of those would do 2 queries -- 1 to select the ids and 1 to grab your rows.

Can you provide some relevant code?

natematykiewicz commented 2 years ago

Perhaps what you're meaning to say is that you wanted .pluck(:id) because a subquery of .select(:id) overwhelmed the query planner, but this cop pushed you to use a subquery?

tilo commented 2 years ago

I wanted to use pluck(:id), but this cop auto-corrected it to use select(:id).

Original PR, using pluck(:id) resulted in this Query Plan:

> HugeTable.where(item_type: 'MyModel', item_id: MyModel.where(user_id: @user.id).pluck(:id)).explain
=>
EXPLAIN for: SELECT `hugetable`.* FROM `hugetable` WHERE `hugetable`.`item_type` = 'MyModel' AND `hugetable`.`item_id` IN (1442298, 2304566, 2304574, 2608158)
+----+-------------+----------+------------+-------+-----------------------------------------+-----------------------------------------+---------+------+------+----------+-----------------------+
| id | select_type | table    | partitions | type  | possible_keys                           | key                                     | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+----------+------------+-------+-----------------------------------------+-----------------------------------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | hugetable | NULL       | range | index_hugetable_on_item_type_and_item_id | index_hugetable_on_item_type_and_item_id | 1026    | NULL |   36 |    100.0 | Using index condition |
+----+-------------+----------+------------+-------+-----------------------------------------+-----------------------------------------+---------+------+------+----------+-----------------------+

uses an index

Modified PR, after using rubocop -A, using select(:id), resulted in this Query Plan: (rule Rails/PluckInWhere)

> HugeTable.where(item_type: 'MyModel', item_id: MyModel.where(user_id: @user.id).select(:id)).explain
=>
EXPLAIN for: SELECT `hugetable`.* FROM `hugetable` WHERE `hugetable`.`item_type` = 'MyModel' AND `hugetable`.`item_id` IN (SELECT `my_models`.`id` FROM `my_models` WHERE `my_models`.`user_id` = 1516073) 
+----+-------------+-----------------------+------------+------+------------------------------------------------+-----------------------------------------+---------+----------------------------------------------+------+----------+-------------+
| id | select_type | table                 | partitions | type | possible_keys                                  | key                                     | key_len | ref                                          | rows | filtered | Extra       |
+----+-------------+-----------------------+------------+------+------------------------------------------------+-----------------------------------------+---------+----------------------------------------------+------+----------+-------------+
|  1 | SIMPLE      | my_models | NULL       | ref  | PRIMARY,index_my_models_on_user_id | index_my_models_on_user_id  | 5       | const                                        |    4 |    100.0 | Using index |
|  1 | SIMPLE      | hugetable              | NULL       | ref  | index_hugetable_on_item_type_and_item_id        | index_hugetable_on_item_type_and_item_id | 1026    | const,prod_db.my_models.id |    5 |    100.0 | NULL        |
+----+-------------+-----------------------+------------+------+------------------------------------------------+-----------------------------------------+---------+----------------------------------------------+------+----------+-------------+

This added a sub-query, and the MySQL optimizer did no longer use an index on HugeTable! 😱 e.g. NULL HugeTable has around 1.6B rows.. 😱 (yes, we need to get rid of that table)

The select which caused the sub-query was added by a RuboCop auto-correct for rule Rails/PluckInWhere

> MyModel.where(user_id: @user.id).select(:id)
   => #<ActiveRecord::Relation [#<MyModel id: 1442298>, #<MyModel id: 2304566>, #<MyModel id: 2304574>, #<MyModel id: 2608158>]>

this is returning instances of MyModel, instead of numeric IDs

Perhaps the auto-corrected query with select should have been: .map(&:id):

> HugeTable.where(item_type: 'MyModel', item_id: MyModel.where(user_id: @user.id).select(:id).map(&:id)).explain
=>
EXPLAIN for: SELECT `hugetable`.* FROM `hugetable` WHERE `hugetable`.`item_type` = 'MyModel' AND `hugetable`.`item_id` IN (1442298, 2304566, 2304574, 2608158) 
+----+-------------+----------+------------+-------+-----------------------------------------+-----------------------------------------+---------+------+------+----------+-----------------------+
| id | select_type | table    | partitions | type  | possible_keys                           | key                                     | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+----------+------------+-------+-----------------------------------------+-----------------------------------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | hugetable | NULL       | range | index_hugetable_on_item_type_and_item_id | index_hugetable_on_item_type_and_item_id | 1026    | NULL |   36 |    100.0 | Using index condition |
+----+-------------+----------+------------+-------+-----------------------------------------+-----------------------------------------+---------+------+------+----------+-----------------------+

and pluck(:id) is simpler than .select(:id).map(&:id) so maybe not use this rule..

tilo commented 2 years ago

PM me for more details :)

tilo commented 2 years ago

@natematykiewicz it did not result in the same query plan (see above)

natematykiewicz commented 2 years ago

What I was saying is .select(:id).map(&:id) is the same SQL and Query Plan as .pluck(:id), but much less efficient in terms of Ruby object allocations. There's never a time when you want to use .select(:id).map(&:id). You should be using either .pluck(:id) (which does 2 queries) or .select(:id) (which does 1 query using subselect).

I'm a bit shocked that your subselect query performed so poorly (.select(:id)). I wouldn't be surprised if huge_tables.item_id was not the same datatype as my_models.id; perhaps one is bigint and the other is int, and it's doing a poor job typecasting them mid-query. Or maybe you need to ANALYZE or VACUUM the tables at play here.

@koic Thoughts on marking this cop "unsafe", since there's no guarantee that a subquery will actually perform better than 2 queries?

tilo commented 2 years ago

@koic WDYT?

koic commented 2 years ago

This cop has already been marked as unsafe. OTOH, the doc could be updated. https://docs.rubocop.org/rubocop-rails/cops_rails.html#railspluckinwhere

tilo commented 2 years ago

Thank you!

rubocop / rubocop-rails