jeremyevans / sequel

Sequel: The Database Toolkit for Ruby
http://sequel.jeremyevans.net
Other
4.99k stars 1.07k forks source link

fix empty? for datasets containing null #1990

Closed andy-k closed 1 year ago

andy-k commented 1 year ago

#empty? was not returning false for some non-empty datasets.

irb(main):001:0> require "sequel"
=> true
irb(main):002:0> DB = Sequel.sqlite
=> #<Sequel::SQLite::Database: {:adapter=>:sqlite}>
irb(main):003:0> DB["select null"].empty?
=> true
irb(main):004:0> DB["values (null, 1), (2, 3)"].empty?
=> true
irb(main):005:0> DB.values([[nil, 1], [2, 3]]).empty?
=> true
irb(main):006:0> 
jeremyevans commented 1 year ago

Would you mind sharing how the code prevents loading a large amount of data? I'm not yet familiar enough with the library and was unaware of this case and how to trigger it. Moreover, I'm not sure which part of the above transformation would break this (loading a large amount of data) case specifically.

Consider the following:

DB.create_table(:t){String :s, :text=>true}
DB[:t].insert('1'*100_000_000)

DB[:t].empty?
# SELECT 1 AS 'one' FROM `t` LIMIT 1

DB['SELECT * FROM t'].empty?
# Before and with your change:
# SELECT * FROM t
# With my recommended change:
# SELECT 1 AS 'one' FROM (SELECT * FROM t) AS 't1' LIMIT 1

Your change fixes the return value (good), but it will load a large value (bad). My approach fixes the return value, but avoids loading a large value. You can benchmark the above cases, but my approach is probably at least 50x faster in the above example, compared to yours.

The Dataset#values case is different. For both SQLite and PostgreSQL, the VALUES statement requires at least one row, so Dataset#empty? could always return true without a query in that case. We should probably make Dataset#values raise for empty values.

I can build on top of this pull request and fix the remaining issues. This will need integration tests and testing on all databases Sequel supports to ensure correct behavior.

andy-k commented 1 year ago

I see. DB['SELECT * from t'].empty? loads the large value in the "with my change" case. I misunderstood your comment as that I had introduced a regression, and am now relieved that I hadn't; this is exactly the same behavior as the "before" case. In this PR, I was primarily interested in fixing the return value of #empty?.

select 1 from ... would definitely help performance in the specific case when the original query selects a big value and that big value happens to be the first row returned. But that's rare, and there is a minimal cost to build that wrapper; there is a reason we aren't just wrapping everything with #from_self regardless whether it's a custom SQL, a #values, or just DB[:t]. (With PostgreSQL (but not SQLite), you can even select from t and get rows of nothings.)

I disagree to have all custom SQL automatically wrapped in #from_self. It is easy enough for the caller to do DB['SELECT * from t'].from_self.empty? manually when they believe this is a real optimization that applies in their case; in most cases, the query looks more like DB['SELECT * from t order by something'].empty? and that's not being helped by #from_self. When the caller already constructs an .empty?-friendly query -- in my case I constructed select null from ... and found this issue -- it is probably not beneficial to (incur GC and) forcibly convert that to select 1 from (select null from ...). I might be in the minority, I had assumed def empty? = !first all along and actually expected the whole first row to be loaded... 😅

Yes, feel free to build on top of this pull request, optimize Dataset#values, and so on.

Optimizations and different approaches aside, would you agree that this pull request addresses the core issue (the return value of #empty?) without any regressions?

Thanks for the library!

jeremyevans commented 1 year ago

select 1 from ... would definitely help performance in the specific case when the original query selects a big value and that big value happens to be the first row returned. But that's rare, and there is a minimal cost to build that wrapper; there is a reason we aren't just wrapping everything with #from_self regardless whether it's a custom SQL, a #values, or just DB[:t]. (With PostgreSQL (but not SQLite), you can even select from t and get rows of nothings.)

The wrapper is cached, so it's only built the first time it is needed for the dataset. It's true that we aren't using from_self in all cases. There isn't a reason to use it if you know you don't need it.

I disagree to have all custom SQL automatically wrapped in #from_self. It is easy enough for the caller to do DB['SELECT * from t'].from_self.empty? manually when they believe this is a real optimization that applies in their case; in most cases, the query looks more like DB['SELECT * from t order by something'].empty? and that's not being helped by #from_self.

Without the from_self, if any column in t is large, this can significantly slow down empty?. Note that there are lots of other cases in Sequel where we wrap datasets with custom SQL in from_self, such as:

DB['SELECT * FROM t'].count
# SELECT count(*) AS "count" FROM (SELECT * FROM t) AS "t1" LIMIT 1

So following this practice with empty? is consistent with Sequel's behavior in other cases, and a good idea for the reasons I've discussed.

When the caller already constructs an .empty?-friendly query -- in my case I constructed select null from ... and found this issue -- it is probably not beneficial to (incur GC and) forcibly convert that to select 1 from (select null from ...). I might be in the minority, I had assumed def empty? = !first all along and actually expected the whole first row to be loaded... 😅

Asserting one approach is faster than another without benchmarking is a risky business. Here's a much more reasonable example with a table with only 6 columns:

DB.create_table(:t) do
  primary_key :id
  5.times do |i|
    String :"c#{i}"
  end
end
s = '1' * 50
DB[:t].insert(:c0=>s, :c1=>s, :c2=>s, :c3=>s, :c4=>s)

ds = DB['SELECT * FROM t']
require 'benchmark/ips'
Benchmark.ips do |x|
  x.report('empty?'){ds.empty?}
end

Results with your approach:

empty?     12.734k (_ 0.8%) i/s -     64.719k in   5.082693s

Results with my approach:

empty?     18.525k (_ 1.5%) i/s -     94.401k in   5.097011s

Yes, feel free to build on top of this pull request, optimize Dataset#values, and so on.

OK. I'll build on top of this and make sure this issue is fixed before the next release.

Optimizations and different approaches aside, would you agree that this pull request addresses the core issue (the return value of #empty?) without any regressions?

Certainly your pull request fixes the issue for datasets with custom SQL and those using values, and I thank you for that.

andy-k commented 1 year ago

Thanks for merging and for adjusting!

Unfortunately, it seems the commit that follows the merge (a638268ce08892cd06d7b858a2b9770f0ed2e183) broke this.

# this PR #1990 made DB.values(...).empty? return false
$ git checkout a638268ce^
HEAD is now at 23534e85d fix empty? for datasets containing null
$ ruby -r './lib/sequel' -e 'p Sequel.sqlite.values([[nil]]).empty?'
false

# the commit following it made it return true again
$ git checkout a638268ce
Previous HEAD position was 23534e85d fix empty? for datasets containing null
HEAD is now at a638268ce Avoid loading full row in Dataset#empty?
$ ruby -r './lib/sequel' -e 'p Sequel.sqlite.values([[nil]]).empty?'
true

# and that's the same as how it was before
$ git checkout a638268ce^^
Previous HEAD position was a638268ce Avoid loading full row in Dataset#empty?
HEAD is now at a534d716d Bump version to 5.65.0
$ ruby -r './lib/sequel' -e 'p Sequel.sqlite.values([[nil]]).empty?'
true

It seems that the return false added in the commit is not executed...

jeremyevans commented 1 year ago

I think your issue is you are only requiring the Sequel library without changing the load path, so you are not loading the sqlite adapter for the checkout of Sequel, your are probably loading the sqlite adapter from an installed sequel gem.

There was a spec added for the behavior, and you can also see it works correctly on the command line:

$ bin/sequel sqlite:/ -c 'p DB.values([[nil]]).empty?'    
false
andy-k commented 1 year ago

You're right, there's no regression. Thanks for patiently sharing the correct command line to use for testing this!