toy / dump

Rails app rake and capistrano tasks to create and restore dumps of database and assets
MIT License

added remove indexes before bulk insert and add indexes after #12

Open aishek opened 9 years ago

aishek commented 9 years ago

I've improved bulk insert performance: remove all indexes before first insert, and create after all inserts to avoid index rebuilds on each insert operation.
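The idea can be sketched roughly like this (a hypothetical sketch, not the actual patch; `connection` stands in for ActiveRecord::Base.connection, and only name/columns/unique are restored here):

```ruby
# Rough sketch of the idea (not the actual patch): remember a table's index
# definitions, drop them, run the bulk inserts, then recreate the indexes,
# so each insert doesn't pay for an index update.
# `connection` stands in for ActiveRecord::Base.connection; the adapter
# methods used (indexes, remove_index, add_index) exist in the real API.
def with_disabled_indexes(connection, table)
  saved = connection.indexes(table)
  saved.each { |index| connection.remove_index(table, :name => index.name) }
  yield # bulk inserts run here, without per-row index maintenance
ensure
  saved.each do |index|
    connection.add_index(table, index.columns,
                         :name => index.name, :unique => index.unique)
  end
end
```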

toy commented 9 years ago

Doesn't work for rails < 4.0: travis.

aishek commented 9 years ago

Sorry, didn't notice that – fixed now.

toy commented 9 years ago

It would be better to use ActiveRecord::ConnectionAdapters::IndexDefinition.members instead of INDEX_FIELDS (for example, because a new Rails version can add or change fields). Why do you use all_indexes.select {|index| index.table == table} if ActiveRecord::Base.connection.indexes(table) should already return the indexes for one table? Also, there are no tests.

More generally — I assume that removing indexes during insertion should make the whole process faster, but my attempts with and without removing indexes showed no difference using sqlite and postgres, and the opposite effect using mysql. Could you please create an example (a schema and a script to generate data) that restores faster with indexes removed?

toy commented 9 years ago

@aishek Did you get a notification about my comment?

aishek commented 9 years ago

@toy yeah, I've got it. A lot of work these days, sorry :)

toy commented 9 years ago

@aishek No worries :)

aishek commented 9 years ago

Let's discuss the speed impact of removing indexes. Here is an example for sqlite3:

macbook-pro-2:temp aishek$ time sqlite3 test.sqlite3 < create_index_first.sql
real   0m1.150s
user   0m0.351s
sys    0m0.179s

macbook-pro-2:temp aishek$ time sqlite3 test.sqlite3 < create_index_after.sql
real   0m0.611s
user   0m0.299s
sys    0m0.123s

create_index_first.sql create_index_after.sql

I suppose the selectivity of the age column in my example is what makes the difference.
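(The attached scripts weren't inlined in the thread; the following is a guess at their shape. Both create the same table and insert the same rows, differing only in whether the index is created before or after the inserts. Table and index names are taken from the benchmark script later in the thread.)

```sql
-- Hypothetical reconstruction of create_index_first.sql; in
-- create_index_after.sql the CREATE INDEX line moves below the inserts.
create table people (age int);
create index people_age on people(age);
insert into people (age) values (23), (27), (21); -- ...many more rows
```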

What do you think: should this feature be part of dump? If so, I'll improve my pull request using your comments.

toy commented 9 years ago

I am convinced.

Reworked example for multiple databases:

#!/usr/bin/env ruby

require 'benchmark'

table_cmd = 'echo "create table people (age int);"'
index_cmd = 'echo "create index people_age on people(age);"'
drop_cmd = 'echo "drop table people;"'

# Create data.sql inserting `n` rows
n = 100_000
per_insert = 500
values = Array(20..30)
File.open('data.sql', 'w') do |f|
  (n / per_insert).times do
    values_sql = Array.new(per_insert){ "('#{values.sample}')" }.join(',')
    f.puts("insert into people (age) VALUES #{values_sql};")
  end
end
data_cmd = 'cat data.sql'

Benchmark.bmbm do |x|
  {
    :sqlite => 'sqlite3 test.sqlite',
    :mysql => 'mysql5 -D test',
    :postgres => 'psql -q test',
  }.each do |db_name, db_cmd|
    x.report("#{db_name}: data then index") do
      system "(#{[table_cmd, data_cmd, index_cmd, drop_cmd] * ';'}) | #{db_cmd}"
    end

    x.report("#{db_name}: index then data") do
      system "(#{[table_cmd, index_cmd, data_cmd, drop_cmd] * ';'}) | #{db_cmd}"
    end
  end
end

Results:

Rehearsal -------------------------------------------------------------
sqlite: data then index     0.000000   0.000000   0.780000 (  0.878953)
sqlite: index then data     0.000000   0.000000   1.060000 (  1.185402)
mysql: data then index      0.000000   0.000000   0.030000 (  0.232668)
mysql: index then data      0.000000   0.000000   0.030000 (  0.717944)
postgres: data then index   0.000000   0.000000   0.040000 (  0.368946)
postgres: index then data   0.000000   0.000000   0.040000 (  0.575253)
---------------------------------------------------- total: 1.980000sec

                                user     system      total        real
sqlite: data then index     0.000000   0.000000   0.780000 (  0.933262)
sqlite: index then data     0.000000   0.000000   1.090000 (  1.398851)
mysql: data then index      0.000000   0.000000   0.030000 (  0.261416)
mysql: index then data      0.000000   0.000000   0.030000 (  0.731750)
postgres: data then index   0.000000   0.000000   0.040000 (  0.376046)
postgres: index then data   0.000000   0.000000   0.040000 (  0.584572)
aishek commented 9 years ago

I am using INDEX_FIELDS because of this approach. I've slightly improved the naming in the last commits: the array is now called VALID_INDEX_OPTIONS.
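A hypothetical sketch of that whitelist approach (the constant name matches the commit; the particular options listed here are an assumption, not the actual contents):

```ruby
# Sketch of the VALID_INDEX_OPTIONS approach: whitelist the IndexDefinition
# attributes that are passed back to add_index when an index is recreated.
# The exact option list below is an assumption for illustration.
VALID_INDEX_OPTIONS = [:unique, :length, :where, :using].freeze

def whitelist_index_options(definition_attributes)
  definition_attributes.select { |key, _| VALID_INDEX_OPTIONS.include?(key) }
end
```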

I also removed additional select.

I also added tests, rebased the code, and fixed pg and i18n gem versions for ruby 1.8.7.

toy commented 9 years ago

After looking at the Rails DB adapter code, it seems to me that not all index options supported by add_index (like internal or algorithm) will be present in instances of IndexDefinition returned by ActiveRecord::Base.connection.indexes(table). So there should be an environment variable to disable removing/adding indexes, and an explanation in the readme.
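Such a switch could look roughly like this (the variable name REBUILD_INDEXES appears later in this thread; treating `1`/`true`/`yes` as enabled is an assumption about how the gem's Env class parses flags):

```ruby
# Sketch of an environment-variable guard for the index-rebuilding feature.
# The variable name REBUILD_INDEXES is taken from later in this thread;
# the accepted truthy values are an assumption.
def rebuild_indexes?
  %w[1 true yes].include?(ENV['REBUILD_INDEXES'].to_s.downcase)
end
```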

Your current approach for adding indexes ignores at least lengths and orders. To check, try adding an index with a length to a table in a MySQL DB and compare the indexes returned by ActiveRecord::Base.connection.indexes(table) before and after restoring the table. So maybe it is better to grab all options provided by IndexDefinition and convert those which are arrays to hashes using the column names as keys.

The test you've added only checks that with_disabled_indexes is called (and it doesn't even call the block); it doesn't check what happens to the indexes.

Changes to reader.rb in b6e666f1 create long lines, so in reality they introduce code-style issues. They are not flagged because some style checks are still disabled by .rubocop_todo.yml.

I've already cherry-picked your commits about pg and i18n for the version with the updated progress gem dependency.

aishek commented 9 years ago

I'll fix these issues later. Thanks for the detailed remarks!

aishek commented 9 years ago

Ok, I fixed everything.

There are two things to discuss.

First, I've extracted the assets-related logic from Reader into a separate class to fix the Rubocop warning about a long class definition. Now there is some inconsistency in the code: the Summary and Assets class specs should be extracted to separate files, and the DB-related code should be extracted from Reader into a separate class.

Second, https://github.com/toy/dump/commit/7966b031625bc5ac4232e2ad32cbda2aa618e6a3#diff-89f6960ada41447910198cd9a526b3dcR73 – an ugly send.

What do you think?

toy commented 9 years ago

Sorry for long wait.

About extracting the assets-related logic: better not to do it as part of this pull request — just increase Metrics/ClassLength in .rubocop_todo.yml for now. The code of this gem needs cleanup, but not as part of adding features.

Entry in the readme — there is script/update_readme to add environment variable descriptions from the Env class. Also, you should add :rebuild_indexes to :restore_options in variable_names_for_command.

index_options is problematic: try adding an index like add_index :chickens, [:string_col, :text_col], :length => 10 to spec/db/schema.rb and running REBUILD_INDEXES=1 rspec. It's better to add such an index to the schema and add a spec to cycle_spec comparing indexes before and after restore, to test against a real database.

Also, you handle lengths but not orders, and the PostgreSQL adapter returns orders. So I still think that instead of using VALID_INDEX_OPTIONS it would be better to go through index.members: if a value is a hash, assign it to the "singularised" option name; if it's an array, create a hash by zipping it with the column names and assign that to the "singularised" option name. Do you see problems with such an approach?
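That suggestion can be sketched like this (IndexDefinition is modelled here as a plain Struct, since the real one is a Struct subclass; the crude trailing-s "singularisation" is a simplification for illustration):

```ruby
# Sketch of the suggested approach: walk all IndexDefinition members and
# build add_index options generically, so new per-column members (lengths,
# orders, ...) are handled without a hard-coded whitelist.
# IndexDefinition is modelled as a plain Struct here.
IndexDefinition = Struct.new(:table, :name, :unique, :columns, :lengths, :orders)

def options_from_index(index)
  options = {}
  index.each_pair do |member, value|
    next if value.nil? || [:table, :name, :columns].include?(member)
    key = member.to_s.sub(/s\z/, '').to_sym # crude singularise: :lengths => :length
    # arrays hold per-column values, so key them by column name
    options[key] = value.is_a?(Array) ? Hash[index.columns.zip(value)] : value
  end
  options
end
```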