sunitparekh / data-anonymization

Want to use production data for testing, data-anonymization can help you.
MIT License
459 stars 92 forks source link

Using SelectFromDatabase with table IDs finds nils instead of ID numbers. #38

Open Grammarella opened 8 years ago

Grammarella commented 8 years ago

This is a great gem. I'm using 0.7.3 right now and the Blacklist strategy to anonymize some sensitive data. I'm looking forward to using the parallel table execution strategy, but for the sake of this bug report I am using the normal sequential execution.

I got very excited about the SelectFromDatabase method and endeavored to use it for a number of things in order to "scramble" table references. First, and this is minor, the example at http://www.rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromDatabase does not make it clear that I should reuse the connection_spec for each SelectFromDatabase call, though I did figure out that I should keep passing that in from the constructor details.

The actual issue is that I have tables such as features_trainings (in this example) where I want to scramble the feature referred to in the linking table. I want to take the feature_id foreign key and replace it such that it refers to any random feature in the features table. So for example, I have this:

        table 'features_trainings' do
          primary_key 'training_id', 'feature_id'
          batch_size 1000
          anonymize('feature_id').using FieldStrategy::SelectFromDatabase.new('features','id', my_connection_spec)
        end

(Though the composite key is used here, I observe the same issue with tables that have primary_key 'id' in use.)

What happens is that the values returned by anonymizing are all nils. I tracked this down to select_from_database.rb, line 17. The odd thing is that source.select(field_name) returns a value like this:

#<ActiveRecord::Relation [#<DataAnon::Utils::SourceDatabase::Features id: nil>, #<DataAnon::Utils::SourceDatabase::Features id: 
nil>, #<DataAnon::Utils::SourceDatabase::Features id: nil>, #<DataAnon::Utils::SourceDatabase::Features id: nil>, #<DataAnon::Ut
ils::SourceDatabase::Features id: nil>, #<DataAnon::Utils::SourceDatabase::Features id: nil>, #<DataAnon::Utils::SourceDatabase:
:Features id: nil>, #<DataAnon::Utils::SourceDatabase::Features id: nil>, #<DataAnon::Utils::SourceDatabase::Features id: nil>, 
#<DataAnon::Utils::SourceDatabase::Features id: nil>, ...]>

For some reason, all the IDs are nil. Interestingly,

source.select(field_name).first[field_name]

returns nil, but

source.select(field_name).first.id

returns the actual id number (1, 2, etc.). Part of me thinks that changing that call on that line slightly would fix it, but I have a sense that what is going on may be more subtle than that, since source produces all those nils.

Of course, one workaround is obvious, and I will move to doing my own separate query for feature ids and using SelectFromList. But it seemed to me that SelectFromDatabase could be a powerful tool, and I thought this was worth reporting. Thanks again for this gem and its documentation.

sunitparekh commented 8 years ago

hey, I released rail5 upgraded version 0.8.0.rc1 of data anonymisation and I fixed and tested SelectFromDatabase and it is working fine. refer example https://github.com/sunitparekh/test-anonymization/blob/master/dell_whitelist.rb

please try and provide me feedback

Grammarella commented 8 years ago

Thank you very much for looking into this. I appreciate it.

Some background information: I'm using data-anonymization in a rake task, and as I'm not ready to upgrade my entire project to Rails 5, I am testing this in a copied project that tries to approximate the original situation. I'm also using mysql2 as my database adapter.

Here's my test table: table 'addresses' do primary_key 'id' batch_size 1000 anonymize('state').using FieldStrategy::SelectFromDatabase.new('addresses', 'state', db_config) anonymize('country_id').using FieldStrategy::SelectFromDatabase.new('countries', 'id', db_config) end

And this is my error: Mysql2::Error: Unknown column 'addresses.' in 'order clause': SELECT addresses.* FROM addresses ORDER BY addresses.`` ASC LIMIT 1000

It's tempting to think it's a mysql2 error, but I don't think that's the case. I debugged in the activerecord gem, activerecored-5.0.0.1/lib/active_record/relation/batches.rb. In the in_batches method, the reorder call appears to be where this breaks. The problem is that batch order returns this: addresses.`` ASC

Looking at how batch_order is defined, it appears that the problem is that quoted_primary_key is empty, and in my debugging, this indeed seems to be the case. So it almost seems as if something about SelectFromDatabase is losing a concept of the primary key, even though I did set that manually in this example.

olly commented 6 years ago

I'm also experiencing this issue when using Rails v5.1.5. It only occurs when a batch_size is specified. The backtrace is see is:

Processing table payment_infos records in batch size of 1000
  DataAnon::Utils::SourceDatabase::Payment_infos Load (1.1ms)  SELECT  "payment_infos".* FROM "payment_infos" ORDER BY "payment_infos"."" ASC LIMIT $1  [["LIMIT", 1000]]

PG::SyntaxError: ERROR:  zero-length delimited identifier at or near """"
LINE 1: ...".* FROM "payment_infos" ORDER BY "payment_infos"."" ASC LIM...
                                                             ^
: SELECT  "payment_infos".* FROM "payment_infos" ORDER BY "payment_infos"."" ASC LIMIT $1 
 ["/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/postgresql_adapter.rb:616:in `async_exec'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/postgresql_adapter.rb:616:in `block (2 levels) in exec_no_cache'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/dependencies/interlock.rb:46:in `block in permit_concurrent_loads'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/concurrency/share_lock.rb:185:in `yield_shares'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/dependencies/interlock.rb:45:in `permit_concurrent_loads'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/postgresql_adapter.rb:615:in `block in exec_no_cache'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/abstract_adapter.rb:613:in `block (2 levels) in log'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/abstract_adapter.rb:612:in `block in log'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/notifications/instrumenter.rb:21:in `instrument'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/abstract_adapter.rb:604:in `log'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/postgresql_adapter.rb:614:in `exec_no_cache'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/postgresql_adapter.rb:603:in `execute_and_clear'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/postgresql/database_statements.rb:79:in `exec_query'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/abstract/database_statements.rb:371:in `select'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/abstract/database_statements.rb:42:in `select_all'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/connection_adapters/abstract/query_cache.rb:97:in `select_all'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/querying.rb:39:in `find_by_sql'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation.rb:678:in `exec_queries'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation.rb:546:in `load'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation.rb:255:in `records'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation/batches.rb:216:in `block in in_batches'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation/batches.rb:214:in `loop'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation/batches.rb:214:in `in_batches'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation/batches.rb:128:in `find_in_batches'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activerecord-5.1.5/lib/active_record/relation/batches.rb:62:in `find_each'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/strategy/base.rb:132:in `process_table_in_batches'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/strategy/base.rb:105:in `process'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/core/database.rb:68:in `block in anonymize'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/core/database.rb:66:in `each'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/core/database.rb:66:in `anonymize'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/core/database.rb:46:in `anonymize'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/data-anonymization-161ddb329293/lib/core/dsl.rb:10:in `database'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/switcher-common-0029bd85b8fc/lib/switcher/anonymize.rb:55:in `execute'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/switcher-common-0029bd85b8fc/lib/switcher/anonymize.rb:45:in `run'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/switcher-common-0029bd85b8fc/lib/switcher/anonymize.rb:11:in `run'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/switcher-common-0029bd85b8fc/lib/switcher/tasks/anonymize.rake:9:in `block in <top (required)>'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/task.rb:251:in `block in execute'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/task.rb:251:in `each'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/task.rb:251:in `execute'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/bugsnag-5.3.3/lib/bugsnag/rake.rb:12:in `execute_with_bugsnag'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/task.rb:195:in `block in invoke_with_call_chain'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/task.rb:188:in `invoke_with_call_chain'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/task.rb:181:in `invoke'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:160:in `invoke_task'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:116:in `block (2 levels) in top_level'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:116:in `each'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:116:in `block in top_level'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:125:in `run_with_threads'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:110:in `top_level'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:83:in `block in run'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:186:in `standard_exception_handling'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rake-12.3.0/lib/rake/application.rb:80:in `run'", "/Users/Olly/Work/Switcher/broadband-leads/bin/rake:6:in `<top (required)>'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/dependencies.rb:286:in `load'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/dependencies.rb:286:in `block in load'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/dependencies.rb:258:in `load_dependency'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/activesupport-5.1.5/lib/active_support/dependencies.rb:286:in `load'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'", "/usr/local/var/rbenv/versions/2.3.1/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'", "-e:1:in `<main>'"]
olly commented 6 years ago

I managed to fix this by altering SelectFromDatabase. I believe the root cause is that SelectFromDatabase tries to collect it's values before the table is fully configured, and there's no primary key. When rails tries to reorder as part of the in_batches method, it tries to use a nil primary key.

See: https://gist.github.com/olly/6388e7db348d1023e340109ea9ce0362

sunitparekh commented 6 years ago

@olly thanks for sending patch to fix issues. I merged your patch and released gem version to v0.8.2 thanks a lot once again for your contribution.