sunitparekh / data-anonymization

Want to use production data for testing, data-anonymization can help you.
MIT License
459 stars 92 forks source link

whitelisted fields are not persisted in the destination database #17

Closed jabley closed 11 years ago

jabley commented 12 years ago

I'm trying to use this gem, but having problems copying the data over.

I have this but the resulting table only contains the id column; the created_at and updated_at fields are populated but with the current time rather than the values in the source database and all of the other fields are empty / nil.

When I run the task under the debugger, I see this

/var/govuk/data-anonymization/lib/strategy/whitelist.rb:23
dest_record.save!
(rdb:1) p dest_record_map
{"legacy_id"=>"1", "group_description"=>"Trading", "category_description"=>"Loss of Market", "code"=>"A.10.10", "code_description"=>"Competition"}
(rdb:1) p dest_record
#<#<Class:0x000000071bd1c0> id: 1, legacy_id: nil, group_description: nil, category_description: nil, code: nil, code_description: nil, ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>

To me, this looks like I've set up the anonymization correctly (for now, I just want to copy the data and once I'm happy that works, I'll add the correct anonymization strategies and other tables in) but the data isn't being persisted for some reason.

Can you help me understand what I'm doing wrong?

ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] running under Ubuntu 10.04

jabley commented 12 years ago

It looks to be a rails mass assignment thing. I'm using rails 3.2.8.

(rdb:1) p dest_record
#<#<Class:0x00000005dfc870> id: 1, legacy_id: nil, group_description: nil, category_description: nil, code: nil, code_description: nil, ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>
(rdb:1) p dest_record_map
{"legacy_id"=>"1", "group_description"=>"Trading", "category_description"=>"Loss of Market", "code"=>"A.10.10", "code_description"=>"Competition"}
(rdb:1) eval dest_record.attributes= dest_record_map
{"legacy_id"=>"1", "group_description"=>"Trading", "category_description"=>"Loss of Market", "code"=>"A.10.10", "code_description"=>"Competition"}
(rdb:1) p dest_record
#<#<Class:0x00000005dfc870> id: 1, legacy_id: nil, group_description: nil, category_description: nil, code: nil, code_description: nil, ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>
(rdb:1) eval dest_record.assign_attributes(dest_record_map, without_protection: true)
nil
(rdb:1) p dest_record
#<#<Class:0x00000005dfc870> id: 1, legacy_id: "1", group_description: "Trading", category_description: "Loss of Market", code: "A.10.10", code_description: "Competition", ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>

The mass assignment sanitizer is correct, but perhaps isn't quite having the intended effect.

(rdb:1) p dest_record.class._mass_assignment_sanitizer
=> #<DataAnon::Utils::MassAssignmentIgnoreSanitizer:0x0000000682dce0>

I can potentially work around it by changing the assignment to happen not by passing it to the initializer, but instead by doing dest_record.assign_attributes(dest_record_map, without_protection: true) but that feels pretty nasty. I'll have a poke around for cleaner solutions.

sunitparekh commented 12 years ago

thank u figuring out the problem. Give me a day and I will get something working in for you. I am out from the no internet zone and can work freely to help you

regards, Sunit.

On Fri, Nov 9, 2012 at 6:24 PM, James Abley notifications@github.comwrote:

It looks to be a rails mass assignment thing. I'm using rails 3.2.8.

(rdb:1) p dest_record

<#Class:0x00000005dfc870 id: 1, legacy_id: nil, group_description: nil, category_description: nil, code: nil, code_description: nil, ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>

(rdb:1) p dest_record_map {"legacy_id"=>"1", "group_description"=>"Trading", "category_description"=>"Loss of Market", "code"=>"A.10.10", "code_description"=>"Competition"} (rdb:1) eval dest_record.attributes= dest_record_map {"legacy_id"=>"1", "group_description"=>"Trading", "category_description"=>"Loss of Market", "code"=>"A.10.10", "code_description"=>"Competition"} (rdb:1) p dest_record

<#Class:0x00000005dfc870 id: 1, legacy_id: nil, group_description: nil, category_description: nil, code: nil, code_description: nil, ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>

(rdb:1) eval dest_record.assign_attributes(dest_record_map, without_protection: true) nil (rdb:1) p dest_record

<#Class:0x00000005dfc870 id: 1, legacy_id: "1", group_description: "Trading", category_description: "Loss of Market", code: "A.10.10", code_description: "Competition", ar_timestamp: nil, ar_insert_timestamp: nil, created_at: nil, updated_at: nil>

The mass assignment sanitizer is correct, but perhaps isn't quite having the intended effect.

(rdb:1) p dest_record.class._mass_assignment_sanitizer => #DataAnon::Utils::MassAssignmentIgnoreSanitizer:0x0000000682dce0

I can potentially work around it by changing the assignment to happen not by passing it to the initializer, but instead by doing dest_record.assign_attributes(dest_record_map, without_protection: true) but that feels pretty nasty. I'll have a poke around for cleaner solutions.

— Reply to this email directly or view it on GitHubhttps://github.com/sunitparekh/data-anonymization/issues/17#issuecomment-10237246.

thanks & regards, Sunit parekh.sunit@gmail.com

sunitparekh commented 11 years ago

to replicate the error you are getting I added timestamp to my tests. However, things are working fine as expected for me. Checkout the following test that I updated for testing your scenario.

https://github.com/sunitparekh/data-anonymization/blob/master/spec/acceptance/rdbms_whitelist_spec.rb#L51

Let me know if the scenario is something different. I am online on Skype for most of the day. My skype id 'sunitparekh'. Will be happy to help you today. I have time.

jabley commented 11 years ago

Sorry, been doing family stuff most of the day. Thanks for the offer.

I guess I have 2 issues here.

  1. timestamp fields aren't being populated from the source database.
  2. The fields that are being read from the source database aren't being persisted.

The debugger output above shows that fields are being read from the source database, but the timestamp fields aren't.

The mass assignment problem I described and potential fix would cure my problem with fields not being persisted, but I'm not clear why the timestamp fields don't appear to be read from the source database.

I'll take a look into that tomorrow.

sunitparekh commented 11 years ago

Would it be possible to create one sample test send me across, so I can put that as my test and reproduce the issue. That will help a lot to fix the problem.

regards, Sunit.

On Sun, Nov 11, 2012 at 11:43 PM, James Abley notifications@github.comwrote:

Sorry, been doing family stuff most of the day. Thanks for the offer.

I guess I have 2 issues here.

  1. timestamp fields aren't being populated from the source database.
  2. The fields that are being read from the source database aren't being persisted.

The debugger output above shows that fields are being read from the source database, but the timestamp fields aren't.

The mass assignment problem I described and potential fix would cure my problem with fields not being persisted, but I'm not clear why the timestamp fields don't appear to be read from the source database.

I'll take a look into that tomorrow.

— Reply to this email directly or view it on GitHubhttps://github.com/sunitparekh/data-anonymization/issues/17#issuecomment-10273447.

thanks & regards, Sunit parekh.sunit@gmail.com

jabley commented 11 years ago

I have a test case, after a fashion.

https://github.com/alphagov/EFG/commits/data-extract-failure contains a copy of a test from the data-anonymization gem. It passes (in isolation, there are some other effects going on that cause existing specs to fail) but if you check out that repo, use the data-extract-failure branch:

bundle exec rake spec SPEC=spec/lib/extractor_spec.rb

can run the test in isolation, and the modification to CustomerSample.insert_record is needed to avoid a test failure due to what looks like mass assignment protection. If you remove that change, it will fail.

Hope this helps.

UPDATE you'll need to remove the pending from the spec as well; it's a WIP.

The failure I see when running that spec:

1) Extractor should anonymize customer table record 
     Failure/Error: new_rec.address.should == 'F 501 Shanti Nagar'
       expected: "F 501 Shanti Nagar"
            got: nil (using ==)
     # ./spec/lib/extractor_spec.rb:48:in `block (2 levels) in <top (required)>'
jabley commented 11 years ago

I think it's down to our use of config.active_record.whitelist_attributes = true

I'll see if I can create a better test case within the data-anonymization project which demonstrates this.

sunitparekh commented 11 years ago

Done. I have accepted your pull request with one minor modification.

in place of doing it as assign_attributes call, I passed the config value without_protection: true in constructer. Your test is passing after that change, so I hope that's fine.

Thank you using library. Let me know if I can help with any other issues.

sunitparekh commented 11 years ago

Released v0.5.3 Happy data anonymization :-)