Open slager opened 8 months ago
Is there a way to hide IP addresses and coordinates in VCR the same way we hide the API key?
It looks like you can filter data via regex in {vcr
}, so something like this would remove coordinates from URLs and JSON strings in the fixtures:
library("vcr") # *Required* as vcr is set up on loading
invisible(vcr::vcr_configure(
filter_sensitive_data = list("<<<redacted>>>" = Sys.getenv('EBIRD_KEY')),
+ filter_sensitive_data_regex = list(
+ '"lat":<<<redacted>>>' = '("lat"|"latitude"):(-?\\d+(\\.\\d+)?)',
+ "lat=<<<redacted>>>" = '(lat|latitude)=(-?\\d+(\\.\\d+)?)',
+ '"lng":<<<redacted>>>' = '("lng"|:"longitude"):(-?\\d+(\\.\\d+)?)',
+ "lng=<<<redacted>>>" = '(lng|longitude)=(-?\\d+(\\.\\d+)?)'
+ ),
dir = vcr::vcr_test_path("fixtures")
))
If redaction is the route you want to go I'm happy to take a look.
I tried at one point using vcr/regex to redact latitude, longitude, and IP address, but iirc the tricky part is that they occur inside a long string value. If you can get it to work, go for it.
I tried a few things but every solution I tried that redacted the relevant data ended up causing issues with the vcr testing setup. 🫠
👋🏽 hi, vcr maintainer here. Is there a small reproducible example I can play with to understand the problem and see what the best solution is?
I am not an expert here.
Running testthat::test_file("tests/testthat/test-ebirdgeo.R", reporter="summary")
, I expected vcr to make a fixture in the line where I wrote expect_no_error
. It didn't and I'm not sure why. But this should call getlatlng
, which should look for my IP address and use that to get a location - which is what we're trying to avoid putting into the fixtures.
vcr::use_cassette("ebirdgeo", {
test_that("ebirdgeo works correctly", {
expect_warning(egeo <- ebirdgeo('amegfi', 42.45, -76.50, dist = 51, max = 2),
"Distance supplied was >50km")
# Added this line
expect_no_error(ebirdgeo('amegfi'))
expect_true(inherits(egeo, "data.frame"))
expect_gt(NCOL(egeo), 10)
expect_equal(nrow(egeo), 2)
expect_true(inherits(egeo$locName, "character"))
expect_true(inherits(egeo$howMany, "integer"))
expect_warning(
prov <- ebirdgeo(
lat = 42.45, lng = -76.50, max=2,
provisional=TRUE, hotspot=TRUE, back = 31.5),
"using 30 days"
)
expect_true(inherits(prov, "data.frame"))
})
})
For a reprex of the main problem here as I understand it, clone the vcr_demo
branch of this repo and run devtools::test(filter = 'DEMO')
to create the DEMO.yml fixture. The goal is for string
in this fixture to have the IP address redacted.
Running this:
filter_sensitive_data_regex = list(": \"<<<redacted>>>\"," = ': ".*",'),
I managed to get this:
http_interactions:
- request:
method: get
uri: http://ipinfo.io/
body:
encoding: ''
string: ''
headers:
Accept: application/json, text/xml, application/xml, */*
response:
status:
status_code: 200
category: Success
reason: OK
message: 'Success: (200) OK'
headers:
access-control-allow-origin: '*'
content-encoding: gzip
content-type: application/json; charset=utf-8
date: Thu, 03 Oct 2024 08:50:13 GMT
referrer-policy: strict-origin-when-cross-origin
vary: Accept-Encoding
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-xss-protection: 1; mode=block
content-length: '236'
via: 1.1 google
strict-transport-security: max-age=2592000; includeSubDomains
body:
encoding: ''
file: no
string: |-
{
"ip": "<<<redacted>>>",
"readme": "https://ipinfo.io/missingauth"
}
recorded_at: 2024-10-03 08:50:13 GMT
recorded_with: vcr/1.6.0, webmockr/0.9.0
I think this is what we want.
Having done that:
I'm wondering if we should just remove the IP-based location determination. It's neat, but it creates potential privacy issues by exposing IP address and coordinates in the VCR fixtures. Also my experimenting with it shows that the location can be quite far off from my actual location.
I agree with removing it. It was quite far off for me - and this is a wrapper for the eBird API, which doesn't automatically look for a location based on IP. I think removing it and throwing a failure if latlng isn't included would be fine.
That regex does remove the IP, but unfortunately it also removes the lat-long and all the other fields except the last one.
Right.
I don't understand why the .yml file outputs it as a pretty-printed JSON format, but the vcr regex treats it as a single string. I feel like ".*"
is grabbing everything from first to last quotes, instead of stopping at the newlines that I see in the .yml file.
Could there there a smarter way for us to format the string as JSON, and then selectively keep only the fields we want, instead of relying on regex?
Taking a crack at this again. I can't figure out how to use escape characters in the regex, or backreferences as might be possible with gsub()
.
For instance, for [0-9\.\-,]
, I get this error:
Error: '\.' is an unrecognized escape in character string (/Users/richard/src/rebird/tests/testthat/helper-vcr.R:4:77)
Basically, filter_sensitive_data_regex
only provides a subset of possible regexes, which makes doing this work quite difficult.
Periods need to be doubly escaped like \\.
See ?regex
Thank you. I think I got it - take a look at #138.
@RichardLitt
I don't understand why the .yml file outputs it as a pretty-printed JSON format, but the vcr regex treats it as a single string. I feel like ".*" is grabbing everything from first to last quotes, instead of stopping at the newlines that I see in the .yml file.
in vcr we just read the whole file in with the yaml package, and then I can index to different parts of the request or response. the newlines are not meagningful, i.e., I would not write code accounting for them
Looks like you got it working in #138 @RichardLitt - let me know if you have any further questions.
I'm wondering if we should just remove the IP-based location determination. It's neat, but it creates potential privacy issues by exposing IP address and coordinates in the VCR fixtures. Also my experimenting with it shows that the location can be quite far off from my actual location.