ropensci / rebird

Wrapper to the eBird API
https://docs.ropensci.org/rebird
Other
83 stars 17 forks source link

Remove IP-based location determination? #115

Open slager opened 8 months ago

slager commented 8 months ago

I'm wondering if we should just remove the IP-based location determination. It's neat, but it creates potential privacy issues by exposing IP address and coordinates in the VCR fixtures. Also my experimenting with it shows that the location can be quite far off from my actual location.

sebpardo commented 8 months ago

Is there a way to hide IP addresses and coordinates in VCR the same way we hide the API key?

jrdnbradford commented 3 months ago

It looks like you can filter data via regex in {vcr}, so something like this would remove coordinates from URLs and JSON strings in the fixtures:

library("vcr") # *Required* as vcr is set up on loading
invisible(vcr::vcr_configure(
  filter_sensitive_data = list("<<<redacted>>>" = Sys.getenv('EBIRD_KEY')),
+ filter_sensitive_data_regex = list(
+   '"lat":<<<redacted>>>' = '("lat"|"latitude"):(-?\\d+(\\.\\d+)?)',
+   "lat=<<<redacted>>>" = '(lat|latitude)=(-?\\d+(\\.\\d+)?)',
+   '"lng":<<<redacted>>>' = '("lng"|:"longitude"):(-?\\d+(\\.\\d+)?)',
+   "lng=<<<redacted>>>" = '(lng|longitude)=(-?\\d+(\\.\\d+)?)'
+ ),
  dir = vcr::vcr_test_path("fixtures")
))

If redaction is the route you want to go I'm happy to take a look.

slager commented 3 months ago

I tried at one point using vcr/regex to redact latitude, longitude, and IP address, but iirc the tricky part is that they occur inside a long string value. If you can get it to work, go for it.

jrdnbradford commented 2 months ago

I tried a few things but every solution I tried that redacted the relevant data ended up causing issues with the vcr testing setup. 🫠

sckott commented 1 month ago

👋🏽 hi, vcr maintainer here. Is there a small reproducible example I can play with to understand the problem and see what the best solution is?

RichardLitt commented 1 month ago

I am not an expert here.

Running testthat::test_file("tests/testthat/test-ebirdgeo.R", reporter="summary"), I expected vcr to make a fixture in the line where I wrote expect_no_error. It didn't and I'm not sure why. But this should call getlatlng, which should look for my IP address and use that to get a location - which is what we're trying to avoid putting into the fixtures.

vcr::use_cassette("ebirdgeo", {
  test_that("ebirdgeo works correctly", {
    expect_warning(egeo <- ebirdgeo('amegfi', 42.45, -76.50, dist = 51, max = 2),
                   "Distance supplied was >50km")
    # Added this line
    expect_no_error(ebirdgeo('amegfi'))
    expect_true(inherits(egeo, "data.frame"))
    expect_gt(NCOL(egeo), 10)
    expect_equal(nrow(egeo), 2)
    expect_true(inherits(egeo$locName, "character"))
    expect_true(inherits(egeo$howMany, "integer"))
    expect_warning(
      prov <- ebirdgeo(
        lat = 42.45, lng = -76.50, max=2,
        provisional=TRUE, hotspot=TRUE, back = 31.5),
      "using 30 days"
    )
    expect_true(inherits(prov, "data.frame"))
  })
})
slager commented 1 month ago

For a reprex of the main problem here as I understand it, clone the vcr_demo branch of this repo and run devtools::test(filter = 'DEMO') to create the DEMO.yml fixture. The goal is for string in this fixture to have the IP address redacted.

RichardLitt commented 1 month ago

Running this:

  filter_sensitive_data_regex = list(": \"<<<redacted>>>\"," = ': ".*",'),

I managed to get this:

http_interactions:
- request:
    method: get
    uri: http://ipinfo.io/
    body:
      encoding: ''
      string: ''
    headers:
      Accept: application/json, text/xml, application/xml, */*
  response:
    status:
      status_code: 200
      category: Success
      reason: OK
      message: 'Success: (200) OK'
    headers:
      access-control-allow-origin: '*'
      content-encoding: gzip
      content-type: application/json; charset=utf-8
      date: Thu, 03 Oct 2024 08:50:13 GMT
      referrer-policy: strict-origin-when-cross-origin
      vary: Accept-Encoding
      x-content-type-options: nosniff
      x-frame-options: SAMEORIGIN
      x-xss-protection: 1; mode=block
      content-length: '236'
      via: 1.1 google
      strict-transport-security: max-age=2592000; includeSubDomains
    body:
      encoding: ''
      file: no
      string: |-
        {
          "ip": "<<<redacted>>>",
          "readme": "https://ipinfo.io/missingauth"
        }
  recorded_at: 2024-10-03 08:50:13 GMT
  recorded_with: vcr/1.6.0, webmockr/0.9.0

I think this is what we want.

RichardLitt commented 1 month ago

Having done that:

I'm wondering if we should just remove the IP-based location determination. It's neat, but it creates potential privacy issues by exposing IP address and coordinates in the VCR fixtures. Also my experimenting with it shows that the location can be quite far off from my actual location.

I agree with removing it. It was quite far off for me - and this is a wrapper for the eBird API, which doesn't automatically look for a location based on IP. I think removing it and throwing a failure if latlng isn't included would be fine.

slager commented 1 month ago

That regex does remove the IP, but unfortunately it also removes the lat-long and all the other fields except the last one.

RichardLitt commented 1 month ago

Right.

I don't understand why the .yml file outputs it as a pretty-printed JSON format, but the vcr regex treats it as a single string. I feel like ".*" is grabbing everything from first to last quotes, instead of stopping at the newlines that I see in the .yml file.

Could there there a smarter way for us to format the string as JSON, and then selectively keep only the fields we want, instead of relying on regex?

RichardLitt commented 1 month ago

Taking a crack at this again. I can't figure out how to use escape characters in the regex, or backreferences as might be possible with gsub().

For instance, for [0-9\.\-,], I get this error:

Error: '\.' is an unrecognized escape in character string (/Users/richard/src/rebird/tests/testthat/helper-vcr.R:4:77)

Basically, filter_sensitive_data_regex only provides a subset of possible regexes, which makes doing this work quite difficult.

slager commented 1 month ago

Periods need to be doubly escaped like \\. See ?regex

RichardLitt commented 1 month ago

Thank you. I think I got it - take a look at #138.

sckott commented 1 month ago

@RichardLitt

I don't understand why the .yml file outputs it as a pretty-printed JSON format, but the vcr regex treats it as a single string. I feel like ".*" is grabbing everything from first to last quotes, instead of stopping at the newlines that I see in the .yml file.

in vcr we just read the whole file in with the yaml package, and then I can index to different parts of the request or response. the newlines are not meagningful, i.e., I would not write code accounting for them

sckott commented 1 month ago

Looks like you got it working in #138 @RichardLitt - let me know if you have any further questions.