ropensci / vcr

Record and replay HTTP requests
https://docs.ropensci.org/vcr
Other
77 stars 12 forks source link

The `sensitive_remove` function interprets patterns as regular expressions #222

Closed tomsing1 closed 3 years ago

tomsing1 commented 3 years ago

https://github.com/ropensci/vcr/blob/13e4bc1ea8b4cc96aaaf89714a7b30301b0a725e/R/filter_sensitive_data.R#L13

I ran into the following error when I tried to hide a password via the filter_sensitive_data argument of the vcr_configure configure function:

Error: invalid regular expression 'Das434hJ?DAsFAf9{zd24', reason 'Invalid contents of {}'

It seems that the presence of the { character is throwing off R's gsub function within vcr's sensitive_remove function.

This reproduces the error:

gsub("Das434hJ?DAsFAf9{zd24", "", "Das434hJ?DAsFAf9{zd24")

Error in gsub("Das434hJ?DAsFAf9{zd24", "", "Das434hJ?DAsFAf9{zd24") : 
  invalid regular expression 'Das434hJ?DAsFAf9{zd24', reason 'Invalid contents of {}'

Adding the fixed=TRUE argument to the gsub call fixes it:

gsub("Das434hJ?DAsFAf9{zd24", "", "Das434hJ?DAsFAf9{zd24", fixed = TRUE)

In my case, the passwords I need to remove may contain parentheses (and other characters that might throw off gsub), so adding fixed=TRUE seems like a useful change to prevent it from interpreting the pattern as a regular expression.

On the other hand, if you actually intended gsub to scan for and remove regular expressions, then perhaps that needs to be documented?

Thanks a ton for writing & sharing this great package!

tomsing1 commented 3 years ago

As a workaround, I am now escaping the strings specified in the filter_sensitive_data list, e.g.

# lifted from the Hmisc package
# https://www.rdocumentation.org/packages/Hmisc/versions/4.1-1/topics/escapeRegex
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

vcr::vcr_configure(
  filter_sensitive_data = list(
    "SECRET_PASSWORD" = escape_regex(Sys.getenv('SECRET_PASSWORD'))
  ),
  dir = vcr::vcr_test_path("fixtures")
)

Perhaps that's useful for others, in case you prefer not to add fixed = TRUE to the gsub calls....

sckott commented 3 years ago

Thanks for reporting this. I'll have a look

sckott commented 3 years ago

@tomsing1 i've made the fix with fixed=TRUE - can you reinstall remotes::install_github("ropensci/vcr") and try again?

tomsing1 commented 3 years ago

Thanks a lot for looking into this so quickly @sckott ! I installed the latest version and can confirm that this fixes the issue I had.

But - and my apologies if this sounds like I am making things unnecessarily complicated - perhaps this change would break other users' setups?

For example, let's say somebody is actually taking advantage of the regex functionality right now, e.g. to remove a random authentication token:

invisible(vcr::vcr_configure(
  filter_sensitive_data = list(
    "<<TOKEN>>" = '\\{"token":".*"\\}'
  ),
  dir = vcr::vcr_test_path("fixtures")
))

In this case the original default behavior is really helpful. But setting the fixed=TRUE argument in the gsub function calls, this use case would now not be supported any more, e.g. it could break existing user code.

Perhaps exposing fixed=FALSE as an argument to the filter_sensitive_data function and then passing it through to the two gsub calls would be safer? (Setting it to FALSE by default would make this change backward compatible, I believe.)

sckott commented 3 years ago

Good point that it would break in a use case where someone wants to use regex to match a password that is not a fixed string. However, the only behavior I meant to support was the one where we use gsub(..., fixed=TRUE) - and so the code actually follows intended behavior now.

Having said that I guess I could see how it might be useful to use regex matching - but I'm not sure I can myself imagine using it. Even if I issue myself a new API key it is usually another random alphanumeric key and so a regex pattern wouldn't do much.

Can you picture using fixed=FALSE yourself?

tomsing1 commented 3 years ago

Thanks a lot for your quick reply. Great to hear that the fixed use case was what you had in mind.

The "<<TOKEN>>" = '\\{"token":".*"\\}' example is actually something I am using right now. It strips out any alphanumeric pattern preceded by the token: key. Not sure if I am the only one who thinks that's useful, though!

sckott commented 3 years ago

Okay, well 1 regex user suggests there's probably more users out there. I'm not sure yet how we'll do it, one of:

vcr_configure(
  filter_sensitive_data = list(
    exact = list("<<SOME-TOKEN>>" = Sys.getenv("SOME_KEY")), 
    regex = list("<<REGEX-TOKEN>>" = '\\{"token":".*"\\}')
  )
)
sckott commented 3 years ago

@tomsing1 Reinstall to get changes. I've added new config option filter_sensitive_data_regex. One final thing left is this:

With regex replacement, we can't put back the sensitive string unless we capture it and store it somewhere on disk if we want to be able to replace the real string between R sessions. This seems like an added security risk, but is it worth doing? Or do we just say if you use regex string replacement your cassette will not be able to have the real string?

sckott commented 3 years ago

bump @tomsing1

tomsing1 commented 3 years ago

Thanks for the bump - will get back to you as soon as I can!

tomsing1 commented 3 years ago

I finally go to testing the updates you made, apologies for the delay.

Trouble with changes introduced in commit 2e07d00

Initially, I installed the HEAD version (commit) - but that version doesn't record any cassettes on my system (Mac OS X), e.g. it does not create YAML files. More specifically, this commit seems to have broken the recording function. Please let me know if you'd like me to provide more details (or create another issue).

The output of the vcr::vcr_test_path("fixtures") command has changed between the two commits:

Both of them point to the same directory on my system when my current directory is the RStudio project of my R package. So I am not sure what's going on.

Testing sensitive_remove

To focus on this issue, e.g. using regular expressions in the sensitive_remove function I installed commit 25c2bfc, the latest version that actually records cassettes on my system.

It works great:

Thanks for pointing out that patterns matched via regular expressions cannot by put back to avoid storing them on disk. That works for me!

In my use case, I receive tokens after the first round of authentication with a username / password. Afterward, the token is used for subsequent API calls. But because recorded cassettes don't actually make calls, the token is not used (and I set the corresponding environmental variable to foobar during testing, following your example from the README. Does that sound right to you?

Thanks again for enabling both fixed and regex patterns!

tomsing1 commented 3 years ago

One more thing I just noticed (still working with commit 25c2bfc): I had to manually adjust the string in the YAML file, because the pattern was not saved as JSON - and that seems to break the parser:

I changed the line

string: 'token'

to

string: "{\"token\": \"foobar\"}"

to avoid the following error:

Error: lexical error: invalid string in json text.
                                       foobar
                     (right here) ------^

Backtrace:
  1. vcr::use_cassette(...) test-get_token.R:10:2
  3. my_application::get_token(...) test-get_token.R:11:4
  4. httr::content(r, encoding = "UTF-8") /Users/tomsing1/repositories/my_application/R/authentication.R:60:2
  5. httr:::parse_auto(raw, type, encoding, ...)
  6. httr:::parser(...)
  7. jsonlite::fromJSON(...)
  8. jsonlite:::parse_and_simplify(...)
  9. jsonlite:::parseJSON(txt, bigint_as_char)
 10. jsonlite:::parse_string(txt, bigint_as_char)
sckott commented 3 years ago

Thanks for testing!

For vcr::vcr_test_path, I'll have a look - and please do open a new issue

sckott commented 3 years ago

Great to hear that the filter params are working.

Does that sound right to you?

yes, sounds right

sckott commented 3 years ago

For the string not saved as JSON, can you give more detail? What does the use_cassette call look like?

sckott commented 3 years ago

please do open a new issue

no need, see #228