Feature request: automatically sanitize disallowed ndjson chars

brianwdavis commented 4 years ago

Would you be interested in me filing a PR for a config setting that automatically replaces : and \n in messages before they're written? Any ideas what they would get replaced with? I'm using _ and ___ respectively in my logs. It would be up to you whether that's a default or not. I haven't been able to figure out why colons in some places are problematic and not others, but newlines always prevent reading back the logs.

ryapric commented 4 years ago

For this specific decision, I've got it listed in the second paragraph of the loggit 2.0.0 section of NEWS.md here. This is one of those things that I'd rather not let loggit impose on the user, at least for now. If reading the logs with an external tool, maybe someone wants to read a raw string including any "dirty" characters (for whatever reason).

Also, even though there's a relatively short list of offending characters, there are surely more that aren't just three kinds of newlines (like funky Unicode characters or something). By putting the onus of sanitization on the dev, they can have full control over intake. For example, say you've got a Shiny app or Plumber API that you let folks POST to -- there's probably a whole mess of stuff you'd like to control, both in terms of prevention-by-failure as well as custom replacement.

I can see maybe letting loggit take a sanitizer/clean-up function or something, which defaults to no function. But still wondering if that's something worth offering vs. letting the dev do pre-log, since the work of defining that function has to be done by them anyway.

I'd be happy to see a PR idea you come up with though!

ryapric commented 4 years ago

Coming back to this: I'm having a pretty serious change of heart. In order to make loggit dependency-free, I had to reimplement a simple JSON parser from scratch. The problem is, it's based on string parsing and not an AST builder or something, so as of v2.0.1, the logs can't even include things like embedded commas, braces, etc. This is a HUGE problem, and I don't know what I was thinking to let that happen / not test for it.

So, the cleanest solution I can think of that stays true to the promise of being dependency-free, is to implement an auto-sanitizer to complement the string parser, as you've suggested here, but that will have to sanitize all disallowed characters.

I've started working on a default one, not live yet, but would appreciate any input from anyone following along!

Edit: the proposed changes are up on the v.2.1.0 branch.

ryapric commented 4 years ago

This is now on master, with sanitizers and unsanitizers (for read_logs()) supported. Details are up in NEWS.md and in the Rd docs, regarding how to write your own as well as what the defaults do. If all the checks pass from CRAN's end, I'll be pushing this out as v2.1.0 to CRAN within the next few hours.

This was a really great catch @brianwdavis, so thank you! Ideally, though, the logs shouldn't have any invalid (i.e. JSON-reserved) characters anyway, but that's really easy to not do. This change should at least help keep logs, but also guide users to restructure their own code if necessary.

brianwdavis commented 4 years ago

Great! Glad I could provide some inspiration on it. I'm still a little unsure about how the implementation will work in the wild, because I want to be able to log functions that I haven't written the stop(...) and message(...) calls myself.

My logic behind that was that in catching errors/warnings from other packages, they may have pathological chars in them that you're not planning for. Since you're using string methods to build up the log line instead of an AST builder, you're not auto-escaping things.

My own unsatisfactory approach was to write a sanitizer and wrap every top-level function call with a purrr-style adverb to handle any output, but that got unwieldy fast. And rather than write my own dictionary like you did, I used HTML escape chars, like <br> for newline and ' for '.

Another particular pathological output I found by breaking something was \t, which needs to be escaped as an R string to \\t in order for a JSON parser to read it as \t and thus a tab char.

However, I wrote all that and then thought, "well how does a 'proper' JSON parser do it?" And turns out it's not much more than the cases you've already written into the sanitizer, so that's pretty much where I'd want it to be.

...
      case '\\':
      case '"':
      case '\n':
      case '\r':
      case '\t':
      case '\b':
      case '\f':
...

from: https://github.com/jeroen/jsonlite/blob/master/src/escape_chars.c

In regards to my larger point about capturing "foreign" errors in the logfile, it looks like it might be impossible without a wrapper like I implemented that has to be done on every top-level call, which is not the best design pattern to say the least. If a function in a package calls stop, it uses base::stop instead of the mask loggit::stop, and I can't think of a way around that.

brianwdavis commented 4 years ago

To give a reprex of how I'm currently handling it, in an unsatisfactory way, but the only way I can think of:

Note that I haven't installed your latest version, so it doesn't do the automatic unsanitizing of newlines/__LF__ (or quotes) with read_logs() but otherwise I think the behavior is the same.

library(loggit)
#> Warning: package 'loggit' was built under R version 3.6.2
#> 
#> Attaching package: 'loggit'
#> The following objects are masked from 'package:base':
#> 
#>     message, stop, warning

default_ndjson_sanitizer <- function(string, sanitize = TRUE) {
  # String map; will dispatch left-vs.-right replacement based on `sanitize` bool
  map <- list(
    "\\{" = "__LEFTBRACE__",
    "\\}" = "__RIGHTBRACE__",
    '"' = "__DBLQUOTE__",
    "," = "__COMMA__",
    "\r" = "__CR__",
    "\n" = "__LF__"
  )

  for (k in names(map)) {
    if (sanitize) {
      string <- gsub(k, map[k], string)
    } else {
      string <- gsub(map[k], k, string)
    }
  }

  string
}

whinily <- function(f) {
  function(...) {
    withCallingHandlers(
      f(...),
      message = function(m) loggit("INFO", default_ndjson_sanitizer(as.character(m))),
      warning = function(w) loggit("WARN", default_ndjson_sanitizer(as.character(w))),
      error = function(e) loggit("ERROR", default_ndjson_sanitizer(as.character(e)))
    )
  }
}

noisy_fun <- function(x) {
  if (x == 1) {
    base::message("x = 1")
    return(x)
  }

  if (x == 2) {
    base::warning("x = 2")
    return(x)
  }

  if (x == 3) {
    base::stop("x = 3")
  }
}

noisy_fun(1)
#> x = 1
#> [1] 1
noisy_fun(2)
#> Warning in noisy_fun(2): x = 2
#> [1] 2
noisy_fun(3)
#> Error in noisy_fun(3): x = 3

read_logs()
#> Error in read_logs(): Log file does not exist

##############################################
# Notice no conditions have been logged yet! #
##############################################

whinily(noisy_fun)(1)
#> {"timestamp": "2020-06-08T13:20:54-0400", "log_lvl": "INFO", "log_msg": "simpleMessage in base::message(__DBLQUOTE__x = 1__DBLQUOTE__): x = 1__LF____LF__"}
#> x = 1
#> [1] 1
whinily(noisy_fun)(2)
#> {"timestamp": "2020-06-08T13:20:54-0400", "log_lvl": "WARN", "log_msg": "simpleWarning in f(...): x = 2__LF__"}
#> Warning in f(...): x = 2
#> [1] 2
whinily(noisy_fun)(3)
#> {"timestamp": "2020-06-08T13:20:54-0400", "log_lvl": "ERROR", "log_msg": "Error in f(...): x = 3__LF__"}
#> Error in f(...): x = 3

read_logs()
#>                  timestamp log_lvl
#> 1 2020-06-08T13:20:54-0400    INFO
#> 2 2020-06-08T13:20:54-0400    WARN
#> 3 2020-06-08T13:20:54-0400   ERROR
#>                                                         log_msg
#> 1 simpleMessage in base::message(__DBLQUOTE__x = 1__DBLQUOTE__)
#> 2                                       simpleWarning in f(...)
#> 3                                               Error in f(...)

###############################################
# Conditions were logged correctly, but       #
# reading prematurely terminates at the colon #
###############################################

cat(readLines(get_logfile()), sep = "\n")
#> {"timestamp": "2020-06-08T13:20:54-0400", "log_lvl": "INFO", "log_msg": "simpleMessage in base::message(__DBLQUOTE__x = 1__DBLQUOTE__): x = 1__LF____LF__"}
#> {"timestamp": "2020-06-08T13:20:54-0400", "log_lvl": "WARN", "log_msg": "simpleWarning in f(...): x = 2__LF__"}
#> {"timestamp": "2020-06-08T13:20:54-0400", "log_lvl": "ERROR", "log_msg": "Error in f(...): x = 3__LF__"}

#######################################
# Raw written text is correct though! #
#######################################

^{Created on 2020-06-08 by the reprex package (v0.3.0)}

brianwdavis commented 4 years ago

I've given some more thought to this other aspect, capturing unplanned, unknowable external errors and messages. I don't think that what I want is possible in a CRAN-published package.

The only method I can see of making it work on any random code is to use assignInNamespace to overwrite base::message, base::warning, and base::stop, instead of just masking them in the search path, which is not permitted on CRAN as far as I know. And that won't capture conditions from packages that use other alerting interfaces, like rlang::abort.

Moreover, my workaround adverb doesn't capture output from functions that are not called from the top level. For example, ggplot2::stat_ellipse calls a function which calls a function calculate_ellipse, which contains a call to message. Wrapping with whinily(stat_ellipse)(...) won't write any messages to the logfile from that inner function.

Ultimately I think I'm stumped. I could write my own package from the ground up by calling sink at the top of every file, but I'd rather not go down that road. For what it's worth, the reason I was drawn to this package instead of the other logging packages out there is that it requires so little extra infrastructure. I think expecting a solution with zero extra infrastructure is just not possible though. Thanks!

ryapric commented 4 years ago

The only method I can see of making it work on any random code is to use assignInNamespace to overwrite base::message, base::warning, and base::stop, instead of just masking them in the search path, which is not permitted on CRAN as far as I know. And that won't capture conditions from packages that use other alerting interfaces, like rlang::abort.

Ahhh see that was one of my very first thoughts, too! If you look at old versions of the README, you'll see that I "had that feature in the works" at one point a few years ago. I tried to solve it in a similar way that you described, but even more crudely: look at the packaged loaded into the current namespace, find their library locations on disk, edit their Imports: field, and recompile/reinstall each of them so that loggit was part of their native NAMESPACE files. But that just felt... like a tremendous, dirty abuse of control on a host machine. Plus, you're mostly correct that CRAN would frown on that super hard, but as long as it's an explicit request by the user after install/load, then it would be "permitted". But it just doesn't feel right at all.

It's still definitely one of those things that I want to explore more here, but I want to make sure that whatever the eventual solution is is clean & efficient, and not kludgy or crude like that.

ryapric / loggit

Feature request: automatically sanitize disallowed ndjson chars #14