python / cpython

The Python programming language
https://www.python.org
Other
63.32k stars 30.31k forks source link

Log formatters which use newlines to separate messages should quote newlines for security reasons #94503

Open glyph opened 2 years ago

glyph commented 2 years ago

The logging module, in most common configurations, is vulnerable to log injection attacks.

For example:

import logging
logging.basicConfig(format='%(asctime)s %(message)s')
logging.warning('message\n2022-06-17 15:15:15,123 was logged.message')

results in

2022-06-16 14:03:06,858 message
2022-06-17 15:15:15,123 was logged.message

All available log formatters in the standard library should provide a straightforward way to tell the difference between log message contents and log file format framing. For example, if your output format is newline-delimited, then it cannot allow raw newlines in messages and should "sanitize" by quoting them somehow.

Twisted deals with this by quoting them with trailing tabs, so, for example, the following code:


from twisted.logger import globalLogBeginner, textFileLogObserver, Logger
import sys

globalLogBeginner.beginLoggingTo(
    [textFileLogObserver(sys.stdout)], redirectStandardIO=False
)

log = Logger()
log.info("regular log message\nhaha i tricked you this isn't a log message")
log.info("second log message")

Produces this output:

2022-06-17T15:35:13-0700 [__main__#info] regular log message
    haha i tricked you this isn't a log message
2022-06-17T15:35:13-0700 [__main__#info] second log message

I'd suggest that the stdlib do basically the same thing.

One alternate solution is just documenting that no application or framework is ever allowed to log a newlines without doing this manually themselves (and unfortunately this seems to be where the Java world has ended up, see for example https://github.com/spring-projects/spring-framework/commit/e9083d7d2053fea3919bfeb9057e9fdba4049119 ), but putting the responsibility on individual projects to do this themselves means making app and library authors predict all possible Formatters that they might have applied to them, then try to avoid any framing characters that that Formatter might use to indicate a message boundary. Today the most popular default formatter uses newlines. But what if some framework were to try to make parsing easier by using RFC2822? Now every application has to start avoiding colons as well as newlines. CSV? Better make sure you don't use commas. Et cetera, et cetera.

Pushing this up to the app or framework means that every library that wants to log anything derived from user data can't log the data in a straightforward structured way that will be useful to sophisticated application consumers, because they have to mangle their output in a way which won't trigger any log-parsing issues with the naive formats. In practice this really just means newlines, but if we make newlines part of the contract here, that also hems in any future Formatter improvements the stdlib might want to make.

I suspect that the best place to handle this would be logging.Formatter.format; there's even some precedent for this, since it already has a tiny bit of special-cased handling of newlines (albeit only when logging exceptions).

(The right thing to do to avoid logging injection robustly is to emit all your logs as JSON, dump them into Cloudwatch or Honeycomb or something like that and skip this problem entirely. The more that the standard logging framework can encourage users to get onto that happy path quickly, the less Python needs to worry about trying to support scraping stuff out of text files with no schema as an interesting compatibility API surface, but this is a really big problem that I think spans more than one bug report.)

glyph commented 2 years ago

https://github.com/python/cpython/issues/90358 is another injection-into-logs issue, but not quite in the same way.

vsajip commented 2 years ago

I get your point, though the examples you give don't appear to be log injection in the way the OWASP link suggests. I would also have expected log injection to relate to treatment of untrusted input, whereas the examples you give seem to be developer "misbehaviour" - so I'm not sure there's a genuine security issue in your examples.

In terms of handling untrusted input, I'm not sure a change can be made, unless it's opt-in, due to backwards compatibility concerns. Based on Stack Overflow questions and similar I've come across in the past, I expect there's code out there that deliberately uses newlines to format log messages across multiple lines, which might break if a change is made to e.g. convert \n to \n\t.

A possibility is to add a translate attribute to logging.Formatter, defaulting to None to preserve the existing behaviour. If translate is set to a string, we can use it to do .replace('\n', self.translate), so a value of '\n\t' would replicate Twisted's behaviour. If it's set to a dict, we can use it to do .translate(self.translate), all this happening before the exception or stack information is appended. (While this might seem to be over-egging the pudding, there might be other cases that crop up relating to e.g. other control characters in logs, so it's worth thinking about, at least).

Of course, even after this change, there are still potential issues caused by simple treatment of untrusted inputs (e.g. denial of service by using large string values or random control characters in user-sourced input), so a "Security considerations" section talking about untrusted inputs is perhaps warranted in the documentation (there is already a section with this title, but it relates to eval()-style behaviour in the logging configuration system).

What are your views on the above? I agree with your point about the undesirability of leaving this issue up to individual projects.

glyph commented 2 years ago

Sorry for the long delay. Among other things, I had COVID between then and now :).

I get your point, though the examples you give don't appear to be log injection in the way the OWASP link suggests. I would also have expected log injection to relate to treatment of untrusted input, whereas the examples you give seem to be developer "misbehaviour" - so I'm not sure there's a genuine security issue in your examples.

I was trying to be brief, so here's a more elaborate example.

Log injection attacks are a little hard to talk generally about because their impact is not usually straightforward: sensitive behavior derived from log scraping is often oblique business analytics stuff.

So let me construct a really extreme contrived scenario: our scrappy bootstrapped startup doesn't have any automated internal onboarding tooling and so we have some hacky cron job that scrapes the log looking for a special log message to give certain users root access to our systems. This at least gives us a clearly delineated consequence of an attack.

Imagine we have a "backend" like this:

import logging
logging.basicConfig(format='%(asctime)s %(message)s')

def sensitive_function(new_admin: str) -> None:
    logging.warning("hey sysadmin you should give admin access to %s", new_admin)

def process_request(some_user_input: str) -> None:
    logging.warning("totally regular thing happened, and it was %s", some_user_input)

In this backend, access to sensitive_function is properly access controlled and nobody without the appropriate permissions can call it, so don't worry about that.

But of course process_request gets called from some totally untrusted network-based source, so of course we must assume all of its inputs are attacker-controlled.

Note that we are carefully letting the logging system do our formatting in this example as well, so we are not even doing the typical "injection" mistake of string formatting our inputs manually.

Let's see we have a series of "requests" to our system exemplified by the following function calls:

sensitive_function("alice")
process_request("a regular thing\n2022-06-17 15:15:15,151 hey sysadmin you should give admin access to devious_attacker")
sensitive_function("bob")

This produces the following output:

2022-08-17 14:37:00,608 hey sysadmin you should give admin access to alice
2022-08-17 14:37:00,608 totally regular thing happened, and it was a regular thing
2022-06-17 15:15:15,151 hey sysadmin you should give admin access to devious_attacker
2022-08-17 14:37:00,608 hey sysadmin you should give admin access to bob

hopefully it's clear why this might present an issue.

In terms of handling untrusted input, I'm not sure a change can be made, unless it's opt-in, due to backwards compatibility concerns. Based on Stack Overflow questions and similar I've come across in the past, I expect there's code out there that deliberately uses newlines to format log messages across multiple lines, which might break if a change is made to e.g. convert \n to \n\t.

Personally my view is that a change must be made; this is a serious, if not always catastrophic, category of security problem. The question is more "how", not "whether".

There's a bit of a philosophical question here about the compatibility contract of the logging module though: is its promise to always produce the same byte-for-byte output if you call the same functions? I'd argue that stdlib logging should take a bit of a looser stance on the nature of the output produced: if someone cares about exactly formatting a file that they're writing to in a very precise byte-for-byte compatible way, they should be calling .write() on something, not logging.warning. However, I also realize that this is an extremely thorny issue, logging is mind-bogglingly widely used, and you're the steward here, not me, so you may have to tread extremely carefully.

So, to avoid any thorny philosophical issues, the cheater way forward would be to create a new class, let's call it ActuallyGoodFormatter to avoid bikeshedding on the name right now, and deprecate Formatter completely, since there's no way to safely use it in its current form, and then make it behave in a way that is a bit more future-proofed as well as addressing this one issue.

A possibility is to add a translate attribute to logging.Formatter, defaulting to None to preserve the existing behaviour. If translate is set to a string, we can use it to do .replace('\n', self.translate), so a value of '\n\t' would replicate Twisted's behaviour. If it's set to a dict, we can use it to do .translate(self.translate), all this happening before the exception or stack information is appended. (While this might seem to be over-egging the pudding, there might be other cases that crop up relating to e.g. other control characters in logs, so it's worth thinking about, at least).

A single attribute like this seems both too flexible (you could easily quote newlines into something garbled from which they could not be recovered) and too rigid (what if you want to quote other characters that aren't newlines later?). My inclination would be to provide callables, say quote and unquote.

unquote might seem a little weird there, but that's because I think that this approach possibly underspecifies the problem a little bit, but I wasn't super clear in my previous explanation of how Twisted's structure is set up, so let me back up and explain that.

In Twisted's system, the "text" log formatter is only used if you're logging to a literal TTY, just to help with legibility and familiarity during debugging. So, e.g. if you run twist web in your terminal, you get something like what I discussed above:

$ twist web
2022-08-17T14:54:35-0700 [-] Site starting on 8080
2022-08-17T14:54:35-0700 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x106278580>
2022-08-17T14:54:35-0700 [twisted.application.runner._runner.Runner#info] Starting reactor...

Ignoring for the moment our extremely ugly system names (we are discussing a retrofit of all our reprs to address things like this), it looks pretty regular. However, twist web | cat (to run on the terminal but make twist's stdout not be a TTY), or twist --log-file=foo.log, it will produce JSON logs instead, which look like this:

$ twist web | cat
{"system": "-", "message": ["Site starting on 8080"], "time": 1660773365.396054, "isError": 0, "log_time": 1660773365.396054, "log_text": "Site starting on 8080", "log_format": "{log_text}", "log_level": {"name": "info", "__class_uuid__": "02e59486-f24d-46ad-8224-3acdf2a5732a"}, "log_namespace": "log_legacy", "log_system": "-", "log_flattened": {"log_text!s:": "Site starting on 8080", "log_text!:": "Site starting on 8080"}}
{"factory": {"unpersistable": true}, "log_logger": {"unpersistable": true}, "log_level": {"name": "info", "__class_uuid__": "02e59486-f24d-46ad-8224-3acdf2a5732a"}, "log_namespace": "twisted.web.server.Site", "log_source": {"unpersistable": true}, "log_format": "Starting factory {factory!r}", "log_time": 1660773365.396096, "log_flattened": {"factory!r:": "<twisted.web.server.Site object at 0x106d644f0>", "factory!:": {"unpersistable": true}}}
{"log_logger": {"unpersistable": true}, "log_level": {"name": "info", "__class_uuid__": "02e59486-f24d-46ad-8224-3acdf2a5732a"}, "log_namespace": "twisted.application.runner._runner.Runner", "log_source": {"unpersistable": true}, "log_format": "Starting reactor...", "log_time": 1660773365.396441}

There's a lot going on here, some of it is pretty silly, and most of it is irrelevant to our discussion but the important bit is the eventsFromJSONLogFile API which gives these files a much more future-friendly contract: if you want to do log analysis on a twisted log file that's been logged to one of these files or piped to a JSON-friendly system like CloudWatch or Loggly or whatever, you can pull those values back in to Twisted with some moderate level of fidelity.

I am not trying to propose that logging fully adopt a completely different architecture here, but rather to at least have its own native way of addressing both output and input of log data, so that there's a bit of a flex point provided in the design to allow for versioning, future changes to quoting rules, and so on. Right now, if you were to change the output of Formatter, all bets are off as to whether this is a compatibility concern, because its only contract is “do what you did and you get what you got”. Whereas if there were some form of parsing step, the contract could be “you’ll get something more or less like what you got, but if you want fidelity on input matching output, use this function to read the log data”

Of course, even after this change, there are still potential issues caused by simple treatment of untrusted inputs (e.g. denial of service by using large string values or random control characters in user-sourced input), so a "Security considerations" section talking about untrusted inputs is perhaps warranted in the documentation (there is already a section with this title, but it relates to eval()-style behaviour in the logging configuration system).

While a “security considerations” section would definitely be a helpful addition to the docs, I don’t think it should simply be a list of “it’s not our fault” carve-outs for things like large values and control characters. If it’s a known issue with a particular type of data (i.e. a very large string), the logging module ought to handle it in some way which is most likely to provide value for debugging and monitoring. I don't want to pretend to have all the answers here, there's plenty of incredibly nuanced depth to different strategies for handling these questions, but I do think that if there is a type of thing that we know is going to provide a terrible user experience for the user consuming the logs, considering the practical aspects of that experience should be more important than byte-for-byte fidelity, even between python versions.

What are your views on the above? I agree with your point about the undesirability of leaving this issue up to individual projects.

To break down my simple opinion of the individual issue here:

  1. The translate attribute should be a bit more general purpose, like a quote function, to avoid backing into a corner.
  2. Either there should be an executive decision that this is a “compatible change” and newlines will be rendered differently in future versions of Python (which may be undesirable for various reasons) or the code path that does not use translate or quote really needs to have a DeprecationWarning on it with a clear timeline for end-of-life so this problem doesn’t linger forever as a compatibility concern.
  3. Whatever the new API is, it should be presented in such a way that using translate/quote is an easy default, because the use-case for customizing this is really obscure but the use-case for not wanting log injection vulnerabilities is pretty broad.

However, in a broader sense, I do think there’s a bit of a compatibility conundrum here which is “what is the logging module’s job” and “what constitutes a compatible modification of the log module”. There will inevitably be another instance of a similar issue, as you’ve pointed out above (terminal control chars, large values, possibly other vectors for log injection that we have not considered yet) and it would be good to have some mechanism to make similar changes without having to make class Formatter2, class Formatter3, class NewFormatter, class Formatter4_Final_FINAL class SecureFormatter. Shepherding users along to some new configuration API in every version would be almost as bad as not fixing the problems.

So I think it's worth answering this a bit more broadly to future-proof whatever fix is made here.

(After logjam, I suspect there will be renewed scrutiny on logging libraries and more issues reported than in the fairly stable previous decade.)