launchdarkly / node-server-sdk

LaunchDarkly Server-side SDK for Node
Other
79 stars 65 forks source link

Fix/throw err on malformed logger #185

Closed maxwellgerber closed 4 years ago

maxwellgerber commented 4 years ago

Requirements

Related issues

During LaunchDarkly's April 21 incident, our sdk (5.10.3) reached the following code:

https://github.com/launchdarkly/node-server-sdk/blob/f95665c68c37bc81d47274526ba4abac8ed0bcbe/streaming.js#L33-L34

This was the first time something in the ld library had called logger.warn.

Our logger calls this method logger.warning

The result:

/home/[OMITTED]/node_modules/launchdarkly-node-server-sdk/streaming.js:34
        config.logger.warn(message);
                      ^
TypeError: config.logger.warn is not a function
    at EventSource.es.onerror (/home/[OMITTED]/node_modules/launchdarkly-node-server-sdk/streaming.js:34:23)
    at EventSource.emit (events.js:321:20)
    at EventSource.EventEmitter.emit (domain.js:482:12)
    at _emit (/home/[OMITTED]/node_modules/launchdarkly-node-server-sdk/eventsource.js:194:17)
    at IncomingMessage.onConnectionClosed (/home/[OMITTED]/node_modules/launchdarkly-node-server-sdk/eventsource.js:44:5)
    at IncomingMessage.emit (events.js:333:22)
    at IncomingMessage.EventEmitter.emit (domain.js:482:12)
    at endReadableNT (_stream_readable.js:1204:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

An error being thrown in an EventEmitter's error handler causes the process to exit. Try running:

const EventEmitter = require('events');

const myEmitter = new EventEmitter();
myEmitter.on('error', (err) => {
    throw new Error('Yikes');
});
myEmitter.emit('error', new Error('whoops!'));

All our services restarting at the same time, across all our environments, was very exciting, but I would really like to not have it happen again.

Describe the solution you've provided

I think it makes sense for configuration.validate to throw an error when passed a logger that doesn't support all the log levels LD uses internally. This way, mismatched logging APIs will be caught way earlier, before code makes it into production.

It would also be great to update the docs: https://github.com/launchdarkly/LaunchDarkly-Docs/pull/38

Describe alternatives you've considered

N/A

Additional context

N/A

eli-darkly commented 4 years ago

The required interface for logger objects is already spelled out in the SDK's TypeScript-based API documentation for LDLogger.

If you're not using TypeScript, then it's not checking for compliance to that interface, and you're right that we could implement some validation logic to do it programmatically. But it's hard to say how far to go with that kind of thing— e.g. just because an object has a warn property does not mean it's a function that takes the parameters we expect it to take. Since there's already a popular mechanism for providing something resembling type-safety in JavaScript, and that is TypeScript, we've generally not tried to do runtime type checking on everything that is passed to us, but have been careful to keep our TypeScript definitions up to date. Those definitions, and the HTML docs that are generated from them, should be considered the source of truth if you're ever not sure what the supported types or properties for something are.

Now, we don't have an absolute rule of not doing safety checks: as you can see in configuration.js, we still have implemented runtime type checking to some extent, specifically in the configuration object. However, there are three problems with applying similar logic to the logger:

  1. Our normal mechanism for reporting that something is wrong in the configuration is... logging a warning. You can see why that'd be a problem in this case.
  2. In general we have tried pretty hard to avoid using exceptions as an error-reporting mechanism. You can still cause errors to happen if you pass very invalid values to various methods, if we did not write the code to tolerate those values... but for anything where we actually know what the problem is, we have always either a) reported the problem using some special return value if possible (like the error reasons available from variationDetail()), or b) logged a warning and chosen some well-defined strategy for graceful degradation. So, in the current configuration.js logic, if you pass a non-numeric value where a number is expected, you won't get an exception—it'll simply ignore that value and fall back to the default value for that property (and log a warning). If we used similar logic for this, falling back to the default would mean using a Winston console logger. Deliberately throwing an exception would be unusual.
  3. As I mentioned earlier, validating that the logger very broadly conforms to the expected interface (like, it has a warn property that is a function) does not really guarantee that it is safe to use as a logger. You could always have a broken logger implementation whose warn method throws an exception, and then you'd have exactly the same problem you ran into here.

Especially because of number 3, I'm thinking that it might make more sense to add an error-proofing wrapper around whatever logger we are using. It is indeed a very bad thing to have an uncaught exception happen while you're trying to handle an error, so if we do all of our logging against the wrapper and it always delegates to the underlying logger with a try block, we can at least prevent that. The question would be what to do at that point— log a warning to the console?

maxwellgerber commented 4 years ago

I don't have much experience providing a library to a large number of external clients, but it certainly sounds like y'all have spent a lot of time and effort creating a POV on sane defaults and library behavior. Thank you for the incredibly detailed writeup. You make a ton of great points I hadn't considered before.

You are correct that we aren't using TypeScript. Up until this point, we've been getting away with just the SDK reference. I now see that it calls out the TypeScript generated docs in the beginning.

There are a ton of fun things that make this problem interesting. Consider the following more of a ramble than anything else.

reported the problem using some special return value if possible

It's hard to see where the return value could come from - considering setting logging isn't specific to any call but the constructor. I suppose the init method could return a wrapped { LDClient, errors[] } object but that would be a breaking change and a pretty ugly API (IMHO).

In general we have tried pretty hard to avoid using exceptions as an error-reporting mechanism

There are two places where the library throws errors today: when no SDK Key is passed in and when the FileDataSource is given an invalid file. Both of these are errors on startup - not during operation. To borrow the K8s terminology, throwing errors in a readiness check is preferable to failing a liveness check later on. Subjectively, I as a developer would want my code to fail fast - error out once or twice when setting up LD locally, and be rock solid in production.

it'll simply ignore that value and fall back to the default value for that property (and log a warning)

Having bad config values be replaced with sane defaults is also desirable. I've liked that same strategy with other infra tools I've used(AWS's DB Parameter Groups being the first example I can think of). However - I think loggers are executable code, not config. More along the lines of dependency injection than anything else. What would the behavior be if a caller provided a featureStore that didn't implement all the required methods? Most of them are called on startup, I think. I imagine an error would be thrown pretty quickly.

Calling all 4 log levels on startup would produce a ton of noise. I don't think that is a good approach.

Another major issue is dropping logs. Falling back to logging to the console as a first step is dangerous because it's very easy for those logs to get lost - or be only available on the production servers and not in whatever log aggregator someone might be using.

I'm thinking that it might make more sense to add an error-proofing wrapper around whatever logger we are using.

I like this idea too - and you make an excellent point about the error-handling - but I don't see any reason we can't have both. Fail fast if the logger passed in obviously won't work, and be cautious when calling into it at all other times.

Summary

I think a log instance should be thought of as dependency injection, not config - and should be approached the same way featureStores are approached. I like the idea of having an error-handling adapter, but I don't think it solves the undefined method case very well. I also don't know what to do when the logger throws an error.

Let me know if you'd rather see the error handler approach, or the error handler + config validation approach. I'll update the PR accordingly.

Thanks!

@gkop I'd love to hear your thoughts on this too

eli-darkly commented 4 years ago

Those are all good points. And because the first case you mentioned here, which I had forgotten about—

There are two places where the library throws errors today: when no SDK Key is passed in

—is part of ld.init(), and the configuration validation is only called from within init(), we would not really be defining a brand new behavior of init() being able to throw an exception. We would just be extending the category of "parameters that are so invalid we can't proceed at all" to include a malformed logger object as well as a missing SDK key. And since the logger is not just heavily used throughout the SDK, but specifically necessary for reporting errors, that is probably reasonable.

So, I think it'd make sense to 1. throw an exception from config validation if the logger does not have all four methods, and 2. also have the config validation code wrap any custom logger that's provided by the app in a delegator that swallows exceptions. There wouldn't be much we could do about an exception at that point except to write something to console.log(), but at least it wouldn't kill the app, and even if the console output wasn't noticeable, you might at least notice that you were not getting log output.

About other configurable components, I'm less convinced that we should add extra validation. You mentioned the feature store; there are definitely some methods there that are not called during startup, so if you completely omitted them, an error would only show up at some later time. But that gets back to not being able to prove that any implementation, even if it has methods with the right names, isn't broken. To some degree, if you choose to provide a custom component, you are taking on the responsibility of making sure it is usable and that all of its methods have been tested before you put it into production. Also, the logger is the only case where we have the specific issue of a broken component also breaking the error-reporting mechanism.

eli-darkly commented 4 years ago

I think this looks good. I'm going to merge it into an alternate branch rather than directly onto master in case we still need to adjust anything, but it should go into the next release in some form. Thanks!