senecajs / seneca

A microservices toolkit for Node.js.
http://senecajs.org
MIT License

Error Handling Rethink #398

Open mcdonnelldean opened 8 years ago

mcdonnelldean commented 8 years ago

Error handling needs a rethink; my thoughts are below.

A lot of why error handling feels bad comes down to cryptic messages. We need to rethink the sort of messages we are returning. Gitter is littered with people asking about gate executor timeout errors because the messages are basically unreadable if you don't know the code.

We are also way too safe in my opinion. We shouldn't be trapping errors; the system ends up in very peculiar states. I'm a firm believer in fail fast and restart. From Gitter and the issues it does seem people's expectation is fail fast, so we really need to consider this.

Bear in mind, our own tests are in the state they are in due to weird error handling; if we had untrapped exceptions and proper error flow we could cut our test suite in half and end up with more coverage.

Contributors and users, please feel free to add pain points and constructive feedback.

AdrianRossouw commented 8 years ago

@rjrodger what release did you want to tackle error handling/logging for ?

mcdonnelldean commented 8 years ago

@AdrianRossouw It'll be major for sure.

Wardormeur commented 8 years ago

We're not on the latest because, well, time & priority. Also, some of my remarks may be related to how our stack is built, so they may not be relevant. From two months of dev, being able to understand from WHERE a call was made before it failed (when making cross-microservice calls) is crucial for debugging. Also, yes, the truncated messages are not helpful for reproducing or understanding the call context. The same goes for timeout issues which are triggered when the real issue happened two minutes earlier, but the whole call stack (async.each or equivalent) fails afterwards by "timing out". Gateway errors as you report them @mcdonnelldean are often a misimplementation of an act from what I've been experiencing, but they could be more descriptive for sure. On the production side, our concern is not having to dig through the logs ourselves, but having Logentries warn us, which raises another concern: will Logentries be able to parse our logs? Or can we use vidi to make a visualisation of errors? ;)

mcollina commented 8 years ago

I would like a major bump to remove the timeout on messages, and possibly add it back as a "filter" or plugin for some very specific microservices. I think the timeout is awesome for entry-point microservices, and maybe for Chairo.
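For reference, a minimal sketch of tuning the current behaviour, assuming the timeout init option (in milliseconds) is what governs the per-message timeout being discussed; this is an illustration, not an official recommendation:

const Seneca = require('seneca')

// raise the global message timeout instead of relying on the hardcoded default
const seneca = Seneca({ timeout: 60000 })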

mcdonnelldean commented 8 years ago

@mcollina Do you want seneca to hang in the case of a block at the other side? I'm trying to work out how we could do this and want to be as clear as possible.

mcdonnelldean commented 8 years ago

Just a heads up, @Wardormeur manages CoderDojo's Zen platform.

mcdonnelldean commented 8 years ago

Added related issues.

AdrianRossouw commented 8 years ago

The timeouts were responsible for cascading failures at higher loads on a system that I worked on before.

Poor design decisions around how entities were used meant that each entity load / list would fan out into multiple actions loading more entities, until eventually the node callback queue would get so long that the time it takes to respond to an action would surpass the hardcoded seneca message timeout.

So instead of things failing reasonably (as in, one can reason about it) in one place as Matteo suggests, it would fail all over the place in a way that had no useful debugging info.

rjrodger commented 8 years ago

Please also note: http://senecajs.org/contribute/principles.html - Continuity

Error handling rethink != major breakage is acceptable.

The challenge here is not just to rethink the error handling, but also to ensure you have a backwards compatibility path. That means code that depends on the old way still needs to work, by using an option, or a plugin.

The rule for major releases is: one line of code (typically a new plugin) is all you need to make existing code work. It is also acceptable to have an option flag to revert to previous behaviour.

It is an absolutely core value of the Seneca community that existing users are respected and that we do not break existing apps.

This does make the work of fixing error handling harder, I know :)

/cc @mcdonnelldean @AdrianRossouw @mcollina @davidmarkclements @pelger @geek

davidmarkclements commented 8 years ago

@rjrodger - I wonder if, for each area, we can have a two-step approach?

Step 1 - unencumbered reimagining

here thoughts can run free, protocols/implementations do not consider backwards compatibility

this would be "seneca-next-next" ... postmodern seneca

Step 2 - shimming

once we have nailed down "the best way" then we create an adapter that delivers backwards compatibility.

If necessary, tweaks may then be made to the pristine implementation, but only in support of the shim.

this would be "seneca-next"

mcollina commented 8 years ago

I tend to agree with @davidmarkclements, with a catch. I think we should try to implement the new features without breaking, but not be afraid to break the current API if we see no alternative. If such an event happens, we go and seek a solution (maybe one more module). That turns this into a two-step problem, which is easier to solve.

mcdonnelldean commented 8 years ago

I agree with the principle but we don't have the time or resources for the above. Realistically what is going to happen is we bludgeon our way through this until we have something smaller and more composable, and around version 6 we can start blue-sky thinking.

We need to be a little more careful here. For instance @Wardormeur has a fairly big system (CoderDojo) that we will need to run some tests against to ensure we break as little as possible.

Basically it must be reasonably backwards compatible within a version. We cannot expect a user to have to spend two days changing code to make v.next work; that's just not going to fly with customers who have built big systems.

davidmarkclements commented 8 years ago

To be clear, I'm saying we supply our shims wholesale - v.next is v.next.next + shim

The judgement call is whether it will save more time and resources in the long run to follow the ideal implementation and work backwards to a shimmed version, or to deal with the backwards-compat overhead and strive towards the eventual rainbow of perfection.

mcdonnelldean commented 8 years ago

My concern is pretty much around the volume of work to get there given we do timed releases. Back compat to a single major version is our red line, which makes our cycle of change tighter again. Generally backwards compat means you will only have to make minor changes, which roughly equate to adding a shim.

If you can change the underlying implementation but still 'add a plugin' that maintains our promise that it will all work the same if you add the shim, then we are pretty much open to any change you want (as long as the change is in line with the spirit of the toolkit).

So yeah, my chief worry is back compat, if that's respected I'm happy :D

Wardormeur commented 8 years ago

I don't expect you guys to ensure backwards compat at 100%. Updating always involves some changes at some point; you must adapt your code. So don't be afraid to "break" if it's a search/replace/add-a-param kind of work, and not a whole redesign. If it's a two-day process that's sort of fine; if it's a week, that's something that blocks our dev process for some time and may slow down the adoption of your new version. That's my only concern ;)

mcdonnelldean commented 8 years ago

@Wardormeur yeah that is pretty much the sort of break I am talking about. If I can just replace a bunch of plugins or do a simple find and replace then it's good. If it's any more than that we start losing projects to rot since the work gets pushed out. Breaking badly usually means folks get left on older versions and we end up with a support nightmare rather than people moving up versions.

With breaks we pay up front or in the future. I'd rather pay up front and bring our community along with us up the versions.

mcdonnelldean commented 8 years ago

@davidmarkclements Sorry to leave off on this. I'd like to organise a chat with you at some stage around this shim concept; we have a few other objectives that need a similar idea, and I'd like to get a practical overview of what it means.

OriginalEXE commented 8 years ago

Just ran into an error handling "issue"; pasting my comment from Gitter.

Context: I was running into an issue where my service would fail because seneca transports do not transport errors (even with fatal$ set to false). This was completely unexpected for me, so I lost quite a lot of time just figuring this out, and the logs did not help me.

Well, it was really unexpected for me. For example, in this case these were validation errors (like, your username is taken, or your email is not valid), so I definitely wanted to propagate the error to my web server in order to let my API consumer know what they did wrong. Of course, I could just not use the err object (at least not as the first parameter of the reply), but it's counter-intuitive to me to say there were no errors when there were. I mean, if I myself created an error object, then I probably know how to handle it, and want to handle it myself. I understand this is logical for programmer errors which you can't really recover from, but for operational ones I believe it should be possible. Maybe I'm wrong, but that doesn't change the fact that I expected this behavior (and I was not the only one), so it definitely needs to be pointed out very prominently in the docs. As for whether the behavior needs to change (by default or with some kind of flag), I'm up for that discussion.
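As an illustration of the workaround being discussed (not an official API), business-level failures can travel in the result payload so the transport never treats them as errors; the ok/why convention and the isValidEmail helper below are hypothetical:

const seneca = require('seneca')()

// hypothetical validator, for illustration only
const isValidEmail = (email) => /\S+@\S+\.\S+/.test(email)

seneca.add('role:user,cmd:register', (msg, done) => {
  if (!isValidEmail(msg.email)) {
    // operational/business failure: report it in the payload, not as err
    return done(null, { ok: false, why: 'invalid-email' })
  }
  done(null, { ok: true, user: { email: msg.email } })
})

seneca.act('role:user,cmd:register', { email: 'not-an-email' }, (err, out) => {
  if (err) { throw err }                // programmer/system error: fail fast
  if (!out.ok) { console.log(out.why) } // business error: report it to the API consumer
})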

mihaidma commented 8 years ago

@OriginalEXE In my understanding there are two big error categories:

mcdonnelldean commented 8 years ago

@OriginalEXE @mihaidma And this I think is the issue: we have no way to express business logic errors as such. This is further exacerbated by the fact that there is no consensus on the standard way to do business logic errors in Node.

OriginalEXE commented 8 years ago

I agree. I like this writeup on errors in node: https://www.joyent.com/developers/node/design/errors

Also, the library they mention for aiding the error propagation seems interesting: https://github.com/davepacheco/node-verror

So yeah, there really needs to be a way to handle operational errors other than passing the error as part of a message object and having to check for some kind of "success" flag when you receive a response. This is even more annoying when you use seneca.act as a promise, as it's really logical imo to use .catch for handling (for example) an input validation failure.
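A sketch of what that looks like today with a promisified act, reusing the hypothetical ok/why convention from the earlier example; .catch only sees transport and system errors, so business failures have to be surfaced by hand:

const { promisify } = require('util')
const seneca = require('seneca')()
const act = promisify(seneca.act.bind(seneca))

async function register (email) {
  const out = await act('role:user,cmd:register', { email }) // rejects only on system errors
  if (!out.ok) { throw new Error(out.why) }                   // surface validation failures explicitly
  return out.user
}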

mcdonnelldean commented 8 years ago

@mcollina Could you add your summary from #402

The end result of the spike for /all: we can't just remove try-catch; it sets off a series of changes that ends with needing to make a decision on error handling and some large breaking changes. Before we continue, I want to be really clear about what we are changing and make sure that each thing we add or remove is accounted for. This means we need to capture the current and proposed error flow before we start messing with it.

I really want to make sure we capture @OriginalEXE's user story above. Removing error capture will get us cleanly failing systems and better errors, but it still won't fix the major community issue of trying to deal with system-level errors vs business-logic-level errors, which is the larger priority of this work.

@rjrodger @mcollina, et al, it may be a good idea to start looking at some design docs to iterate on. They can go in the /docs folder for now. We need to capture the new and existing flow in a PR/iteration-friendly way; a markdown doc in this repo seems the most sensible?

mcdonnelldean commented 8 years ago

Noting https://github.com/senecajs/seneca/issues/133 here as it's pretty much the same issue @OriginalEXE speaks about, but the other way around. All roads lead to having two types of error scenarios but only one way to handle them.

davidmarkclements commented 8 years ago

@mcdonnelldean just caught up with the thread, happy to talk about shims but.. just to clarify, are we at a point where everyone's okay with clean break + shim, or is that whole discussion still being had? If so we still need to nail down the prevailing strategy and context, which is a bigger conversation requiring more input than I should or could provide alone (i.e. cc @rjrodger @pelger @mcollina ....)

mcollina commented 8 years ago

@mcdonnelldean it would be good if @rjrodger (or someone else) writes down how things are supposed to work in the current version, not how they work from a code point of view. This will help clarify the scenarios Seneca is trying to cover, and provide some good documentation for v3. Then, we can iterate and change that document.

rjrodger commented 8 years ago

@mcollina good idea - let me get this on paper

dgonzalez commented 8 years ago

+1 to writing down how things are supposed to work in the current version for error handling.

StarpTech commented 8 years ago

+1 @rjrodger did you start on this paper?

mcdonnelldean commented 8 years ago

@StarpTech This is starting post-3.0.0; we are focusing on getting that out first, and then we will move on to documenting this stuff in more detail before 4.0.0 so we can get the community in on the decision-making process. As soon as the draft becomes available I'll update this thread.

StarpTech commented 8 years ago

@mcdonnelldean is it possible to track the progress on 3.0, or how is it possible to contribute?

Something like this https://developer.microsoft.com/en-us/microsoft-edge/platform/status/ or a realtime board with trello https://trello.com/b/EceUgtCL/ghost-roadmap

This is much easier to maintain, e.g. https://github.com/OpenIotOrg/openiot/wiki/Roadmap-Implementation-Status

jagin commented 8 years ago

Hello, any news about 3.0.0 - June 2016 release?

mcdonnelldean commented 8 years ago

Hey,

Yes there will be an update to the roadmap tomorrow to correct it and realign priorities. I can ping this message when done.

Kindest Regards,

Dean


StarpTech commented 8 years ago

@mcdonnelldean There have been no updates in a month. Can you describe what is happening with it? What are the plans?

mcdonnelldean commented 8 years ago

Hey Dustin,

No immediate plans as of yet. 3.0 has just gone live and we are working on module compatibility right now.

This will probably be picked up in a couple of months once we have a clear picture of how 3.0 is bedding down.

Kindest Regards,

Dean


StarpTech commented 8 years ago

@mcdonnelldean in my opinion error handling is a big part of the picture and needs good improvements. Are you planning to take it on in the next roadmap?

mcdonnelldean commented 8 years ago

@StarpTech I don't think there is anyone here who doesn't agree error handling needs to be sorted, hence this issue. Seneca 3.0 is just out; we have an obligation to folks to ensure current modules are in working order, standard practice for any toolkit this big.

Once this work is done we will re-examine the roadmap.

StarpTech commented 8 years ago

Hi @mcdonnelldean I only ask because I can't find it in 4.0, 5.0...

mcdonnelldean commented 8 years ago

That's because it hasn't been planned yet. As mentioned already we are still bedding down 3.0 and making sure everything is ok there. After that we will look at what's next for further versions.

Kindest Regards,

Dean


StarpTech commented 8 years ago

Then I don't understand the prioritisation, but thanks for the clarification.

StarpTech commented 7 years ago

Any updates? No progress in 2 months.

StarpTech commented 7 years ago

@mcdonnelldean @rjrodger anything?

StarpTech commented 7 years ago

This is not the way to go. There are lots of people who have issues with this and they don't know where it's going. No comments, no roadmap, and too little documentation to work on it. I created https://github.com/hemerajs/hemera and it solves all my problems.

JoshDobbin commented 7 years ago

Thanks for making this amazing framework. I found it an odd choice to make errors passed in the first argument of the callback fatal, but I didn't take the time to write and share an awesome framework, so I can't complain.

My solution was to create a wrapper function for Seneca that can be used everywhere and then include this line inside the wrapper function somewhere after Seneca had been instantiated:

seneca.fixedargs.fatal$ = false;
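A rough sketch of that wrapper (createSeneca is just an illustrative name, not part of Seneca's API; only fixedargs.fatal$ comes from the comment above):

const Seneca = require('seneca')

function createSeneca (options) {
  const seneca = Seneca(options)
  // stop errors passed to action callbacks from being escalated to fatal
  seneca.fixedargs.fatal$ = false
  return seneca
}

module.exports = createSeneca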

It still leaves me with the problem of determining how to handle fatal errors, but this way my whole team doesn't have to be retrained on a new way to utilize callbacks that doesn't follow the norms for other libraries.

Again thanks. Great framework.

tswaters commented 7 years ago

We also have seneca.fixedargs.fatal$ = false set on all microservices.

Going forward, it would be nice if fatal-type errors that seneca throws itself (such as act_not_found) had a meta key attached for fatal$: true, whereas if you pass an error to the callback of an action, it isn't assumed to be fatal and is just returned to the caller as an err.

On that note on returning err to the caller: it would be really nice if the actual error that was passed back looked identical to the one you provided. Right now it creates a new error object with a bunch of seneca-specific stuff on it like plugin, pin, args, etc., and it mangles the message to include a bunch of extra stuff. It's not ideal for actually dealing with business-specific errors throughout an application.

I find I need to do a lot of checks right now for err.orig$ to get back my original error and figure out why the action failed, which isn't ideal. It would be nice if there was an err.seneca$ with all the additional information such as plugin, pin, args, etc. If you want to see it, it's there, but the error you receive back still looks like the error you provided.

Here would be our ideal pattern --

seneca.add('role:entity,cmd:get', (args, done) => {
  someEntity.load(args.id, (err, entity) => {
    if (err) { return done(new DatabaseError(err)) }
    if (!entity) { return done(new NotFoundError(args.id)) }
    done(null, entity)
  })
})

And in the calling function, typically a seneca-web action using seneca-web-adapter-express - which now passes errors to next (yay):

seneca.add('role:web,cmd:get', (msg, done) => {
  if (msg.args.user == null) { return done(new UnauthorizedError()) }
  if (msg.args.user.role !== 'specific-role') { return done(new ForbiddenError()) }
  const id = msg.args.query.id
  if (!id) { return done(new ValidationError('id must be provided')) }
  seneca.act('role:entity,cmd:get', {id}, done)
})

And in our express app, we handle all the errors from anywhere along the action calls:

app.use((err, req, res, next) => {
  if (req.headersSent) { return next(err) }
  if (err instanceof DatabaseError) { return res.status(500).send('db error') }
  if (err instanceof NotFoundError) { return res.status(404).send('couldnt find ' + err.message) }
  if (err instanceof ForbiddenError) { return res.status(403).send('you cant do that') }
  if (err instanceof UnauthorizedError) { return res.status(401).send('please login') }
  if (err instanceof ValidationError) { return res.status(400).send(err.message) }
  res.status(500).send('unexpected')
})

This is a moderately simple example; omitted is the localization of errors into different languages, performing replacements on specific keys in the error context, using parambulator to return the validation errors, and a bunch of other stuff I've probably missed.

For this pattern to work, we need to do a bunch of extra stuff to get around err.orig$ being the actual error, and instanceof not working properly because everything gets recreated after passing the transport boundary. All of this is solvable in userland, but to get any of it to work without services dying we need seneca.fixedargs.fatal$ = false, which doesn't seem right.
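To make that "extra stuff" concrete, here is a sketch of the kind of userland helper being described: unwrap err.orig$ and rebuild our own error types by name so instanceof works again. The error classes and the assumption that the wrapped error keeps its name and message across the transport are ours, not guarantees made by Seneca:

class DatabaseError extends Error {
  constructor (message) { super(message); this.name = 'DatabaseError' }
}
class NotFoundError extends Error {
  constructor (message) { super(message); this.name = 'NotFoundError' }
}

const errorTypes = { DatabaseError, NotFoundError }

// try to recover the caller's original error from the seneca-wrapped one
function unwrapSenecaError (err) {
  const original = (err && err.orig$) || err         // prefer the wrapped original error
  const Type = errorTypes[original && original.name]
  if (!Type) { return original }                     // unknown type: hand back whatever we got
  return new Type(original.message)                  // rebuild so instanceof checks work again
}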

agarbund commented 7 years ago

Hello, all. In the first place, I'd like to thank all involved in creating this great framework.

I found this topic when fighting with SENECA TERMINATED error messages after plugging seneca-amqp-transport into my project (we return business logic errors sometimes, and before adding the transport plugin it was working fine). I read about the seneca.fixedargs.fatal$ = false workaround but can't get it to work in my case.

I created a simple repo demonstrating my problem: https://github.com/agarbund/seneca-error-issue Could someone please explain what I am doing wrong here?

Glathrop commented 5 years ago

Is there more to this in the past year and a half?

Intensive Googling is leading me in circles. The FAQs reference fatal$ on plugin initialization but I see no reference to error propagation. The issue is still open and I'm not seeing a consensus in any error handling thread.

Are the workarounds here "working" so the thread is dead? We love many parts of Seneca, but the lack of clarity in how best to even think about errors causes some major issues in our program.

The explanation of existing error handling reads as a very literal interpretation of microservices where each individual plugin would also be an individually hosted container that dies and restarts independently of other containers. That can't be correct, as hosting would be nightmarish. Am I missing some fundamental Node.js process understanding? Correct me if I'm wrong, but a FATAL = SIGTERM and is going to crash the container along with all the plugins in the container, not just the individual Node process?

Our specific problems arise in a hosted K8s environment. When one pod dies due to a fatal error (typically TIMEOUT), all pods of that same type eventually fail as the message broker attempts to get a satisfactory response back. Yes, they restart and heal, but nothing is timely and the app is down for the time it takes for all pods to restart. The net effect is the same as if we had a more monolithic setup. The app is unusable.

Intuitively this feels like an incorrect configuration/setup/pattern in our code. That said, there are quite a few moving pieces in a modern hosted environment, so who knows! It's challenging for a small team to have experience with the issues that arise going from one to many plugins, from local to a small distribution, etc. We can't be the only team feeling this way.

What would be very helpful to relative newbies like myself is a written explanation of how to think about this issue and why the workarounds "work". I'm hesitant to just start throwing fatal$ = false; in our codebase without truly understanding the larger ramifications or how to handle errors that are actually fatal and should lead to a SIGTERM.

Thanks to all who have contributed their time and energy to this. I know it's easy to just ask questions, but for the sake of future readers, it does seem time to stir the pot. I've admitted my lack of understanding so I'm happy to answer any clarifying questions if it can add to the thread.

Glathrop commented 5 years ago

Crickets...

s9tpepper commented 5 years ago

Do you have any remote idea as to what may be causing the errors? Like, is it a JS runtime error, a null pointer of some kind? Or timeouts? Or the database?

Are the errors happening in your application actions?

Do your actions work at some point and then fail, so you know the code works (kind of) and then something happens?

-omar


Glathrop commented 5 years ago

@s9tpepper timeouts are common.

s9tpepper commented 5 years ago

But I wasn't asking if they're common. I was asking for more details from Jarek to get more context.

And they’re not that common if you know what’s timing out and you fix it.

-omar
