n8n-io / n8n

Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
https://n8n.io

Error Detail Missing in Execution View #8793

Open dkindlund opened 6 months ago

dkindlund commented 6 months ago

Bug Description

I've been trying to troubleshoot random n8n workflow errors for several days now, and I'm getting frustrated by the lack of detail in the default execution view. Let me explain -- take a look at this example error:

[screenshot: execution view of the failed workflow]

My questions are simply: In this view, how can I figure out what the underlying error was? Which node do I click on? There's no individual warning icon indicating which node I should focus on.

If I zoom in to just the subset of nodes that are "green"...

[screenshot: zoomed-in view of the "green" nodes]

If I click into the details of each of those nodes, I can't find the original error at all.

In fact, the only way for me to figure out the underlying error is to set up an "Error Workflow" and then review the contents of that workflow's output -- but here's the thing: there's no forward link from the original workflow execution to the corresponding Error Workflow execution that contains the underlying error!

Instead, right now, I'm left piecing this puzzle together manually, based on Slack notifications I've set up -- joined on the workflow execution ID:

[screenshot: Slack notifications joined by workflow execution ID]
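For context, this is roughly the kind of Code node that produces those Slack messages inside the Error Workflow -- a minimal sketch only; the field names follow n8n's documented Error Trigger payload and may differ by version:

```javascript
// Sketch of a Code node inside the Error Workflow.
// The Error Trigger provides metadata about the failed run; the field names
// below (execution.id, execution.error.message, etc.) follow the documented
// payload and should be verified against your n8n version.
const data = $input.first().json;

return [
  {
    json: {
      text:
        `Workflow "${data.workflow?.name}" failed\n` +
        `Execution ID: ${data.execution?.id}\n` +
        `Last node executed: ${data.execution?.lastNodeExecuted ?? 'unknown'}\n` +
        `Error: ${data.execution?.error?.message ?? 'no error message recorded'}`,
    },
  },
];
```

The `text` field then goes straight into a Slack node, and the Execution ID is what I use to match the notification back to the original execution.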

In short, I believe that this feature is misleading:

[screenshot: workflow setting for saving failed executions]

^ I assume that when it's enabled, the full error details of failed executions should also be saved, but it looks like that's not happening here.
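For reference, the instance-level counterparts of that toggle are n8n's execution-data environment variables. A sketch of the relevant settings (names per the n8n docs; defaults and accepted values should be verified for your version):

```
# Instance-level execution data settings (names per n8n docs; verify for your version)
EXECUTIONS_DATA_SAVE_ON_ERROR=all      # keep data for failed executions
EXECUTIONS_DATA_SAVE_ON_SUCCESS=all    # keep data for successful executions
EXECUTIONS_DATA_SAVE_ON_PROGRESS=true  # save data after each node; helps post-mortems at some performance cost
```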

To Reproduce

Generate any sort of workflow error and then try to figure out where the error is located.

Expected behavior

I should see all types of errors in failed executions -- including out-of-memory errors.

Operating System

Google Cloud Run

n8n Version

1.30.1

Node.js Version

18.10

Database

PostgreSQL

Execution mode

main (default)

Joffcom commented 6 months ago

Hey @dkindlund,

Looking at the workflow I would say the error occurred on the Airtable node but more information would be needed.

Looking at the output you collected from the error trigger (which may also have appeared in the n8n log), it suggests the issue occurred because of a memory problem, meaning the node never really got a chance to start.

You are not wrong, we really should surface this information in the UI somewhere, but as it is a workflow-level error it wouldn't be right to put it under the node output, so we would need to think about how best to display it.

I suspect that when the workflow process runs out of memory it doesn't have enough memory left to attach that error to the node, which is why it isn't there. We should probably also make it clearer that those settings are for the workflow itself and not the system in general.
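As an aside, if the container itself has headroom but the Node.js process is hitting its default heap limit, raising the heap is the usual workaround -- a sketch (the flag is a standard Node.js option; the value is only an example):

```
# Sketch: raise the Node.js heap for a self-hosted n8n process
# --max-old-space-size is a standard Node.js flag; 4096 MB is an example value
NODE_OPTIONS=--max-old-space-size=4096
```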

This isn't really a bug but I will keep this open and get a dev ticket created on Monday to look into how we can improve this.

dkindlund commented 6 months ago

Thanks for the analysis, @Joffcom -- I agree it's a hard problem. Just trying to offer a user's perspective about it for now. Thanks!

dkindlund commented 6 months ago

One other point: when I checked Google Cloud Run's memory usage for the single container around the time this out-of-memory error was reported, I saw that only ~15% of the container's memory was actually in use:

[screenshot: Cloud Run container memory utilization (~15%)]

Then, when I checked the logs, I saw this sort of activity:

[screenshot: Cloud Run log entries around the crash and recovery]

So the timeline of events appears to be:

We're left with a bunch of questions/insights, such as:

1) Why did the container crash to begin with? Looking through the older logs, there were no entries that provided any clues (see the log query sketch at the end of this comment).

2) When n8n attempts to recover a crashed workflow, that recovery logic appears to trigger out-of-memory issues even though the container had more than enough free memory at the time. (I suspect there might be some sort of out-of-memory bug in n8n's workflow recovery logic that is hard to pinpoint.)
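For anyone digging into question 1), the termination/restart entries can usually be pulled straight from Cloud Logging -- a sketch using documented gcloud options (the service name `n8n` is a placeholder, substitute your own):

```bash
# Sketch: pull recent Cloud Run logs for the n8n service around the crash
# (service name "n8n" is a placeholder; adjust the limit and filter as needed)
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="n8n"' \
  --limit=100 \
  --format=json
```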

dkindlund commented 6 months ago

A couple of other data points about this n8n deployment:

dkindlund commented 6 months ago

Oh, this might be a factor:

[screenshot: Cloud Run CPU allocation setting]

So essentially, Google Cloud Run can kill/restart the container at any time to run it at a cheaper rate -- not necessarily because of any n8n error.

I guess the main issue is: n8n's workflow recovery logic doesn't quite work correctly upon container restart -- hence the spurious out-of-memory errors we're seeing.
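If the restarts are mostly Cloud Run reclaiming resources, the two knobs that should reduce them are keeping CPU always allocated and keeping a minimum instance warm -- a sketch using documented gcloud flags (service name and region are placeholders):

```bash
# Sketch: reduce Cloud Run-initiated restarts for the n8n service
# (service name and region are placeholders; both flags are documented gcloud options)
gcloud run services update n8n \
  --region=us-central1 \
  --no-cpu-throttling \
  --min-instances=1
```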

Joffcom commented 6 months ago

Ah yeah, CPU will cause a similar message. We don't have logic to restart workflows after a container restart though, that is something that needs to be done manually.