n8n-io / n8n

Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
https://n8n.io

Error Detail Missing in Execution View #8793

Open dkindlund opened 6 months ago

dkindlund commented 6 months ago

Bug Description

I've been trying to troubleshoot random n8n workflow errors for several days now, and I'm getting frustrated by the lack of detail in the default execution view. Let me explain -- take a look at this example error:

[screenshot: execution view of the failed workflow]

My questions are simply: In this view, how can I figure out what the underlying error was? Which node do I click on? There's no individual warning icon indicating which node I should focus on.

If I zoom in to just the subset of nodes that are "green"...

[screenshot: zoomed-in view of the "green" nodes]

If I click into the details of each of those nodes, I can't find the original error at all.

In fact, the only way for me to figure out the underlying error is to set up an "Error Workflow" and then review the contents of that workflow's output -- but here's the thing: there's no forward link from the original workflow execution to the corresponding Error Workflow execution that contains the underlying error!

Instead, right now, I'm left piecing this puzzle together manually, based on Slack notifications I've set up -- joined on the workflow execution ID:

[screenshot: Slack notifications joined by workflow execution ID]
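For context, this is roughly the kind of Code node that produces those Slack messages inside the Error Workflow -- a minimal sketch only; the field names follow n8n's documented Error Trigger payload and may differ by version:

```javascript
// Sketch of a Code node inside the Error Workflow.
// The Error Trigger provides metadata about the failed run; the field names
// below (execution.id, execution.error.message, etc.) follow the documented
// payload and should be verified against your n8n version.
const data = $input.first().json;

return [
  {
    json: {
      text:
        `Workflow "${data.workflow?.name}" failed\n` +
        `Execution ID: ${data.execution?.id}\n` +
        `Last node executed: ${data.execution?.lastNodeExecuted ?? 'unknown'}\n` +
        `Error: ${data.execution?.error?.message ?? 'no error message recorded'}`,
    },
  },
];
```

The `text` field then goes straight into a Slack node, and the Execution ID is what I use to match the notification back to the original execution.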

In short, I believe that this feature is misleading:

[screenshot: workflow setting for saving failed executions]

^ I assume that when it's enabled, the full error details of failed executions should also be saved, but it looks like that's not happening here.
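For reference, the instance-level counterparts of that toggle are n8n's execution-data environment variables. A sketch of the relevant settings (names per the n8n docs; defaults and accepted values should be verified for your version):

```
# Instance-level execution data settings (names per n8n docs; verify for your version)
EXECUTIONS_DATA_SAVE_ON_ERROR=all      # keep data for failed executions
EXECUTIONS_DATA_SAVE_ON_SUCCESS=all    # keep data for successful executions
EXECUTIONS_DATA_SAVE_ON_PROGRESS=true  # save data after each node; helps post-mortems at some performance cost
```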

To Reproduce

Generate any sort of workflow error and then try to figure out where the error is located.

Expected behavior

I should see all types of errors in failed executions -- including out-of-memory errors.

Operating System

Google Cloud Run

n8n Version

1.30.1

Node.js Version

18.10

Database

PostgreSQL

Execution mode

main (default)

Joffcom commented 6 months ago

Hey @dkindlund,

Looking at the workflow I would say the error occurred on the Airtable node but more information would be needed.

Looking at the output you collected from the error trigger (which may also have appeared in the n8n log), it suggests the issue occurred because of a memory problem, meaning the node never really got a chance to start.

You are not wrong, we really should surface this information in the UI somewhere, but as it is a workflow-level error it wouldn't be right to put it under the node output, so we would need to think about how best to display it.

I suspect that when the workflow process runs out of memory it doesn't have enough memory left to attach that error to the node, which is why it isn't there. We should probably also make it clearer that those settings are for the workflow itself and not the system in general.
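As an aside, if the container itself has headroom but the Node.js process is hitting its default heap limit, raising the heap is the usual workaround -- a sketch (the flag is a standard Node.js option; the value is only an example):

```
# Sketch: raise the Node.js heap for a self-hosted n8n process
# --max-old-space-size is a standard Node.js flag; 4096 MB is an example value
NODE_OPTIONS=--max-old-space-size=4096
```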

This isn't really a bug but I will keep this open and get a dev ticket created on Monday to look into how we can improve this.

dkindlund commented 6 months ago

Thanks for the analysis, @Joffcom -- I agree it's a hard problem. Just trying to offer a user's perspective about it for now. Thanks!

dkindlund commented 6 months ago

One other point: when I checked Google Cloud Run's memory usage for the single container around the time this out-of-memory error was reported, I saw that only ~15% of the container's memory was actually in use:

[screenshot: Cloud Run container memory utilization (~15%)]

Then, when I checked the logs, I saw this sort of activity:

[screenshot: Cloud Run log entries around the crash and recovery]

So the timeline of events appears to be:

We're left with a bunch of questions/insights, such as:

1) Why did the container crash to begin with? Looking through the older logs, there were no entries that provided any clues (see the log query sketch at the end of this comment).

2) When n8n attempts to recover a crashed workflow, that recovery logic appears to trigger out-of-memory issues even though the container had more than enough free memory at the time. (I suspect there might be some sort of out-of-memory bug in n8n's workflow recovery logic that is hard to pinpoint.)
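For anyone digging into question 1), the termination/restart entries can usually be pulled straight from Cloud Logging -- a sketch using documented gcloud options (the service name `n8n` is a placeholder, substitute your own):

```bash
# Sketch: pull recent Cloud Run logs for the n8n service around the crash
# (service name "n8n" is a placeholder; adjust the limit and filter as needed)
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="n8n"' \
  --limit=100 \
  --format=json
```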

dkindlund commented 6 months ago

A couple of other data points about this n8n deployment:

dkindlund commented 6 months ago

Oh, this might be a factor:

[screenshot: Cloud Run CPU allocation setting]

So essentially, Google Cloud Run can kill/restart the container at any time to run it at a cheaper rate -- not necessarily because of any n8n error.

I guess the main issue is: n8n's workflow recovery logic doesn't quite work correctly upon container restart -- hence the spurious out-of-memory errors we're seeing.
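If the restarts are mostly Cloud Run reclaiming resources, the two knobs that should reduce them are keeping CPU always allocated and keeping a minimum instance warm -- a sketch using documented gcloud flags (service name and region are placeholders):

```bash
# Sketch: reduce Cloud Run-initiated restarts for the n8n service
# (service name and region are placeholders; both flags are documented gcloud options)
gcloud run services update n8n \
  --region=us-central1 \
  --no-cpu-throttling \
  --min-instances=1
```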

Joffcom commented 6 months ago

Ah yeah, CPU will cause a similar message. We don't have logic to restart workflows after a container restart though, that is something that needs to be done manually.