n8n-io / n8n

Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
https://n8n.io

Running executions cannot be stopped #8354

Open prononext opened 9 months ago

prononext commented 9 months ago

On n8n Docker in queue mode, version 1.22.6:

I see various executions which cannot be stopped.

(screenshot)

When hitting "Stop Executions", the confirmation appears, but the executions are still running.

Also, under the main Executions menu I see several executions running, and when switching to some of them, the running ones only light up for a second and then disappear (sadly I could not document this on video, as it happens too fast).

(screenshot)

I have these ENVs on my Docker stack. Maybe someone could check whether they are still OK for a high-performance instance whose workflows sometimes run for about 60 minutes:

DB_INIT_FILE=/opt/n8n/init-data.sh
N8N_LOCAL_STORAGE=/local-files
POSTGRESQL_VERSION_TAG=12.13
REDIS_VERSION_TAG=alpine
N8N_VERSION_TAG=1.22.6
N8N_MAIN_COMMAND=start
N8N_WORKER_COMMAND=worker --concurrency=10
N8N_WEBHOOK_COMMAND=webhook
N8N_PORT=5678
N8N_USER_MANAGEMENT_DISABLED=false
N8N_BASIC_AUTH_ACTIVE=true
N8N_DIAGNOSTICS_ENABLED=false
N8N_PERSONALIZATION_ENABLED=false
N8N_HIRING_BANNER_ENABLED=false
N8N_LOG_LEVEL=debug
N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true
EXECUTIONS_MODE=queue
EXECUTIONS_DATA_SAVE_ON_SUCCESS=none
EXECUTIONS_DATA_SAVE_ON_ERROR=all
EXECUTIONS_DATA_PRUNE=true
EXECUTIONS_DATA_MAX_AGE=32
N8N_DEFAULT_BINARY_DATA_MODE=filesystem
N8N_AVAILABLE_BINARY_DATA_MODES=filesystem
DB_TYPE=postgresdb
DB_POSTGRESDB_PORT=5432
DB_POSTGRESDB_HOST=postgres
DB_LOGGING_MAX_EXECUTION_TIME=0
QUEUE_BULL_REDIS_PORT=6379
QUEUE_BULL_REDIS_HOST=redis
QUEUE_HEALTH_CHECK_ACTIVE=true
N8N_GRACEFUL_SHUTDOWN_TIMEOUT=600
QUEUE_WORKER_LOCK_DURATION=180000
QUEUE_WORKER_MAX_STALLED_COUNT=5
QUEUE_RECOVERY_INTERVAL=300
QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD=180000
N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true
N8N_ENDPOINT_WEBHOOK=prod
NODE_OPTIONS="--max-old-space-size=8000"
NODE_FUNCTION_ALLOW_BUILTIN=*
NODE_FUNCTION_ALLOW_EXTERNAL=*
Joffcom commented 9 months ago

Hey @prononext,

Do you get the same issue in 1.24.1?

prononext commented 9 months ago

Is it safe to update to the next version on a production environment? I have downgraded to 1.22.1 and really strange things are happening.

Running into major problems like this with n8n every 1-2 months is sadly really killing the project for me.

Joffcom commented 9 months ago

Hey @prononext,

The `next` version will be marked `latest` later today, so it should be OK. But as with any software used in a production environment, I would recommend running a test environment so you can make sure your flows don't do anything unexpected.

While we don't set out to break things, sadly, as with any application, the odd issue does slip through.

It sounds like the executions that can't be stopped might not be linked to the version, but I am fairly sure we fixed something related to them recently. I will dig through the release notes to see if I can find anything.

dkindlund commented 7 months ago

Hey @Joffcom , this might be a regression but I'm encountering the same issue on n8n@1.30.1 as well. Specifically, when a scheduled workflow is running, and I try to manually stop it, it doesn't actually stop the execution (keeps running). I'm not sure how to debug or troubleshoot this further.

Joffcom commented 7 months ago

@dkindlund when you press stop, do you see an error? Are you also running in queue mode?

dkindlund commented 7 months ago

Hey @Joffcom , when I press stop, the UI briefly changes the workflow to a "stopped" state, but then during the next auto-refresh, it goes back to "running". I'm not running in queue mode (it's the main/standalone/integrated mode). The only way to fix this issue is by restarting the container altogether.

To be honest, it feels like some sort of DB record conflict, perhaps? Like, when I press "stop" on the workflow execution, I think it first updates the workflow state information inside the PostgreSQL DB and then the thread responsible for running the job is supposed to "check" the state table in the DB -- but it never does -- instead, the thread responsible for running the job ends up just updating the DB entry again.
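The race dkindlund describes can be modelled in a few lines. This is a hypothetical sketch of the suspected pattern, not n8n's actual code: a worker that writes its own status back on every step, without ever re-reading the shared state, will silently overwrite a concurrent "stopped" request from the UI.

```python
import threading
import time

# Hypothetical model of the suspected conflict: a shared "DB row"
# holding the execution status, guarded by a lock.
db = {"status": "running"}
db_lock = threading.Lock()

def worker_ignoring_stop(steps: int) -> None:
    """Runs the job, writing 'running' back each step without ever
    re-checking for a stop request -- the suspected bug."""
    for _ in range(steps):
        time.sleep(0.01)  # simulate one node executing
        with db_lock:
            db["status"] = "running"  # clobbers a concurrent "stopped"
    with db_lock:
        db["status"] = "success"

def stop_from_ui() -> None:
    """What the UI's stop button effectively does: flip the row to 'stopped'."""
    with db_lock:
        db["status"] = "stopped"

t = threading.Thread(target=worker_ignoring_stop, args=(20,))
t.start()
time.sleep(0.05)
stop_from_ui()       # the UI briefly shows "stopped"...
t.join()
print(db["status"])  # ...but the worker has overwritten it: "success"
```

A cooperative worker would instead re-read `db["status"]` on each iteration and exit early when it sees "stopped", which matches the "UI flips to stopped, then flips back on the next refresh" symptom above.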

Joffcom commented 7 months ago

Hey @dkindlund,

Do you also have this issue when you connect to your n8n instance directly without using any kind of reverse proxy or load balancer? I have just checked our internal install, my cloud instance and my home instance and I am not able to reproduce this.

dkindlund commented 7 months ago

Hey @Joffcom , that's a good question. I don't have it running locally -- it's deployed as a Google Cloud Run container. It's currently configured to spin up between 1 and 3 instances (autoscaling). Most of the time, a single instance is running. In the network section, I do have Session affinity checked, so that way the load balancer keeps state.

Google Cloud Run services are essentially a simplified wrapper on top of Knative/Kubernetes, and here's the underlying YAML that's generated for this deployment:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: n8n
  namespace: 'XXXREDACTEDXXX'
  selfLink: /apis/serving.knative.dev/v1/namespaces/XXXREDACTEDXXX/services/n8n
  uid: XXXREDACTEDXXX
  resourceVersion: XXXREDACTEDXXX
  generation: 38
  creationTimestamp: '2024-02-09T02:00:07.415038Z'
  labels:
    owner: darien
    managed-by: gcp-cloud-build-deploy-cloud-run
    purpose: n8n
    gcb-trigger-id: XXXREDACTEDXXX
    gcb-trigger-region: global
    commit-sha: XXXREDACTEDXXX
    gcb-build-id: XXXREDACTEDXXX
    cloud.googleapis.com/location: us-west1
  annotations:
    run.googleapis.com/client-name: cloud-console
    serving.knative.dev/creator: darien@fletch.ai
    serving.knative.dev/lastModifier: darien@fletch.ai
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/operation-id: XXXREDACTEDXXX
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
        owner: darien
        client.knative.dev/nonce: XXXREDACTEDXXX
        managed-by: gcp-cloud-build-deploy-cloud-run
        purpose: n8n
        gcb-trigger-id: XXXREDACTEDXXX
        gcb-trigger-region: global
        commit-sha: XXXREDACTEDXXX
        gcb-build-id: XXXREDACTEDXXX
        run.googleapis.com/startupProbeType: Custom
      annotations:
        run.googleapis.com/client-name: cloud-console
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default","tags":["dev-n8n"]}]'
        run.googleapis.com/sessionAffinity: 'true'
        autoscaling.knative.dev/minScale: '1'
        run.googleapis.com/vpc-access-egress: private-ranges-only
        run.googleapis.com/execution-environment: gen2
        autoscaling.knative.dev/maxScale: '3'
        run.googleapis.com/startup-cpu-boost: 'true'
    spec:
      containerConcurrency: 80
      timeoutSeconds: 3600
      serviceAccountName: XXXREDACTEDXXX
      containers:
      - name: main
        image: us-west1-docker.pkg.dev/XXXREDACTEDXXX/cloud-run-source-deploy/n8n/n8n:XXXREDACTEDXXX
        ports:
        - name: http1
          containerPort: 5678
        env:
        - name: N8N_VERSION
          value: latest
        - name: DB_POSTGRESDB_DATABASE
          value: dev-n8n-conf
        - name: DB_POSTGRESDB_HOST
          value: XXXREDACTEDXXX
        - name: DB_POSTGRESDB_USER
          value: dev-n8n-conf
        - name: DB_POSTGRESDB_PORT
          value: '5432'
        - name: DB_POSTGRESDB_SCHEMA
          value: public
        - name: DB_POSTGRESDB_SSL_REJECT_UNAUTHORIZED
          value: 'false'
        - name: DB_POSTGRESDB_SSL_CA
          value: XXXREDACTEDXXX
        - name: DB_TYPE
          value: postgresdb
        - name: N8N_USER_FOLDER
          value: /opt/n8n
        - name: WEBHOOK_URL
          value: XXXREDACTEDXXX
        - name: GENERIC_TIMEZONE
          value: America/Los_Angeles
        - name: EXECUTIONS_TIMEOUT
          value: '2700'
        - name: N8N_EDITOR_BASE_URL
          value: XXXREDACTEDXXX
        - name: N8N_HOST
          value: XXXREDACTEDXXX
        - name: N8N_HIRING_BANNER_ENABLED
          value: 'false'
        - name: N8N_SMTP_HOST
          value: XXXREDACTEDXXX
        - name: N8N_SMTP_PORT
          value: '465'
        - name: N8N_SMTP_USER
          value: XXXREDACTEDXXX
        - name: N8N_SMTP_SENDER
          value: XXXREDACTEDXXX
        - name: N8N_LOG_LEVEL
          value: info
        - name: EXECUTIONS_MODE
          value: regular
        - name: N8N_DISABLE_PRODUCTION_MAIN_PROCESS
          value: 'false'
        - name: EXECUTIONS_TIMEOUT_MAX
          value: '2700'
        - name: EXECUTIONS_DATA_PRUNE
          value: 'true'
        - name: EXECUTIONS_DATA_MAX_AGE
          value: '168'
        - name: EXECUTIONS_DATA_PRUNE_MAX_COUNT
          value: '50000'
        - name: NODE_OPTIONS
          value: --max-old-space-size=1536
        - name: N8N_PUSH_BACKEND
          value: websocket
        - name: N8N_DEFAULT_BINARY_DATA_MODE
          value: filesystem
        - name: N8N_ENCRYPTION_KEY
          valueFrom:
            secretKeyRef:
              key: latest
              name: dev-n8n_secretkey
        - name: DB_POSTGRESDB_PASSWORD
          valueFrom:
            secretKeyRef:
              key: latest
              name: threat-intel-context_dev-n8n-conf_password
        - name: N8N_SMTP_PASS
          valueFrom:
            secretKeyRef:
              key: latest
              name: dev-n8n_mailjet_secretkey
        resources:
          limits:
            cpu: 2000m
            memory: 2Gi
        volumeMounts:
        - name: dev-n8n
          mountPath: /opt/n8n
        startupProbe:
          initialDelaySeconds: 60
          timeoutSeconds: 45
          periodSeconds: 60
          failureThreshold: 10
          tcpSocket:
            port: 5678
      volumes:
      - name: dev-n8n
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: dev-n8n
  traffic:
  - percent: 100
    latestRevision: true
status:
  observedGeneration: 38
  conditions:
  - type: Ready
    status: 'True'
    lastTransitionTime: '2024-02-28T19:57:58.756674Z'
  - type: ConfigurationsReady
    status: 'True'
    lastTransitionTime: '2024-02-09T02:00:07.521375Z'
  - type: RoutesReady
    status: 'True'
    lastTransitionTime: '2024-02-28T19:57:58.711992Z'
  latestReadyRevisionName: XXXREDACTEDXXX
  latestCreatedRevisionName: XXXREDACTEDXXX
  traffic:
  - revisionName: XXXREDACTEDXXX
    percent: 100
    latestRevision: true
  url: XXXREDACTEDXXX
  address:
    url: XXXREDACTEDXXX
dkindlund commented 7 months ago

Hey @Joffcom , it occurred to me that I never got a reply back to my original architecture question posted in the community forum about this issue: https://community.n8n.io/t/n8n-architecture-questions/40375

Specifically, is it possible that EXECUTIONS_MODE=regular was never designed to support more than one simultaneous instance, and that the problem of "stopping running executions" is actually a symptom of an inadvertent split-brain problem?

Say two or more n8n instances are both running with EXECUTIONS_MODE=regular and talking to the same database: if one instance is running the job, and the other instance processes the user's request to "stop" the job in the UI, maybe the code was never designed with this in mind?

Joffcom commented 6 months ago

Hey @dkindlund,

I missed your reply on this one. You are correct: regular, as documented, is not intended for multiple instances of n8n, so you would use regular if you have one instance and queue if you are running more than one.

If you were to have 2 main instances talking to the same database, I would expect there to be issues, but it would also raise the question of why it was deployed that way.

We do now support multiple main instances, but even then every instance still needs to be in queue mode.
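Why the stop button fails with two "regular" mains can be pictured like this (an illustrative sketch under assumed behavior, not n8n's implementation): each regular-mode process tracks its active executions only in its own memory, so a stop request routed by the load balancer to the wrong instance finds nothing to cancel, while the other instance keeps running the job.

```python
# Illustrative sketch (not n8n's code): two "regular" mains sharing a DB.
# Each process holds running-execution handles only in its own memory.

class MainInstance:
    def __init__(self, name: str):
        self.name = name
        self.active = {}  # execution_id -> in-memory handle (per-process!)

    def start_execution(self, execution_id: str, shared_db: dict) -> None:
        self.active[execution_id] = object()  # stand-in for a job handle
        shared_db[execution_id] = "running"

    def stop_execution(self, execution_id: str, shared_db: dict) -> bool:
        handle = self.active.pop(execution_id, None)
        if handle is None:
            # This instance never started the job; it has no way to
            # signal the other process, so the execution keeps running.
            return False
        shared_db[execution_id] = "stopped"
        return True

db = {}
a, b = MainInstance("A"), MainInstance("B")
a.start_execution("123", db)

# The load balancer routes the user's stop click to instance B:
print(b.stop_execution("123", db))  # False -- nothing to cancel here
print(db["123"])                    # still "running"
```

Queue mode presumably avoids this class of problem by coordinating work through a shared broker (Redis) that every instance can see, which is why multiple mains require it.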

AiratHalitov commented 1 month ago

The issue is still present in n8n v1.54.4.

The workflow has a timeout of 10 minutes: (screenshot)

but it does not stop: (screenshot)

I have many similar examples where processes do not stop. This is just one of them.