langgenius/dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Celery maybe not working #7573

Closed. erigo closed this issue 2 months ago.

erigo commented 2 months ago

Self Checks

Dify version

0.7.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Here is the worker log:

[2024-08-23 06:54:15,840: INFO/MainProcess] Connected to redis://:**@redis:6379/1
[2024-08-23 06:54:15,843: INFO/MainProcess] mingle: searching for neighbors
[2024-08-23 06:54:16,856: INFO/MainProcess] mingle: all alone
[2024-08-23 06:54:16,888: INFO/MainProcess] celery@67b80c9f1274 ready.
[2024-08-23 06:54:17,677: INFO/MainProcess] pidbox: Connected to redis://:**@redis:6379/1.
[2024-08-23 06:54:18,106: INFO/MainProcess] Task tasks.batch_create_segment_to_index_task.batch_create_segment_to_index_task[4824e735-7356-4c16-87ab-778b1582b72a] received
[2024-08-23 06:54:18,477: INFO/MainProcess] Start batch create segment jobId: 03a3613a-38a0-4d27-a540-3b6fdac76ce5

No more logs are printed, and the dataset is still stuck in the Queuing state.

✔️ Expected Behavior

More logs and debug info.

❌ Actual Behavior

No more logs; the dataset status remains Queuing.

crazywoola commented 2 months ago

What are the current settings of this dataset, e.g. the rerank model?

dosubot[bot] commented 2 months ago

The issue with Celery not producing more logs and the dataset status being stuck on "Queuing" in Dify version 0.7.0 when self-hosted with Docker could be due to the log level configuration and the Celery worker settings.

  1. Log Level Configuration: The log level for Celery might be set to a higher threshold, such as INFO, which might not capture all the necessary logs. You can increase the verbosity by setting the LOG_LEVEL to DEBUG.

  2. Celery Worker Configuration: Ensure that the Celery worker settings are properly configured, including the number of workers and auto-scaling settings.

Here is how you can adjust these settings in your docker-compose.yaml:

x-shared-env: &shared-api-worker-env
  LOG_LEVEL: ${LOG_LEVEL:-DEBUG}  # Change INFO to DEBUG to capture more logs
  LOG_FILE: ${LOG_FILE:-}
  DEBUG: ${DEBUG:-true}  # Enable debug mode
  FLASK_DEBUG: ${FLASK_DEBUG:-false}
  SECRET_KEY: ${SECRET_KEY:-sk-9f73s3ljTXVcMT3Blb3ljTqtsKiGHXVcMT3BlbkFJLK7U}
  INIT_PASSWORD: ${INIT_PASSWORD:-}
  CONSOLE_WEB_URL: ${CONSOLE_WEB_URL:-}
  CONSOLE_API_URL: ${CONSOLE_API_URL:-}
  SERVICE_API_URL: ${SERVICE_API_URL:-}
  APP_WEB_URL: ${APP_WEB_URL:-}
  CHECK_UPDATE_URL: ${CHECK_UPDATE_URL:-https://updates.dify.ai}
  OPENAI_API_BASE: ${OPENAI_API_BASE:-https://api.openai.com/v1}
  FILES_URL: ${FILES_URL:-}
  FILES_ACCESS_TIMEOUT: ${FILES_ACCESS_TIMEOUT:-300}
  APP_MAX_ACTIVE_REQUESTS: ${APP_MAX_ACTIVE_REQUESTS:-0}
  MIGRATION_ENABLED: ${MIGRATION_ENABLED:-true}
  DEPLOY_ENV: ${DEPLOY_ENV:-PRODUCTION}
  DIFY_BIND_ADDRESS: ${DIFY_BIND_ADDRESS:-0.0.0.0}
  DIFY_PORT: ${DIFY_PORT:-5001}
  SERVER_WORKER_AMOUNT: ${SERVER_WORKER_AMOUNT:-}
  SERVER_WORKER_CLASS: ${SERVER_WORKER_CLASS:-}
  CELERY_WORKER_CLASS: ${CELERY_WORKER_CLASS:-}
  GUNICORN_TIMEOUT: ${GUNICORN_TIMEOUT:-360}
  CELERY_WORKER_AMOUNT: ${CELERY_WORKER_AMOUNT:-1}  # Ensure this is set appropriately
  CELERY_AUTO_SCALE: ${CELERY_AUTO_SCALE:-true}  # Enable auto-scaling if needed
  CELERY_MAX_WORKERS: ${CELERY_MAX_WORKERS:-10}  # Set max workers
  CELERY_MIN_WORKERS: ${CELERY_MIN_WORKERS:-1}  # Set min workers
  API_TOOL_DEFAULT_CONNECT_TIMEOUT: ${API_TOOL_DEFAULT_CONNECT_TIMEOUT:-10}
  API_TOOL_DEFAULT_READ_TIMEOUT: ${API_TOOL_DEFAULT_READ_TIMEOUT:-60}
  DB_USERNAME: ${DB_USERNAME:-postgres}
  DB_PASSWORD: ${DB_PASSWORD:-difyai123456}
  DB_HOST: ${DB_HOST:-db}
  DB_PORT: ${DB_PORT:-5432}
  DB_DATABASE: ${DB_DATABASE:-dify}
  SQLALCHEMY_POOL_SIZE: ${SQLALCHEMY_POOL_SIZE:-30}
  SQLALCHEMY_POOL_RECYCLE: ${SQLALCHEMY_POOL_RECYCLE:-3600}
  SQLALCHEMY_ECHO: ${SQLALCHEMY_ECHO:-false}
  REDIS_HOST: ${REDIS_HOST:-redis}
  REDIS_PORT: ${REDIS_PORT:-6379}
  REDIS_USERNAME: ${REDIS_USERNAME:-}
  REDIS_PASSWORD: ${REDIS_PASSWORD:-difyai123456}
  REDIS_USE_SSL: ${REDIS_USE_SSL:-false}
  REDIS_DB: 0
  CELERY_BROKER_URL: ${CELERY_BROKER_URL:-redis://:difyai123456@redis:6379/1}
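
These are ${VARIABLE:-default} substitutions, so the usual way to change them is to override the values in the .env file next to docker-compose.yaml and then recreate the containers. A minimal sketch, assuming the Celery container is the Compose service named worker:

# docker/.env
LOG_LEVEL=DEBUG
DEBUG=true

# Recreate the worker so it picks up the new environment
docker compose up -d --force-recreate worker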

Additionally, ensure that the entrypoint.sh script is correctly setting the Celery worker options. Here is the relevant part of the entrypoint.sh script:

#!/bin/bash

set -e

if [[ "${MIGRATION_ENABLED}" == "true" ]]; then
  echo "Running migrations"
  flask upgrade-db
fi

if [[ "${MODE}" == "worker" ]]; then

  # Get the number of available CPU cores
  if [ "${CELERY_AUTO_SCALE,,}" = "true" ]; then
    # Set MAX_WORKERS to the number of available cores if not specified
    AVAILABLE_CORES=$(nproc)
    MAX_WORKERS=${CELERY_MAX_WORKERS:-$AVAILABLE_CORES}
    MIN_WORKERS=${CELERY_MIN_WORKERS:-1}
    CONCURRENCY_OPTION="--autoscale=${MAX_WORKERS},${MIN_WORKERS}"
  else
    CONCURRENCY_OPTION="-c ${CELERY_WORKER_AMOUNT:-1}"
  fi

  exec celery -A app.celery worker -P ${CELERY_WORKER_CLASS:-gevent} $CONCURRENCY_OPTION --loglevel INFO \
    -Q ${CELERY_QUEUES:-dataset,generation,mail,ops_trace,app_deletion}

elif [[ "${MODE}" == "beat" ]]; then
  exec celery -A app.celery beat --loglevel INFO
else
  if [[ "${DEBUG}" == "true" ]]; then
    exec flask run --host=${DIFY_BIND_ADDRESS:-0.0.0.0} --port=${DIFY_PORT:-5001} --debug
  else
    exec gunicorn \
      --bind "${DIFY_BIND_ADDRESS:-0.0.0.0}:${DIFY_PORT:-5001}" \
      --workers ${SERVER_WORKER_AMOUNT:-1} \
      --worker-class ${SERVER_WORKER_CLASS:-gevent} \
      --timeout ${GUNICORN_TIMEOUT:-200} \
      --preload \
      app:app
  fi
fi

By setting LOG_LEVEL to DEBUG and ensuring that the Celery worker settings are properly configured, you should be able to get more detailed logs and potentially resolve the issue with the dataset status being stuck on "Queuing". Additionally, make sure that the CELERY_WORKER_AMOUNT is set appropriately, as the default is 1, and adjust other related settings as needed [1][2].

To continue talking to Dosu, mention @dosu.

erigo commented 2 months ago
[image attachment]
crazywoola commented 2 months ago

I guess the problem is between the embedding model and the worker; it seems the worker cannot connect to the embedding model. Please check the embedding model service again.
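
For reference, one way to test that from inside the worker container could be a quick request against the embedding endpoint. This is only a sketch: it assumes the Compose service is named worker, that curl is available in the image, and that the placeholder URL is replaced with the real endpoint:

docker compose exec worker curl -v http://<embedding-host>:<port>/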

erigo commented 2 months ago

I have checked, and the worker can connect to the embedding model.

erigo commented 2 months ago

I found out why the Celery worker log level is set to INFO: in the entrypoint.sh script, the celery command hardcodes --loglevel INFO instead of reading it from an environment variable.
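
A minimal sketch of how that line could pick up the variable instead (an assumed patch, not the current script):

exec celery -A app.celery worker -P ${CELERY_WORKER_CLASS:-gevent} $CONCURRENCY_OPTION \
  --loglevel ${LOG_LEVEL:-INFO} \
  -Q ${CELERY_QUEUES:-dataset,generation,mail,ops_trace,app_deletion}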

erigo commented 2 months ago
[image attachment]
erigo commented 2 months ago

(quoting Dosu's suggestion above in full)

Using DEBUG mode, my log stops here:

[2024-08-23 11:37:33,038: INFO/MainProcess] Task tasks.batch_create_segment_to_index_task.batch_create_segment_to_index_task[d292c9c5-1a9a-4eaf-be96-f79ecf28aaeb] received
[2024-08-23 11:37:33,039: DEBUG/MainProcess] TaskPool: Apply <function fast_trace_task at 0x7f50f3adbf40> (args:('tasks.batch_create_segment_to_index_task.batch_create_segment_to_index_task', 'd292c9c5-1a9a-4eaf-be96-f79ecf28aaeb', {'lang': 'py', 'task': 'tasks.batch_create_segment_to_index_task.batch_create_segment_to_index_task', 'id': 'd292c9c5-1a9a-4eaf-be96-f79ecf28aaeb', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': 'd292c9c5-1a9a-4eaf-be96-f79ecf28aaeb', 'parent_id': None, 'argsrepr': "('43edd1f8-df98-4ec6-9960-b5680537e63f', [{'content': '问题 1', 'answer': '答案 1'}, {'content': '问题 2', 'answer': '答案 2'}], '6779c07c-9deb-4200-89e0-3f7a2b1a0519', 'a1a7c357-acfb-484d-a365-30262bafb03f', 'a7dc268e-6e62-4af9-9b2f-f32458108895', 'af32597e-a4f0-45fe-9f96-b924dd8ab1e5')", 'kwargsrepr': '{}', 'origin': 'gen164@46db059089ed', 'ignore_result': True, 'replaced_task_nesting': 0, 'stamped_headers': None, 'stamps': {}, 'properties': {'correlation_id': 'd292c9c5-1a9a-4eaf-be96-f79ecf28aaeb', 'reply_to':... kwargs:{})
erigo commented 2 months ago

I have tried using prefork instead of gevent, and it works. My CLI is:

celery -A app.celery worker -P prefork --concurrency 1 --loglevel DEBUG -Q dataset,generation,mail,ops_trace,app_deletion --without-gossip --without-mingle

When using gevent, the worker stops at:

[2024-08-23 12:13:03,540: DEBUG/MainProcess] Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
[2024-08-23 12:13:03,540: DEBUG/MainProcess] Loading model from cache /tmp/jieba.cache
Loading model cost 1.036 seconds.
[2024-08-23 12:13:04,576: DEBUG/MainProcess] Loading model cost 1.036 seconds.
Prefix dict has been built successfully.
[2024-08-23 12:13:04,576: DEBUG/MainProcess] Prefix dict has been built successfully.

The correct process (with prefork) looks like this:

Loading model cost 1.144 seconds.
[2024-08-23 12:11:01,774: DEBUG/ForkPoolWorker-1] Loading model cost 1.144 seconds.
Prefix dict has been built successfully.
[2024-08-23 12:11:01,775: DEBUG/ForkPoolWorker-1] Prefix dict has been built successfully.
[2024-08-23 12:11:01,843: DEBUG/ForkPoolWorker-1] Created new connection using: 160e3511b3bc4b78b107b362d744816f
[2024-08-23 12:11:01,888: INFO/ForkPoolWorker-1] Processed dataset: c8575896-61e0-4222-add0-99b35db85f56 latency: 2.9650166537612677
[2024-08-23 12:11:01,888: INFO/ForkPoolWorker-1] Task tasks.document_indexing_task.document_indexing_task[3bedb474-26c0-4ba8-9674-c37e0086edfb] succeeded in 2.9661346543580294s: None
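
Note: entrypoint.sh already starts the worker with -P ${CELERY_WORKER_CLASS:-gevent}, so the same switch can presumably be made without editing the image by setting that variable in the worker's environment; a minimal sketch, assuming the .env file next to docker-compose.yaml is used:

# docker/.env
CELERY_WORKER_CLASS=prefork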
erigo commented 2 months ago

I have another PR that needs to be merged. So I think the fix belongs in /api/extensions/ext_celery.py: use gevent monkey patching to fix it.

# /api/extensions/ext_celery.py
# Patch the standard library as early as possible so blocking calls
# (network, file I/O) cooperate with the gevent worker pool
from gevent import monkey
monkey.patch_all()
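
For completeness, one way to confirm the fix after rebuilding might be to watch the worker logs while re-running the batch segment creation and check that processing now proceeds past the jieba prefix-dict messages (assuming the Compose service is named worker):

docker compose logs -f worker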