langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai
Other
44.41k stars 6.22k forks source link

Optimization Suggestions for Database Connection Release Issues #6589

Open secbr opened 1 month ago

secbr commented 1 month ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

The current project uses the database framework flask_sqlalchemy.

flask_sqlalchemy binds a database connection for each HTTP request. In a high-concurrency production environment, scenarios involving streaming returns or prolonged LLM returns occupy database connections for an extended period. For instance, an LLM request occupying more than 10 seconds would mean the entire database connection is engaged for over 10 seconds.

Eventually, this leads to a continuous increase in the number of database connections under high concurrency, until it hits the limit and exceptions occur.

Modification suggestions: (1) Decouple the database connection from HTTP requests. Release the connection immediately after querying or modifying data within an HTTP request. Re-acquire the connection for database operations as needed and release it immediately after the operation, instead of making long-duration blocking LLM requests while holding onto the database connection. (2) One of the best practices in programming is also to avoid making HTTP requests within a database connection or transaction, especially for those that take a long time.

2. Additional context or comments

No response

3. Can you help us with this feature?

hjlarry commented 1 month ago

the document here described how they use db.session, I think they already considered this high-concurrency scene.

secbr commented 1 month ago

the document here described how they use db.session, I think they already considered this high-concurrency scene.

In theory, this is the case, but under high concurrency, there have already been issues of database connections being exhausted due to the prolonged response times of the LLM model.

secbr commented 1 month ago

Additional information:

Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/app/api/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request rv = self.dispatch_request() File "/app/api/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request return self.ensure_sync(self.view_functions[rule.endpoint])(view_args) # type: ignore[no-any-return] File "/app/api/.venv/lib/python3.10/site-packages/flask_restful/init.py", line 489, in wrapper resp = resource(args, kwargs) File "/app/api/.venv/lib/python3.10/site-packages/flask/views.py", line 110, in view return current_app.ensure_sync(self.dispatch_request)(kwargs) # type: ignore[no-any-return] File "/app/api/.venv/lib/python3.10/site-packages/flask_restful/init.py", line 604, in dispatch_request resp = meth(args, kwargs) File "/app/api/controllers/console/setup.py", line 71, in decorated elif not get_setup_status(): File "/app/api/controllers/console/setup.py", line 81, in get_setup_status return DifySetup.query.first() File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2728, in first return self.limit(1)._iter().first() # type: ignore File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2827, in _iter result: Union[ScalarResult[_T], Result[_T]] = self.session.execute( File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2351, in execute return self._execute_internal( File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2226, in _execute_internal conn = self._connection_for_bind(bind) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2095, in _connection_for_bind return trans._connection_for_bind(engine, execution_options) File "", line 2, in _connection_for_bind File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go ret_value = fn(self, *arg, kw) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1189, in _connection_for_bind conn = bind.connect() File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3276, in connect return self._connection_cls(self) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 148, in init Connection._handle_dbapi_exception_noconnection( File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2440, in _handle_dbapi_exception_noconnection raise sqlalchemy_exception.with_traceback(exc_info[2]) from e File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 146, in init self._dbapi_connection = engine.raw_connection() File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3300, in raw_connection return self.pool.connect() File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 449, in connect return _ConnectionFairy._checkout(self) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1263, in _checkout fairy = _ConnectionRecord.checkout(pool) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 717, in checkout with util.safe_reraise(): File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in exit raise exc_value.with_traceback(exc_tb) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 715, in checkout dbapi_connection = rec.get_connection() File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 866, in get_connection self.connect() File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 900, in connect with util.safe_reraise(): File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in exit raise exc_value.with_traceback(exc_tb) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 896, in __connect self.dbapi_connection = connection = pool._invoke_creator(self) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 643, in connect return dialect.connect(*cargs, *cparams) File "/app/api/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 620, in connect return self.loaded_dbapi.connect(cargs, cparams) File "/app/api/.venv/lib/python3.10/site-packages/psycopg2/init.py", line 122, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "10.1.0.11", port 5432 failed: FATAL: sorry, too many clients already

(Background on this error at: https://sqlalche.me/e/20/e3q8) 2024-07-23 01:55:30,415.415 ERROR [Dummy-3590] [app.py:838] - Exception on /console/api/datasets [GET] Traceback (most recent call last): File "/app/api/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request rv = self.dispatch_request() File "/app/api/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request return self.ensure_sync(self.view_functions[rule.endpoint])(view_args) # type: ignore[no-any-return] File "/app/api/.venv/lib/python3.10/site-packages/flask_restful/init.py", line 489, in wrapper resp = resource(args, kwargs) File "/app/api/.venv/lib/python3.10/site-packages/flask/views.py", line 110, in view return current_app.ensure_sync(self.dispatch_request)(kwargs) # type: ignore[no-any-return] File "/app/api/.venv/lib/python3.10/site-packages/flask_restful/init.py", line 604, in dispatch_request resp = meth(args, kwargs) File "/app/api/controllers/console/setup.py", line 74, in decorated return view(*args, kwargs) File "/app/api/libs/login.py", line 91, in decorated_view return current_app.ensure_sync(func)(*args, *kwargs) File "/app/api/controllers/console/wraps.py", line 21, in decorated return view(args, kwargs) File "/app/api/controllers/console/datasets/datasets.py", line 64, in get configurations = provider_manager.get_configurations( File "/app/api/core/provider_manager.py", line 125, in get_configurations custom_configuration = self._to_custom_configuration( File "/app/api/core/provider_manager.py", line 607, in _to_custom_configuration self.decoding_rsa_key, self.decoding_cipher_rsa = encrypter.get_decrypt_decoding(tenant_id) File "/app/api/core/helper/encrypter.py", line 29, in get_decrypt_decoding return rsa.get_decrypt_decoding(tenant_id) File "/app/api/libs/rsa.py", line 55, in get_decrypt_decoding private_key = storage.load(filepath) File "/app/api/extensions/ext_storage.py", line 55, in load return self.load_once(filename) File "/app/api/extensions/ext_storage.py", line 58, in load_once return self.storage_runner.load_once(filename) File "/app/api/extensions/storage/s3_storage.py", line 38, in load_once data = client.get_object(Bucket=self.bucket_name, Key=filename)['Body'].read() File "/app/api/.venv/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call return self._make_api_call(operation_name, kwargs) File "/app/api/.venv/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied. 2024-07-23 01:55:44,723.723 ERROR [Dummy-642] [completion.py:80] - internal server error. Traceback (most recent call last): File "/app/api/controllers/service_api/app/completion.py", line 53, in post response = AppGenerateService.generate( File "/app/api/services/app_generate_service.py", line 40, in generate return rate_limit.generate(CompletionAppGenerator().generate( File "/app/api/core/app/apps/completion/app_generator.py", line 105, in generate model_conf=ModelConfigConverter.convert(app_config), File "/app/api/core/app/app_config/easy_ui_based_app/model_config/converter.py", line 27, in convert provider_model_bundle = provider_manager.get_provider_model_bundle( File "/app/api/core/provider_manager.py", line 200, in get_provider_model_bundle provider_configurations = self.get_configurations(tenant_id) File "/app/api/core/provider_manager.py", line 125, in get_configurations custom_configuration = self._to_custom_configuration( File "/app/api/core/provider_manager.py", line 607, in _to_custom_configuration self.decoding_rsa_key, self.decoding_cipher_rsa = encrypter.get_decrypt_decoding(tenant_id) File "/app/api/core/helper/encrypter.py", line 29, in get_decrypt_decoding return rsa.get_decrypt_decoding(tenant_id) File "/app/api/libs/rsa.py", line 55, in get_decrypt_decoding private_key = storage.load(filepath) File "/app/api/extensions/ext_storage.py", line 55, in load return self.load_once(filename) File "/app/api/extensions/ext_storage.py", line 58, in load_once return self.storage_runner.load_once(filename) File "/app/api/extensions/storage/s3_storage.py", line 38, in load_once data = client.get_object(Bucket=self.bucket_name, Key=filename)['Body'].read() File "/app/api/.venv/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call return self._make_api_call(operation_name, kwargs) File "/app/api/.venv/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied. 2024-07-23 01:55:44,724.724 ERROR [Dummy-642] [app.py:838] - Exception on /v1/completion-messages [POST] Traceback (most recent call last): File "/app/api/controllers/service_api/app/completion.py", line 53, in post response = AppGenerateService.generate( File "/app/api/services/app_generate_service.py", line 40, in generate return rate_limit.generate(CompletionAppGenerator().generate( File "/app/api/core/app/apps/completion/app_generator.py", line 105, in generate model_conf=ModelConfigConverter.convert(app_config), File "/app/api/core/app/app_config/easy_ui_based_app/model_config/converter.py", line 27, in convert provider_model_bundle = provider_manager.get_provider_model_bundle( File "/app/api/core/provider_manager.py", line 200, in get_provider_model_bundle provider_configurations = self.get_configurations(tenant_id) File "/app/api/core/provider_manager.py", line 125, in get_configurations custom_configuration = self._to_custom_configuration( File "/app/api/core/provider_manager.py", line 607, in _to_custom_configuration self.decoding_rsa_key, self.decoding_cipher_rsa = encrypter.get_decrypt_decoding(tenant_id) File "/app/api/core/helper/encrypter.py", line 29, in get_decrypt_decoding return rsa.get_decrypt_decoding(tenant_id) File "/app/api/libs/rsa.py", line 55, in get_decrypt_decoding private_key = storage.load(filepath) File "/app/api/extensions/ext_storage.py", line 55, in load return self.load_once(filename) File "/app/api/extensions/ext_storage.py", line 58, in load_once return self.storage_runner.load_once(filename) File "/app/api/extensions/storage/s3_storage.py", line 38, in load_once data = client.get_object(Bucket=self.bucket_name, Key=filename)['Body'].read() File "/app/api/.venv/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call return self._make_api_call(operation_name, kwargs) File "/app/api/.venv/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/app/api/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request rv = self.dispatch_request() File "/app/api/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request return self.ensure_sync(self.view_functions[rule.endpoint])(view_args) # type: ignore[no-any-return] File "/app/api/.venv/lib/python3.10/site-packages/flask_restful/init.py", line 489, in wrapper resp = resource(args, kwargs) File "/app/api/.venv/lib/python3.10/site-packages/flask/views.py", line 110, in view return current_app.ensure_sync(self.dispatch_request)(kwargs) # type: ignore[no-any-return] File "/app/api/.venv/lib/python3.10/site-packages/flask_restful/init.py", line 604, in dispatch_request resp = meth(args, kwargs) File "/app/api/controllers/service_api/wraps.py", line 75, in decorated_view return view_func(*args, **kwargs) File "/app/api/controllers/service_api/app/completion.py", line 81, in post raise InternalServerError() werkzeug.exceptions.InternalServerError: 500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

Some Thoughts

I think the problematic code might be in the following code snippet. Because the db.session.close() method is executed in the finally block, it means the close method is only executed after the entire program has finished running. This results in operations that involve long-duration LLM requests occurring between the creation and closing of the session.

def _generate_worker(self, flask_app: Flask,
                         application_generate_entity: CompletionAppGenerateEntity,
                         queue_manager: AppQueueManager,
                         message_id: str) -> None:
        """
        Generate worker in a new thread.
        :param flask_app: Flask app
        :param application_generate_entity: application generate entity
        :param queue_manager: queue manager
        :param message_id: message ID
        :return:
        """
        with flask_app.app_context():
            try:
                # get message
                message = self._get_message(message_id)

                # chatbot app
                runner = CompletionAppRunner()
                runner.run(
                    application_generate_entity=application_generate_entity,
                    queue_manager=queue_manager,
                    message=message
                )
            except GenerateTaskStoppedException:
                pass
            except InvokeAuthorizationError:
                queue_manager.publish_error(
                    InvokeAuthorizationError('Incorrect API key provided'),
                    PublishFrom.APPLICATION_MANAGER
                )
            except ValidationError as e:
                logger.exception("Validation Error when generating")
                queue_manager.publish_error(e, PublishFrom.APPLICATION_MANAGER)
            except (ValueError, InvokeError) as e:
                if os.environ.get("DEBUG") and os.environ.get("DEBUG").lower() == 'true':
                    logger.exception("Error when generating")
                queue_manager.publish_error(e, PublishFrom.APPLICATION_MANAGER)
            except Exception as e:
                logger.exception("Unknown Error when generating")
                queue_manager.publish_error(e, PublishFrom.APPLICATION_MANAGER)
            finally:
                db.session.close()
hjlarry commented 1 month ago

although the finally block has db.session.close, the code here inside runner.run() before invoke_llm() also has a db.session.close

maybe the error is caused by PostgreSQL's max_connections config?

ruzzzz6312 commented 3 weeks ago

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "10.1.0.11", port 5432 failed: FATAL: sorry, too many clients already. I have a complex workflow and I am the only user.

getting this error very frequently, is there any config to tweak?

TY

ChiayenGu commented 2 days ago

I also encountered this issue. By mapping the PostgreSQL database configuration file, modifying the maximum number of connections, expanding performance settings, and adjusting the connection pool size in the env file, the problem no longer occurs. However, I don't think this is the best solution.