Description:
When opening the Executions tab in Kubeflow Pipelines, the default Main tab loads fine. However, after switching to the Grouped tab, the UI takes a long time to load and then fails with a Gateway Time-out error (full message under Actual Result).
Steps
Navigate to the Executions tab in the Kubeflow Pipelines UI.
Switch from the default tab to the Grouped tab.
The page attempts to load and eventually fails with the error shown under Actual Result.
Expected result
The Grouped tab should load executions specific to the selected profile and not fail with the Gateway Time-out error.
The system should only fetch executions for the current profile, rather than fetching executions for all profiles (namespaces).
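The expected behaviour can be illustrated with a minimal Python sketch. The `Execution` record, its `namespace` field, and the `executions_for_profile` helper are hypothetical names introduced here for illustration; they are not part of the Kubeflow Pipelines or ML Metadata APIs. In a real fix the filter would more likely be applied in the server-side query rather than in the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    # Hypothetical, simplified stand-in for a metadata store execution record.
    id: int
    name: str
    namespace: str  # assumed: the Kubeflow profile (namespace) that owns the run


def executions_for_profile(executions, profile):
    """Return only the executions whose namespace matches the selected profile."""
    return [e for e in executions if e.namespace == profile]


all_executions = [
    Execution(1, "train", "team-a"),
    Execution(2, "evaluate", "team-b"),
    Execution(3, "deploy", "team-a"),
]

# Keep only the executions owned by the currently selected profile.
selected = executions_for_profile(all_executions, "team-a")
print([e.id for e in selected])
```

Filtering this way after the fetch would not remove the slow query by itself; the sketch only shows which rows the Grouped tab is expected to display.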
Actual Result
The page fails with the following error message:
Error: Failed getting executions: Unknown Content-type received. Code: 2
Logs from the metadata-grpc-deployment pod show the following error:
W1021 10:05:33.342247 210 metadata_store_service_impl.cc:417] PutExecution failed: mysql_query aborted: errno: Lock wait timeout exceeded; try restarting transaction, error: Lock wait timeout exceeded; try restarting transaction
Executions Fetched Across All Profiles:
The system appears to fetch executions from all Kubeflow profiles (i.e., namespaces) regardless of the currently selected profile in the UI. This results in fetching executions across multiple namespaces, which might be contributing to the slowness.
Additional Context:
It seems that the large number of pipeline runs (~100k) may be contributing to the slow query times or query timeout in MySQL, resulting in the error.
The Lock wait timeout error from the MySQL database in the metadata-grpc-deployment pod could indicate a need for query optimisation or database tuning to handle the load more efficiently.
Pipeline Runs: Approximately 100k pipeline runs in the system
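The Lock wait timeout message itself suggests restarting the transaction. As a rough sketch of that idea, assuming the caller can safely re-run the whole transaction, the Python below retries with exponential backoff when MySQL error 1205 (ER_LOCK_WAIT_TIMEOUT) is raised. `DatabaseError` and `run_with_retry` are hypothetical names used for illustration only; this is not how the metadata-grpc-deployment service is implemented.

```python
import time

LOCK_WAIT_TIMEOUT = 1205  # MySQL error code ER_LOCK_WAIT_TIMEOUT


class DatabaseError(Exception):
    # Hypothetical exception type; real drivers expose the error code differently.
    def __init__(self, errno, message):
        super().__init__(message)
        self.errno = errno


def run_with_retry(transaction, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `transaction`, retrying with exponential backoff on lock wait timeouts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DatabaseError as err:
            # Re-raise anything that is not a lock wait timeout, or the final failure.
            if err.errno != LOCK_WAIT_TIMEOUT or attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying only hides short-lived lock contention. If the underlying queries stay slow with ~100k runs, query optimisation or database tuning is still needed.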
Environment
Materials and reference
Debugging Findings:
Network Call Failure:
The request to /ml_metadata.MetadataStoreService/GetExecutions fails with Gateway Time-out.
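One possible way to keep a single GetExecutions request from timing out is to fetch executions in pages rather than all at once. The Python sketch below shows the general pattern only. `fetch_page`, `iter_executions`, and `make_fake_backend` are hypothetical names, and the in-memory backend stands in for the metadata service; whether the UI can page this particular call is an assumption, not something confirmed by this report.

```python
def iter_executions(fetch_page, page_size=100):
    """Yield executions page by page instead of loading them all in one request.

    `fetch_page(page_size, token)` is a hypothetical callable returning
    (items, next_token); next_token is None when there are no more pages.
    """
    token = None
    while True:
        items, token = fetch_page(page_size, token)
        yield from items
        if token is None:
            break


def make_fake_backend(total):
    # In-memory stand-in for the metadata service, used only for illustration.
    data = list(range(total))

    def fetch_page(page_size, token):
        start = token or 0
        end = start + page_size
        next_token = end if end < total else None
        return data[start:end], next_token

    return fetch_page


# 250 fake executions are retrieved in three requests of at most 100 items each.
executions = list(iter_executions(make_fake_backend(250), page_size=100))
print(len(executions))
```

Each request then returns a bounded amount of data, so a slow backend degrades into more round trips rather than one request that exceeds the gateway timeout.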
Impacted by this bug? Give it a 👍.