machine-learning-exchange / mlx

Machine Learning eXchange (MLX). Data and AI Assets Catalog and Execution Engine
https://ml-exchange.org/
Apache License 2.0
205 stars · 54 forks

Quickstart pipeline API having problems with stress tests #129

Closed Tomcli closed 3 years ago

Tomcli commented 3 years ago

Describe the bug

@yhwang can you describe the errors that you found?

To Reproduce

Steps to reproduce the behavior:

  1. Deploy read only MLX at https://github.com/machine-learning-exchange/mlx/pull/126
  2. Run stress tests against the MLX pipelines API


yhwang commented 3 years ago

Here is the error message:

Traceback (most recent call last):
  File "/usr/src/app/swagger_server/util.py", line 259, in invoke_controller_impl
    results = impl_func(**parameters)
  File "/usr/src/app/swagger_server/controllers_impl/pipeline_service_controller_impl.py", line 194, in list_pipelines
    api_pipelines: [ApiPipeline] = load_data(ApiPipelineExtended, filter_dict=filter_dict, sort_by=sort_by,
  File "/usr/src/app/swagger_server/data_access/mysql_client.py", line 678, in load_data
    _verify_or_create_table(table_name, swagger_class)
  File "/usr/src/app/swagger_server/data_access/mysql_client.py", line 359, in _verify_or_create_table
    _validate_schema(table_name, swagger_class)
  File "/usr/src/app/swagger_server/data_access/mysql_client.py", line 440, in _validate_schema
    raise ApiError(err_msg)
swagger_server.util.ApiError: The MySQL table 'mlpipeline.pipelines_extended' does not match Swagger class 'ApiPipelineExtended'.
 Found table with columns:
  - 'UUID' varchar(255)
  - 'CreatedAtInSec' bigint(20)
  - 'Name' varchar(255)
  - 'Description' varchar(255)
  - 'Parameters' longtext
  - 'Status' varchar(255)
  - 'DefaultVersionId' varchar(255)
  - 'Namespace' varchar(255)
  - 'Annotations' longtext
  - 'Featured' tinyint(1)
  - 'PublishApproved' tinyint(1).
 Expected table with columns:
  - 'UUID' varchar(255)
  - 'CreatedAtInSec' bigint(20)
  - 'Name' varchar(255)
  - 'Description' longtext
  - 'Parameters' longtext
  - 'Status' varchar(255)
  - 'DefaultVersionId' varchar(255)
  - 'Namespace' varchar(63)
  - 'Annotations' longtext
  - 'Featured' tinyint(1)
  - 'PublishApproved' tinyint(1).
 Delete and recreate the table by calling the API endpoint 'DELETE /pipelines_extended/*' (500)
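The check in `_validate_schema` evidently compares the live MySQL column definitions against the columns derived from the Swagger class. A minimal sketch of that kind of comparison (hypothetical names and simplified `(name, type)` pairs; the actual MLX implementation differs):

```python
# Hypothetical sketch of a schema check like _validate_schema in the
# traceback above; the real code inspects full MySQL column definitions.
def validate_schema(found, expected, table, cls):
    """Raise if the live table columns differ from the Swagger-derived ones."""
    if found != expected:
        diffs = [(f, e) for f, e in zip(found, expected) if f != e]
        raise ValueError(
            f"The MySQL table '{table}' does not match Swagger class "
            f"'{cls}'. Mismatched columns: {diffs}")

# The two mismatches visible in the error message above:
found = [("Description", "varchar(255)"), ("Namespace", "varchar(255)")]
expected = [("Description", "longtext"), ("Namespace", "varchar(63)")]
```

Note that only `Description` and `Namespace` differ between the two column lists in the error, which points at the table having been created from a different DDL than the one `init_db.sh` uses.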

After importing the quickstart catalog, the pipelines URL works and I can see all pipeline cards. The stress test repeatedly requests 2 of the pipeline cards. After running the test for a while, the /apis/v1alpha1/pipelines API started sending back 500: Internal Server Error, and I saw the error message above in the mlx-api pod. I always start with 1 pod for mlx-api and scale up to 3 or more pods after importing the quickstart catalog. Not sure if this is related to the issue.

ckadner commented 3 years ago

Could some pods have crashed? There is a code path in the MLX API that creates the pipelines table if it does not exist. That code path was never exercised before, since we always find the pipelines table already created by KFP or by the init_db.sh script I wrote for the quickstart with Docker Compose.
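The create-if-missing code path described here follows a check-then-create pattern, which is not atomic across pods. A self-contained sketch of that pattern (using sqlite3 as a stand-in for MySQL; function and table names are illustrative, not the actual MLX code):

```python
import sqlite3  # stand-in for MySQL so the sketch is self-contained


def verify_or_create_table(conn, table, ddl):
    # Check-then-create is not atomic: a freshly started pod that races
    # the init job can run `ddl` (its own, possibly different, schema)
    # instead of finding the table the init job was about to create.
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table,)).fetchone()
    if row is None:
        conn.execute(ddl)


conn = sqlite3.connect(":memory:")
verify_or_create_table(
    conn, "pipelines_extended",
    "CREATE TABLE pipelines_extended (UUID TEXT, Namespace TEXT)")
```

If two pods run this concurrently against an empty database, both can pass the existence check and race to create the table, and whichever DDL wins determines the column types every later pod validates against.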

Tomcli commented 3 years ago

@ckadner when I rerun the init_db.sh job, the tables are recreated and everything works fine. But once we run the stress test again, the above error pops up.

ckadner commented 3 years ago

> @ckadner when I rerun the init_db.sh job, the tables are recreated and everything works fine. But once we run the stress test again, the above error pops up.

That seems to indicate that the MLX API pod does not find the pipelines table and creates it with the wrong column length for the Namespace column. This should not happen unless there is a new MySQL instance that does not get initialized in time before the first call to the MLX API's GET /apis/v1alpha1/pipelines.
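One defensive option for this timing window (an assumption on my part, not the MLX implementation) would be for an API pod to wait for the init job to create the table rather than creating it eagerly:

```python
import time


def wait_for_table(table_exists, timeout=60.0, interval=2.0):
    """Poll until the table is visible instead of creating it eagerly.

    `table_exists` is a caller-supplied callable returning True once the
    init job (e.g. init_db.sh) has created the table with the correct
    schema. Returns False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if table_exists():
            return True
        time.sleep(interval)
    return False


# Simulated init job that finishes after a couple of polls:
state = {"calls": 0}

def fake_exists():
    state["calls"] += 1
    return state["calls"] >= 3
```

This trades the schema race for a bounded startup delay on pods that come up before MySQL initialization has finished.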

ckadner commented 3 years ago

This may be an instance of inopportune timing due to the stress-test scenario. If we need to support that, I can make changes to the MLX API. (In the Docker Compose setup I made the catalog upload service depend on the MySQL service having finished initialization.)
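The Docker Compose dependency mentioned here can be expressed with a healthcheck-gated `depends_on`. A minimal sketch (service and image names are assumptions, not the actual quickstart file):

```yaml
# Illustrative fragment: gate the API on MySQL being ready, not just started.
services:
  mysql:
    image: mysql:8            # assumed image, not the quickstart's
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 5s
      retries: 10
  mlx-api:
    image: mlx-api            # assumed image name
    depends_on:
      mysql:
        condition: service_healthy
```

With `condition: service_healthy`, Compose delays starting the dependent service until the healthcheck passes, which closes the window where the API sees an empty database. Kubernetes has no direct equivalent for replicas scaled up later, which is why the race can still appear there.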

yhwang commented 3 years ago

@ckadner I guess the problem is caused by the second or third pod when we scale up the mlx-api. As I mentioned, we always do the quickstart import with replicas=1, the 1st pod. Then I scale the mlx-api up to replicas=2 or 3, and this error shows up in the 2nd and 3rd pods.

ckadner commented 3 years ago

> @ckadner I guess the problem is caused by the second or third pod when we scale up the mlx-api. As I mentioned, we always do the quickstart import with replicas=1, the 1st pod. Then I scale the mlx-api up to replicas=2 or 3, and this error shows up in the 2nd and 3rd pods.

The 2nd and 3rd replicas of the MLX API connect to the same (already initialized) MySQL database.

ckadner commented 3 years ago

The MLX API is not designed to run with multiple replicas: