mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0
18.89k stars 4.25k forks

[BUG] MLflow is very slow with authentication enabled. #9684

Open n3011 opened 1 year ago

n3011 commented 1 year ago

Issues Policy acknowledgement

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

System information

Describe the problem

With basic_auth enabled, MLflow is very slow: 40x to 90x slower than without authentication.

Here is a script to reproduce the timings (assuming you have an MLflow instance running on port 5000):

import os
import time
os.environ["MLFLOW_TRACKING_USERNAME"]="admin"
os.environ["MLFLOW_TRACKING_PASSWORD"]="password"

import mlflow
from mlflow import MlflowClient

tracking_uri = "http://localhost:5000/"

client = MlflowClient(tracking_uri=tracking_uri)
t_start = time.time()
experiment_id = client.create_experiment(name="experiment")
print("Experiment creation took: ", time.time()-t_start)

for idx in range(3):
    t_start = time.time()
    tags = {"random": "random_tag", "mlflow.user": "test-user"}
    client.create_run(experiment_id, tags=tags, run_name="run_" + str(idx))
    print("Run creation took: ", time.time()-t_start)

t_start = time.time()
runs = client.search_runs(
    experiment_id,
    run_view_type=mlflow.entities.ViewType.ALL,
    max_results=mlflow.store.tracking.SEARCH_MAX_RESULTS_THRESHOLD,
)
print("Run retrieval took: ", time.time()-t_start)

Runtime when starting the MLflow server with mlflow server:

Experiment creation took:  0.018462657928466797
Run creation took:  0.006806612014770508
Run creation took:  0.006391763687133789
Run creation took:  0.0066242218017578125
Run retrieval took:  0.005159139633178711

Runtime when starting the MLflow server with mlflow server --app-name basic-auth:

Run creation took:  0.4615945816040039
Run creation took:  0.46465229988098145
Run creation took:  0.45903968811035156
Run retrieval took:  0.45479750633239746
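To put the two sets of timings side by side, the short sketch below computes the slowdown factors from the numbers printed above. The variable names are illustrative only, and averaging the three run-creation timings is a choice made for this example.

```python
from statistics import mean

# Timings (seconds) copied from the two outputs above.
create_plain = [0.006806612014770508, 0.006391763687133789, 0.0066242218017578125]
create_auth = [0.4615945816040039, 0.46465229988098145, 0.45903968811035156]
search_plain = 0.005159139633178711
search_auth = 0.45479750633239746

# Slowdown factor = time with basic-auth / time without it.
create_ratio = mean(create_auth) / mean(create_plain)
search_ratio = search_auth / search_plain

print(f"Run creation slowdown:  {create_ratio:.0f}x")
print(f"Run retrieval slowdown: {search_ratio:.0f}x")
```

Both factors fall inside the 40x to 90x range reported in the description.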

What component(s) does this bug affect?

What interface(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

harupy commented 1 year ago

Pre and post-request processing for authentication might be causing overhead. I'll check.

github-actions[bot] commented 1 year ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

lightnessofbein commented 1 year ago

Hey @n3011 @harupy! I came across this issue and took some time to debug the server's request handling. It turns out that checking the password hash takes the majority of the time needed to process a request. So this seems to be functioning as designed rather than an unintended bug?
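As a rough illustration of why a password check can dominate a request: password hashes are normally computed with a deliberately slow key-derivation function, so every verification pays that cost again. The sketch below uses only the Python standard library. The algorithm (PBKDF2-SHA256), the iteration count, and the helper names are assumptions made for this example; the thread does not state which hash or parameters the MLflow auth app uses.

```python
import hashlib
import hmac
import os
import time


def hash_password(password: str, salt: bytes, iterations: int) -> bytes:
    """Derive a key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)


def check_password(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute the hash and compare it in constant time."""
    return hmac.compare_digest(hash_password(password, salt, iterations), expected)


salt = os.urandom(16)
iterations = 200_000  # assumed value, chosen only for illustration
stored = hash_password("password", salt, iterations)

# Time a single verification, as would happen once per authenticated request.
start = time.perf_counter()
ok = check_password("password", salt, iterations, stored)
elapsed = time.perf_counter() - start
print(f"verified={ok} in {elapsed:.3f}s")
```

The elapsed time grows linearly with the iteration count, which is the intended trade-off of such functions: slower brute-force attacks at the price of slower legitimate checks.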

vcim commented 1 year ago

Thanks for the response, @lightnessofbein. I have faced the same issue since enabling basic HTTP authentication: the MLflow client needs far more time for calls such as mlflow.models.get_model_info(). Does MLflow need to authenticate every time something is fetched from the database?

marcinkosztolowicz commented 3 weeks ago

I have the same issue with MLflow 2.14.1 and a remote Postgres database. I debugged it a bit, and it looks like every click in the server UI (for example, on an experiment) triggers several requests to the tracking server (say, around 10). For each request the hash function is executed, and a session is also created, and creating the session appears to take even more time. When I tested locally, the hash function took around 0.07 seconds and opening the session took around 0.16 seconds. Multiplied by 10, that comes to roughly 2.5 seconds spent on authentication alone.
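The estimate can be reproduced with a few lines of arithmetic. The figures are the approximate ones quoted in this comment, and the request count of 10 is the commenter's rough guess, so the result is only an order-of-magnitude check.

```python
# Approximate per-request costs quoted in the comment (seconds).
hash_check_s = 0.07
session_open_s = 0.16
requests_per_click = 10  # rough guess from the comment

per_request_s = hash_check_s + session_open_s
per_click_s = per_request_s * requests_per_click

print(f"Auth overhead per request: {per_request_s:.2f}s")
print(f"Auth overhead per UI click: {per_click_s:.1f}s")
```

With these rounded inputs the overhead comes to a little over two seconds per click, consistent with the rough figure given in the comment.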

soonjune commented 1 week ago

This is also happening to me. The search APIs filter out inaccessible records and then refetch inside a while loop to fill the page, so the overhead grows as the number of records increases.
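The pattern described here can be sketched as follows. This is a simplified model and not MLflow's actual code: the paged store, the permission check, and all names are hypothetical. It only shows how the number of backend fetches grows when most records are filtered out after each fetch.

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[int], Optional[int]]  # (records, next offset or None)


def make_store(records: List[int], calls: List[int]) -> Callable[[int, int], Page]:
    """Hypothetical paged backend that logs the offset of every fetch in `calls`."""
    def fetch(offset: int, limit: int) -> Page:
        calls.append(offset)
        chunk = records[offset:offset + limit]
        next_offset = offset + limit if offset + limit < len(records) else None
        return chunk, next_offset
    return fetch


def search_readable(fetch: Callable[[int, int], Page],
                    can_read: Callable[[int], bool],
                    max_results: int) -> List[int]:
    """Fetch pages until max_results readable records are found or data runs out."""
    readable: List[int] = []
    offset: Optional[int] = 0
    while offset is not None and len(readable) < max_results:
        chunk, offset = fetch(offset, max_results)
        # Filter after fetching; a mostly filtered page forces another fetch.
        readable.extend(r for r in chunk if can_read(r))
    return readable[:max_results]


calls: List[int] = []
fetch = make_store(list(range(100)), calls)
# Assume the user may read only every tenth record.
result = search_readable(fetch, lambda r: r % 10 == 0, max_results=5)
print(result, "backend fetches:", len(calls))
```

In this model a single search for five results needs nine backend fetches. If each fetch also incurs the authentication overhead discussed above, the cost multiplies accordingly.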