mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0

[BUG] artifact_repo path drops the first directory in path (for example /dir1/dir2 -> /dir2) #11331

Open michshap opened 8 months ago

michshap commented 8 months ago

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

Describe the problem

I use mlflow in the following way:

mlflow.set_tracking_uri('file:///dir1/dir2/mlruns')
client = mlflow.tracking.MlflowClient()
active_run = mlflow.start_run(run_name='a')

and everything works fine. However, when I call log_artifact, it tries to save the artifact under '/dir2/mlruns/0/...', meaning the leading /dir1 was somehow dropped.

When debugging log_artifact, I see that this line (from the log_artifact code):

artifact_repo = self._get_artifact_repo(run_id)

returns a repo with:

artifact_repo.artifact_dir = '/dir2/mlruns/0/...'
artifact_repo.artifact_uri = 'file://dir1/dir2/mlruns/0...'

I'd guess the URI is supposed to be 'file:///dir1/dir2/mlruns/0...', as in the tracking URI (three '/' instead of two)?

How do I fix that? What causes it?
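A likely explanation, sketched with the standard library only: under the generic URI rules that Python's urllib.parse follows, whatever sits between `file://` and the next `/` is treated as the host, so a two-slash URI loses its first directory when only the path component is used. Whether MLflow derives artifact_dir in exactly this way is an assumption here, but the parsed path matches the values reported above.

```python
from urllib.parse import urlparse

# Three slashes: empty host, full absolute path.
good = urlparse("file:///dir1/dir2/mlruns/0")
# Two slashes: "dir1" is parsed as the host, not as part of the path.
bad = urlparse("file://dir1/dir2/mlruns/0")

print(repr(good.netloc), good.path)  # '' /dir1/dir2/mlruns/0
print(repr(bad.netloc), bad.path)    # 'dir1' /dir2/mlruns/0
```

This does not show where the two-slash URI comes from, only why it resolves to a path without /dir1 once it exists.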

PS - I also tried

mlflow.set_tracking_uri('/dir1/dir2/mlruns')

and it gives the same error.

Tracking information

MLflow version: 2.8.1
Tracking URI: /dir1/dir2/mlruns/
Artifact URI: file://dir1/dir2/mlruns/0/d71476fa8cd94ca096473ea7dbc4e352/artifacts
System information: Linux #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023
Python version: 3.10.6
MLflow version: 2.8.1
MLflow module location: /home/venvs/my_dev5/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: /dir1/dir2/mlruns/
Registry URI: /dir1/dir2/mlruns/
Active experiment ID: 0
Active run ID: d71476fa8cd94ca096473ea7dbc4e352
Active run artifact URI: file://dir1/dir2/mlruns/0/d71476fa8cd94ca096473ea7dbc4e352/artifacts
MLflow dependencies: 
  Flask: 3.0.0
  Jinja2: 3.1.2
  aiohttp: 3.9.0
  alembic: 1.12.1
  click: 8.1.3
  cloudpickle: 2.2.1
  databricks-cli: 0.18.0
  docker: 6.1.3
  entrypoints: 0.4
  fastapi: 0.95.0
  gitpython: 3.1.31
  gunicorn: 21.2.0
  importlib-metadata: 6.8.0
  markdown: 3.5.1
  matplotlib: 3.7.1
  numpy: 1.24.3
  packaging: 23.2
  pandas: 2.0.0
  protobuf: 4.23.4
  pyarrow: 14.0.1
  pydantic: 1.10.6
  pytz: 2023.3.post1
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.31.0
  scikit-learn: 1.3.2
  scipy: 1.10.1
  sqlalchemy: 1.4.41
  sqlparse: 0.4.4

Code to reproduce issue

import mlflow

mlflow.set_tracking_uri('file:///dir1/dir2/mlruns')  # or mlflow.set_tracking_uri('/dir1/dir2/mlruns')
client = mlflow.tracking.MlflowClient()
active_run = mlflow.start_run(run_name='a')
client.log_artifact(
    run_id=active_run.info.run_id,
    local_path='/path/to/file/to/log',
    artifact_path='something',
)

Stack trace

No error (apart from trying to save to a location that doesn't exist).

harupy commented 8 months ago

@michshap what happens when you run /dir1/dir2/mlruns?

michshap commented 8 months ago

@harupy what do you mean? when I run

mlflow ui --host 0.0.0.0 --port 8084 --backend-store-uri /dir1/dir2/mlruns/

I see the runs, their metrics, etc., as I should. The problem occurs when I try to log something as an artifact: I get a "permission denied" error on writing to /dir2/..., because /dir2 doesn't exist (the real path is /dir1/dir2). So basically I can't log any artifacts.

serena-ruan commented 8 months ago

@michshap Could you upgrade MLflow version and try again? I can't reproduce this problem on latest master

michshap commented 8 months ago

@serena-ruan I upgraded and got the same behavior. Active run artifact URI: file://dir1/dir2/mlruns/0/d71476fa8cd94ca096473ea7dbc4e352/artifacts (only two '/'), which then translates to trying to write the artifacts to "/dir2/mlruns/0/d71476fa8cd94ca096473ea7dbc4e352/artifacts" (the leading /dir1 is skipped).

In log_artifact (line 552):

    def log_artifact(self, run_id, local_path, artifact_path=None):
        """
        Write a local file or directory to the remote ``artifact_uri``.

        Args:
            local_path: Path to the file or directory to write.
            artifact_path: If provided, the directory in ``artifact_uri`` to write to.
        """
        artifact_repo = self._get_artifact_repo(run_id)

When debugging, I see that artifact_repo.artifact_dir = /dir2/mlruns/0/d71476fa8cd94ca096473ea7dbc4e352/artifacts.

What do you see as the active run artifact URI (active_run.info.artifact_uri)?

serena-ruan commented 8 months ago

>>> artifact_repo.artifact_dir
'/Users/serena.ruan/Documents/test/mlruns/0/c913c72646c94fb5b2fd8bdc3ca1110e/artifacts'
>>> artifact_repo.artifact_uri
'file:///Users/serena.ruan/Documents/test/mlruns/0/c913c72646c94fb5b2fd8bdc3ca1110e/artifacts'

My code is as this:

>>> mlflow.set_tracking_uri("file:///Users/serena.ruan/Documents/test/mlruns")
>>> with mlflow.start_run():
...     mlflow.log_artifact("/Users/serena.ruan/Documents/test/test.txt")

Are you running the same script? Providing the stack trace and your original code would be helpful.

michshap commented 8 months ago

@serena-ruan Thank you! So something is indeed off with the behavior I'm getting: one of the "/" is dropped in the artifact_uri, which later leads to the entire first directory being dropped.

When I run exactly the same code as you:

mlflow.set_tracking_uri("file:///home/my.name/mlruns")
with mlflow.start_run():
    mlflow.log_artifact("/home/my.name/example.txt")

I get:

Cell In[2], line 3
      1 mlflow.set_tracking_uri("file:///home/my.name/mlruns")
      2 with mlflow.start_run():
----> 3     mlflow.log_artifact("/home/my.name/example.txt")

File ~/venvs/my_dev5/lib/python3.10/site-packages/mlflow/tracking/fluent.py:1057, in log_artifact(local_path, artifact_path, run_id)
   1029 """
   1030 Log a local file or directory as an artifact of the currently active run. If no run is
   1031 active, this method will create a new active run.
   (...)
   1054             mlflow.log_artifact(path)
   1055 """
   1056 run_id = run_id or _get_or_start_run().info.run_id
-> 1057 MlflowClient().log_artifact(run_id, local_path, artifact_path)

File ~/venvs/my_dev5/lib/python3.10/site-packages/mlflow/tracking/client.py:1189, in MlflowClient.log_artifact(self, run_id, local_path, artifact_path)
   1150 def log_artifact(self, run_id, local_path, artifact_path=None) -> None:
   1151     """Write a local file or directory to the remote ``artifact_uri``.
   1152 
   1153     Args:
   (...)
   1187 
   1188     """
-> 1189     self._tracking_client.log_artifact(run_id, local_path, artifact_path)

File ~/venvs/my_dev5/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py:560, in TrackingServiceClient.log_artifact(self, run_id, local_path, artifact_path)
    558     artifact_repo.log_artifacts(local_path, path_name)
    559 else:
--> 560     artifact_repo.log_artifact(local_path, artifact_path)

File ~/venvs/my_dev5/lib/python3.10/site-packages/mlflow/store/artifact/local_artifact_repo.py:37, in LocalArtifactRepository.log_artifact(self, local_file, artifact_path)
     33 artifact_dir = (
     34     os.path.join(self.artifact_dir, artifact_path) if artifact_path else self.artifact_dir
     35 )
     36 if not os.path.exists(artifact_dir):
---> 37     mkdir(artifact_dir)
     38 try:
     39     shutil.copy2(local_file, os.path.join(artifact_dir, os.path.basename(local_file)))

File ~/venvs/my_dev5/lib/python3.10/site-packages/mlflow/utils/file_utils.py:212, in mkdir(root, name)
    210 except OSError as e:
    211     if e.errno != errno.EEXIST or not os.path.isdir(target):
--> 212         raise e
    213 return target

File ~/venvs/my_dev5/lib/python3.10/site-packages/mlflow/utils/file_utils.py:209, in mkdir(root, name)
    207 target = os.path.join(root, name) if name is not None else root
    208 try:
--> 209     os.makedirs(target)
    210 except OSError as e:
    211     if e.errno != errno.EEXIST or not os.path.isdir(target):

File /usr/lib/python3.10/os.py:215, in makedirs(name, mode, exist_ok)
    213 if head and tail and not path.exists(head):
    214     try:
--> 215         makedirs(head, exist_ok=exist_ok)
    216     except FileExistsError:
    217         # Defeats race condition when another thread created the path
    218         pass

File /usr/lib/python3.10/os.py:215, in makedirs(name, mode, exist_ok)
    213 if head and tail and not path.exists(head):
    214     try:
--> 215         makedirs(head, exist_ok=exist_ok)
    216     except FileExistsError:
    217         # Defeats race condition when another thread created the path
    218         pass

    [... skipping similar frames: makedirs at line 215 (1 times)]

File /usr/lib/python3.10/os.py:215, in makedirs(name, mode, exist_ok)
    213 if head and tail and not path.exists(head):
    214     try:
--> 215         makedirs(head, exist_ok=exist_ok)
    216     except FileExistsError:
    217         # Defeats race condition when another thread created the path
    218         pass

File /usr/lib/python3.10/os.py:225, in makedirs(name, mode, exist_ok)
    223         return
    224 try:
--> 225     mkdir(name, mode)
    226 except OSError:
    227     # Cannot rely on checking for EEXIST, since the operating system
    228     # could give priority to other errors like EACCES or EROFS
    229     if not exist_ok or not path.isdir(name):

PermissionError: [Errno 13] Permission denied: '/my.name'

(which makes sense, because no such location exists), and, as I wrote:

>>> artifact_repo.artifact_dir
'/my.name/mlruns/0/b3a48ff5b8ae422fa21a5a08d00212d6/artifacts'
>>> artifact_repo.artifact_uri
'file://home/my.name/mlruns/0/b3a48ff5b8ae422fa21a5a08d00212d6/artifacts'

serena-ruan commented 8 months ago

@michshap There is probably something wrong with your local file directories. Could you try the same code snippet in a new location (where no previous experiments exist)?
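One way to follow up on this suggestion is to check whether a two-slash URI is already persisted in the existing store, in which case a fresh location would behave differently. The sketch below assumes that a file-based store keeps experiment and run metadata in meta.yaml files under the tracking directory; the helper name find_suspect_uris is hypothetical and not part of MLflow.

```python
import re
from pathlib import Path

# "file://" followed by a non-slash character: the first path segment of such
# a URI would be parsed as a host name rather than as a directory.
TWO_SLASH_FILE_URI = re.compile(r"file://[^/]")


def find_suspect_uris(store_root):
    """Yield (file, line) pairs for meta.yaml lines that hold a two-slash file URI."""
    for meta in sorted(Path(store_root).rglob("meta.yaml")):
        for line in meta.read_text().splitlines():
            if TWO_SLASH_FILE_URI.search(line):
                yield meta, line.strip()
```

Calling `list(find_suspect_uris("/home/my.name/mlruns"))` on the store from the traceback above would list any metadata entries that need a closer look; an empty result would point away from stale metadata as the cause.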

github-actions[bot] commented 8 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.