iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

run-cache storage at Azure #5899

Closed rmlopes closed 12 months ago

rmlopes commented 3 years ago

Hi,

I would like to set up an external Azure remote for the run-cache. The docs do not mention how to configure it (similarly to what is done for S3 or Google Cloud). Is this not supported yet?

From https://github.com/iterative/dvc.org/issues/520#issuecomment-827500594

efiop commented 3 years ago

@rmlopes Could you elaborate, please? It feels like you are mixing up the run-cache with external outputs. The latter is not supported on Azure. You can still push a local run-cache to any remote (including Azure), though.

If you indeed meant the run-cache for the external outputs scenario, then the run-cache doesn't support it, and there are no plans to add it until we redesign external scenarios in general; right now they are highly experimental and we generally don't recommend using them.

rmlopes commented 3 years ago

@efiop Sure, indeed there seems to be some confusion on my side. I think I mean run-cache correctly, though, since the docs talk about "setting an external cache location" so that we don't have to do a commit for every experiment. (I have set up remote storage for data and intended to do the same for the run-cache; I did come across the external outputs section and noticed the disclaimer, but for now we can live without it.) In the documentation it states that to keep the run-cache in external storage you add a remote and then configure the cache, such as cache.s3 or cache.gs, but Azure is not mentioned.

Does this clarify what I mean?

efiop commented 3 years ago

> In the documentation it states that to keep the run-cache in external storage you add a remote and then configure the cache, such as cache.s3 or cache.gs, but Azure is not mentioned.

Could you point out the doc that says that, please?

During the normal workflow, when you dvc push/pull data to/from a remote, you can specify the --run-cache option, which will also transfer the run-cache (and use --pull with dvc repro to automatically try to pull the results according to the run-cache). If you are using a shared cache dir (dvc config cache.dir /path/to/dir), the run-cache will be shared automatically between everyone using that cache dir; no extra actions are needed.
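
For example, roughly (the shared cache path below is just a placeholder):

    dvc push --run-cache                         # upload data plus the run-cache to the default remote
    dvc pull --run-cache                         # download data plus the run-cache on another machine
    dvc repro --pull                             # let repro try to restore stage results via the run-cache
    dvc config cache.dir /path/to/shared/cache   # or share one cache dir between clones instead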

So I think you got confused by the "external data management" doc, which really has nothing to do with the run-cache. It would be great if you could point out the particular docs that confused you.

rmlopes commented 3 years ago

The quote is from the config cache docs, but it seems I got there from the "external data management" doc, as you said. If I understood you correctly, when we use the --run-cache flag on a push, it will add the run-cache to the remote blob storage (the same one used for data and models if we are not using a registry repository), and a person or agent pulling afterwards can choose whether to include that run-cache as well.

I am still trying to put it all together for our specific use case, but I think what I am actually looking for is more of an external dependency; I am still not sure. We need to manually label some videos, and these videos and the corresponding annotations would be stored in a cloud storage account. I want to be able to run the CI/CD without having to download the data (or at least to download it video by video).

rmlopes commented 3 years ago

I have set up two repositories: one is a data registry (data-registry), and the other is a project using data/models from that registry (mlops-dvc); both use Azure blobs as the remote. I have properly configured the data-registry and added files tracked by DVC.

Inside the data-registry project I can do dvc list, but not with the -R option; in that case I get an auth error:

ERROR: failed to list 'ssh://user@company-repo/mlops-dvc-registry.git' - Authentication to Azure Blob Storage via None failed.

Inside the mlops-dvc project I can list the registry (with the same caveat), but I cannot import from it, as that outputs the same connection error as posted above. I want to have different remotes (both Azure, but that should be irrelevant), yet even if I configure exactly the same remote for both projects I still get the connection error.
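
Roughly, what I am attempting from the mlops-dvc side looks like this (the URL and paths are illustrative):

    # List the registry contents (this is where -R fails with the auth error)
    dvc list -R ssh://user@company-repo/mlops-dvc-registry.git

    # Import a dataset from the registry into the project
    dvc import ssh://user@company-repo/mlops-dvc-registry.git data/videos -o data/videos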

What am I missing here?

efiop commented 3 years ago

> The quote is from the config cache docs, but it seems I got there from the "external data management" doc, as you said. If I understood you correctly, when we use the --run-cache flag on a push, it will add the run-cache to the remote blob storage (the same one used for data and models if we are not using a registry repository), and a person or agent pulling afterwards can choose whether to include that run-cache as well.

> I am still trying to put it all together for our specific use case, but I think what I am actually looking for is more of an external dependency; I am still not sure. We need to manually label some videos, and these videos and the corresponding annotations would be stored in a cloud storage account. I want to be able to run the CI/CD without having to download the data (or at least to download it video by video).

It seems you are talking about a labeling scenario, which we don't have native support for yet, but we are looking into it right now. CC @volkfox

You could use external dependencies with dvc run -d, though, e.g. dvc run -d azure://bucket/path ... or dvc run -d remote://myremote/path. Those don't require an external cache.
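
A minimal sketch, assuming a hypothetical stage name, container, paths, and script; as I understand it, DVC only tracks whether the external path changed, and the stage command itself is expected to read it:

    # External dependency referenced by URL
    dvc run -n label_stats \
        -d azure://videos/annotations.json \
        -d src/stats.py \
        -o reports/stats.json \
        python src/stats.py

    # Equivalent form referencing an already-configured DVC remote
    dvc run -n label_stats -d remote://myremote/annotations.json ...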

> Inside the data-registry project I can do dvc list, but not with the -R option; in that case I get an auth error:

Could you add -v and post a verbose log, please? Also, please post the output of dvc doctor.

rmlopes commented 3 years ago

Sure.

(base) ➜  mlops-temp git:(master) dvc list ssh://user@bitbucket-repo/mlops-temp.git -v
2021-04-30 23:05:30,686 DEBUG: Creating external repo ssh://user@bitbucket-repo/mlops-temp.git@None
2021-04-30 23:05:30,688 DEBUG: erepo: git clone 'ssh://user@bitbucket-repo/mlops-temp.git' to a temporary dir
.dvcignore                                                                                                                                              
data
(base) ➜  mlops-temp git:(master) dvc list -R ssh://user@bitbucket-repo/mlops-temp.git -v
2021-04-30 23:03:35,061 DEBUG: Creating external repo ssh://user@bitbucket-repo/mlops-temp.git@None
2021-04-30 23:03:35,062 DEBUG: erepo: git clone 'ssh://user@bitbucket-repo/mlops-temp.git' to a temporary dir
2021-04-30 23:03:36,318 DEBUG: Preparing to download data from 'azure://temp/temp'                                                                      
2021-04-30 23:03:36,318 DEBUG: Preparing to collect status from azure://temp/temp
2021-04-30 23:03:36,319 DEBUG: Collecting information from local cache...
2021-04-30 23:03:36,319 DEBUG: Collecting information from remote cache...                                                                              
2021-04-30 23:03:36,320 DEBUG: Matched '0' indexed hashes
2021-04-30 23:03:36,320 DEBUG: Querying 1 hashes via object_exists
2021-04-30 23:03:36,321 ERROR: failed to list 'ssh://user@bitbucket-repo/mlops-temp.git' - Authentication to Azure Blob Storage via None failed.
Learn more about configuration settings at <https://man.dvc.org/remote/modify>: unable to connect to account for Must provide either a connection_string or account_name with credentials!!
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/adlfs/spec.py", line 505, in do_connect
    raise ValueError(
ValueError: Must provide either a connection_string or account_name with credentials!!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/azure.py", line 123, in fs
    file_system = AzureBlobFileSystem(**self.fs_args)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/fsspec/spec.py", line 69, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/adlfs/spec.py", line 411, in __init__
    self.do_connect()
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/adlfs/spec.py", line 510, in do_connect
    raise ValueError(f"unable to connect to account for {e}")
ValueError: unable to connect to account for Must provide either a connection_string or account_name with credentials!!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/command/ls/__init__.py", line 30, in run
    entries = Repo.ls(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/repo/ls.py", line 38, in ls
    ret = _ls(repo.repo_fs, path_info, recursive, dvc_only)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/repo/ls.py", line 57, in _ls
    for root, dirs, files in fs.walk(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/repo.py", line 367, in walk
    yield from self._walk(repo_walk, dvc_walk, dvcfiles=dvcfiles)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/repo.py", line 303, in _walk
    yield from self._walk(repo_walk, dvc_walk, dvcfiles=dvcfiles)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/repo.py", line 305, in _walk
    yield from self._dvc_walk(dvc_walk)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/repo.py", line 231, in _dvc_walk
    root, dirs, files = next(walk)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/dvc.py", line 195, in walk
    yield from self._walk(root, trie, topdown=topdown, **kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/dvc.py", line 169, in _walk
    yield from self._walk(root / dname, trie)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/dvc.py", line 169, in _walk
    yield from self._walk(root / dname, trie)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/dvc.py", line 150, in _walk
    self._add_dir(trie, out, **kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/dvc.py", line 138, in _add_dir
    self._fetch_dir(out, **kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/dvc.py", line 132, in _fetch_dir
    out.get_dir_cache(**kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/output/base.py", line 498, in get_dir_cache
    self.repo.cloud.pull(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/data_cloud.py", line 88, in pull
    return remote.pull(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/remote/base.py", line 56, in wrapper
    return f(obj, *args, **kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/remote/base.py", line 486, in pull
    ret = self._process(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/remote/base.py", line 323, in _process
    dir_status, file_status, dir_contents = self._status(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/remote/base.py", line 175, in _status
    self.hashes_exist(
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/remote/base.py", line 132, in hashes_exist
    return indexed_hashes + self.odb.hashes_exist(list(hashes), **kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/objects/db/base.py", line 380, in hashes_exist
    remote_hashes = self.list_hashes_exists(hashes, jobs, name)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/objects/db/base.py", line 338, in list_hashes_exists
    ret = list(itertools.compress(hashes, in_remote))
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/objects/db/base.py", line 329, in exists_with_progress
    ret = self.fs.exists(path_info)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/fsspec_wrapper.py", line 94, in exists
    return self.fs.exists(self._with_bucket(path_info))
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/funcy/objects.py", line 50, in __get__
    return prop.__get__(instance, type)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/usr/local/Cellar/dvc/2.0.18/libexec/lib/python3.9/site-packages/dvc/fs/azure.py", line 129, in fs
    raise AzureAuthError(
dvc.fs.azure.AzureAuthError: Authentication to Azure Blob Storage via None failed.
Learn more about configuration settings at <https://man.dvc.org/remote/modify>
------------------------------------------------------------
2021-04-30 23:03:36,335 DEBUG: Analytics is enabled.
2021-04-30 23:03:36,562 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/hr/rkq97pts6hl9t_qmfb1l7l880000gp/T/tmp6blhdltn']'
2021-04-30 23:03:36,565 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/hr/rkq97pts6hl9t_qmfb1l7l880000gp/T/tmp6blhdltn']'
(base) ➜  mlops-temp git:(master) dvc doctor
DVC version: 2.0.18 (brew)
---------------------------------
Platform: Python 3.9.4 on macOS-11.2.3-x86_64-i386-64bit
Supports: azure, gdrive, gs, http, https, s3, ssh, oss, webdav, webdavs
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: azure
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

isidentical commented 3 years ago

@rmlopes Did you set any configuration variables for the Azure remote (via dvc remote modify), such as a connection string or an account_name + account_key combination?
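
For example, one of the following (the remote name and values are placeholders; --local keeps secrets out of the committed .dvc/config):

    # Option 1: a full connection string
    dvc remote modify --local myazure connection_string 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...'

    # Option 2: account name plus account key
    dvc remote modify myazure account_name 'myaccount'
    dvc remote modify --local myazure account_key 'mysecretkey'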

rmlopes commented 3 years ago

@isidentical I am using a connection string, and I have tried with both repos (the registry and the main repo) having the exact same config for the remote (including the connection string). Note that in the output above I am only doing it from the data-registry side, and the non-recursive list works as expected.