iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Endpoint URL is not taken into account when adding an external file from Minio #4151

Closed lucasmaheo closed 4 years ago

lucasmaheo commented 4 years ago

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version
WARNING: Unable to detect supported link types, as cache directory '.dvc/cache' doesn't exist. It is usually auto-created by commands such as `dvc add/fetch/pull/run/import`, but you could create it manually to enable this check.
DVC version: 1.1.2
Python version: 3.7.7
Platform: Linux-4.20.17-042017-generic-x86_64-with-debian-stretch-sid
Binary: False
Package: pip
Supported remotes: http, https, s3
Repo: dvc, git
Filesystem type (workspace): ('ext4', '/dev/sda2')

Additional Information (if any):

I am trying out DVC and cannot get it to work with a local deployment of Minio. Minio is hosted at 127.0.0.1:9000 and works as expected; I have tested it.

Contents of .dvc/config:

[cache]
    s3 = s3cache
['remote "s3cache"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000
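
For reference, the same configuration could also be produced with DVC's own commands instead of editing .dvc/config by hand (the remote and bucket names simply mirror the config above):

dvc remote add s3cache s3://mybucket
dvc remote modify s3cache endpointurl http://127.0.0.1:9000
dvc config cache.s3 s3cache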

Logs:

$ dvc add s3://mybucket/textfile --external --verbose
2020-07-02 09:05:55,227 DEBUG: fetched: [(3,)]
2020-07-02 09:05:55,583 DEBUG: fetched: [(0,)]
2020-07-02 09:05:55,587 ERROR: unexpected error - An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/main.py", line 53, in main
    ret = cmd.run()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/command/add.py", line 22, in run
    external=self.args.external,
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/add.py", line 91, in add
    stage.save()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage/__init__.py", line 380, in save
    self.save_outs()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage/__init__.py", line 391, in save_outs
    out.save()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 253, in save
    if not self.exists:
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 189, in exists
    return self.remote.tree.exists(self.path_info)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/remote/s3.py", line 133, in exists
    return self.isfile(path_info) or self.isdir(path_info)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/remote/s3.py", line 166, in isfile
    self.s3.head_object(Bucket=path_info.bucket, Key=path_info.path)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/botocore/client.py", line 637, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

After some investigation, DVC does seem to take the configuration and the endpointurl into account in general. However, on this specific boto3 request it does not. I did not dig much further into the code to find out why the two s3 clients are generated from different configurations.

Configuration for the failing request:

{'url_path': '/mybucket/textfile', 'query_string': {}, 'method': 'HEAD', 'headers': {'User-Agent': 'Boto3/1.14.14 Python/3.7.7 Linux/4.20.17-042017-generic Botocore/1.17.14'}, 'body': b'', 'url': 'https://s3.amazonaws.com/mybucket/textfile', 'context': {'client_region': 'us-east-1', 'client_config': <botocore.config.Config object at 0x7f4fd6aaf310>, 'has_streaming_input': False, 'auth_type': None, 'signing': {'bucket': 'mybucket'}, 'timestamp': '20200702T130555Z'}}

Configuration loaded by DVC at some point during the call:

{'url': 's3://mybucket', 'endpointurl': 'http://127.0.0.1:9000', 'use_ssl': True, 'listobjects': False}
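
To illustrate the difference, here is a minimal boto3 sketch (not DVC code; it assumes MinIO credentials are already configured) showing how the endpoint depends on whether endpoint_url is passed when the client is created - the failing request above behaves like the first client:

import boto3

# Without endpoint_url, boto3 falls back to the public AWS endpoint,
# which is why the HEAD request above went to https://s3.amazonaws.com.
s3_default = boto3.client("s3")

# A client built from the remote config would pass the MinIO endpoint explicitly.
s3_minio = boto3.client("s3", endpoint_url="http://127.0.0.1:9000")
s3_minio.head_object(Bucket="mybucket", Key="textfile")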

Any idea as to why this behaviour is happening?

Thanks, Lucas

efiop commented 4 years ago

@lucasmaheo That happens because you use a direct s3 URL in your dvc add command. What you should use instead is remote:// addressing. E.g. you already have the s3cache remote, but you could define a similar separate one and use it. E.g. with s3cache you could:

dvc add remote://s3cache/textfile

but I would suggest something like:

[cache]
    s3 = s3cache
['remote "mys3"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000
['remote "s3ache"']
    url = remote://mys3/cache

and then just

dvc add remote://mys3/path/to/file

:slightly_smiling_face:
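
For completeness, a possible command-line equivalent of the config sketched above (remote names taken from that snippet) would be:

dvc remote add mys3 s3://mybucket
dvc remote modify mys3 endpointurl http://127.0.0.1:9000
dvc remote add s3cache remote://mys3/cache
dvc config cache.s3 s3cache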

The external workspaces scenario is admittedly not very polished right now and has some flaws, so we've created https://github.com/iterative/dvc/issues/3920 to discuss how it should be changed for the better.

skshetry commented 4 years ago

Duplicate of #1280

skshetry commented 4 years ago

Duplicate of #3441

Also, we have an outstanding docs issue: https://github.com/iterative/dvc.org/issues/108

lucasmaheo commented 4 years ago

Thank you for the explanation @efiop. Indeed, this was a misunderstanding on my part. Why are these instructions not on the dvc add page? At any rate, this fixed it.

If anyone else ends up on this issue, the following command worked:

dvc add remote://s3cache/somefile --external

efiop commented 4 years ago

@lucasmaheo The reason is that this functionality is considered very advanced and not polished enough, so it is hard to describe it nicely in the docs :( But we do have a few tickets about explaining the remote:// notation in the docs. Btw, could you elaborate on your scenario? Maybe you don't actually need this functionality.

lucasmaheo commented 4 years ago

My scenario is hypothetical at this point. The typical use case would be versioning data files as well as ML models in scalable storage. Usually, our projects use cloud storage (or on-premise, cloud-like storage) as a single reference for data. We are looking for a solution to version those voluminous datasets efficiently, and DVC seems to fit the bill.

efiop commented 4 years ago

@lucasmaheo Sounds like that is indeed not the best approach. The first problem here is isolation - if any user of your dvc repo runs dvc checkout, the data on s3 will change for everyone, which is really bad practice unless you really know what you are doing. I would suggest storing your data in a regular dvc repo and accessing it through dvc. We have things like dvc get/import and even a python API https://dvc.org/doc/api-reference that allow you to access your data using a human-readable name for the artifact that you need. See https://dvc.org/doc/use-cases/data-registries .
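
As a sketch of that pattern (the registry URL and file path below are placeholders, not a real repo), data tracked in a separate DVC repo can be consumed like this:

dvc get https://github.com/example/dataset-registry data/textfile      # plain download
dvc import https://github.com/example/dataset-registry data/textfile   # download and track provenance in a .dvc file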

lucasmaheo commented 4 years ago

Oh thanks for the clarification. That is indeed not the behaviour I was looking for.

So DVC registries are to be used with local copies, as I understand it - exactly in the same way as git repositories, except with the added possibility of selecting only the required files. To avoid the local copies, we need to use the API. It all makes sense now.

At least I now know how to create the data registry. I expect that using dvc.api.open() with rev left blank will read the version of the data that was committed with the current revision of the local Git repository.
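
A minimal sketch of that expectation (the registry URL and file path are placeholders):

import dvc.api

# With rev left as None, the data version associated with the current Git revision
# (or the registry's default branch when repo points at a remote URL) is read.
with dvc.api.open("data/textfile", repo="https://github.com/example/dataset-registry") as fd:
    contents = fd.read()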

Now, is there a way to stream outputs to the remote registry? Suppose I were reading data from S3 and iteratively producing a transformed version of that data. If the outputs do not fit on disk, I would prefer to write to another location in S3 and, after the whole process, push that version of the data to the registry. Is that a feature you are looking into (pushing from a remote location)?

efiop commented 4 years ago

Now is there a way to stream outputs to the remote registry?

@lucasmaheo You mean kinda like publish them there? Currently there is no way to push it back like that :slightly_frowning_face: But we've heard requests for that. In my mind it should be some special operation that does a "straight to remote" action. So... something like

dvc add data --from s3://bucket/path --push --no-download (sorry for lame options, just talking out of my head right now)

that would create data.dvc as if you had downloaded the file by hand and then run dvc add data, but instead of actually downloading it to your disk it would stream the data from s3://bucket/path, compute the needed hash on-the-fly and upload it to our remote on-the-fly. Clearly, in this approach we would still use network traffic to stream the file, but at least we wouldn't use your local storage. That could also be avoided if the cloud could provide us with a real md5, but that is another topic for discussion.
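
A rough Python sketch of that "hash on the fly" idea (purely illustrative, not DVC's implementation; bucket and key are placeholders):

import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="http://127.0.0.1:9000")

# Stream the object in chunks and hash it without ever writing it to local disk.
md5 = hashlib.md5()
body = s3.get_object(Bucket="mybucket", Key="path/data")["Body"]
for chunk in body.iter_chunks(chunk_size=8 * 1024 * 1024):
    md5.update(chunk)

# This hash is what a .dvc file could record; uploading to the DVC remote would
# need a second streamed pass or a multipart upload alongside this loop.
print(md5.hexdigest())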

efiop commented 4 years ago

I feel like that ^ covers most of the misuses we've seen. Maybe it is even worth doing that by default when someone tries to feed a URL to dvc add. E.g.

dvc add s3://bucket/path/data

would just create data.dvc and stream s3://bucket/path/data to compute the hash and push to the default remote. Not sure...