@lucasmaheo That happens because you use a direct s3 URL in your `dvc add` command. What you should use instead is remote:// addressing. E.g. right now you already have the s3cache remote, but you could define a similar separate one and use it. E.g. with s3cache you could:
dvc add remote://s3cache/textfile
but I would suggest something like:
[cache]
    s3 = s3cache
['remote "mys3"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000
['remote "s3cache"']
    url = remote://mys3/cache
and then just
dvc add remote://mys3/path/to/file
:slightly_smiling_face:
The external workspaces scenario is admittedly not very polished right now and has some flaws, so we've created https://github.com/iterative/dvc/issues/3920 to discuss how it should be changed for the better.
Duplicate of #1280
Duplicate of #3441
Also, we have an outstanding docs issue: https://github.com/iterative/dvc.org/issues/108
Thank you for the explanation @efiop. Indeed, this was a misunderstanding on my part. Why don't these instructions end up on the `dvc add` page? At any rate, this fixed it.
If anyone else ends up on this issue, the following command worked:
dvc add remote://s3cache/somefile --external
@lucasmaheo The reason is that this functionality is considered very advanced and not polished enough, so it is even hard to describe nicely in the docs :( But we do have a few tickets for explaining the remote:// notation in the docs. Btw, could you elaborate on your scenario? Maybe you don't actually need this functionality.
My scenario is hypothetical at this point. The typical use case would be to version data files as well as ML models in scalable storage. Usually, our projects use cloud storage (or on-premise, cloud-like storage) as a single reference for data. We are looking for a solution to efficiently version those voluminous datasets, and DVC seems to fit the bill.
@lucasmaheo Sounds like that is indeed not the best approach. The first problem here is isolation: if any user of your dvc repo runs `dvc checkout`, the data on s3 will change for everyone, which is really bad practice unless you know exactly what you are doing. I would suggest storing your data in a regular dvc repo and accessing it through dvc. We have things like `dvc get`/`dvc import` and even a Python API https://dvc.org/doc/api-reference that allow you to access your data using a human-readable name for the artifact that you need. See https://dvc.org/doc/use-cases/data-registries .
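For illustration, a minimal sketch of that registry pattern; the repo URL and artifact path below are placeholder assumptions, not taken from this issue:

```python
# Read an artifact from a DVC "data registry" repo by its human-readable path,
# without hard-coding any s3/Minio URLs. The registry resolves where the bytes
# actually live. Repo URL and path are hypothetical placeholders.
import dvc.api

with dvc.api.open(
    "data/prepared/train.csv",                           # artifact name inside the registry
    repo="https://github.com/example/dataset-registry",  # hypothetical registry repo
) as f:
    first_line = f.readline()
```

The CLI equivalents are `dvc get` (download a copy) and `dvc import` (download and track it as a dependency in your own repo).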
Oh thanks for the clarification. That is indeed not the behaviour I was looking for.
So DVC registries are meant to be used with local copies, as I understand it. Exactly the same way as git repositories, except for the added possibility to select only the required files. To circumvent that, we need to use the API. It all makes sense now.
At least I now know how to create the data registry. I am expecting that using dvc.api.open() with rev left blank should read from the version of the data that was committed with the current revision of the local Git repository.
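For concreteness, a minimal sketch of that expectation (the path is just a placeholder):

```python
# With rev omitted, dvc.api.open() resolves the artifact against the current
# revision of the local repo it is called from. "data/textfile" is a placeholder.
import dvc.api

with dvc.api.open("data/textfile") as f:
    contents = f.read()
```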
Now is there a way to stream outputs to the remote registry? Supposing I was reading data from S3 and producing a transformed version of that data iteratively. If the outputs do not fit on disk, I would prefer to output to another location in S3 and after the whole process, push that version of data to the registry. Is that a feature you are looking into (pushing from a remote location)?
> Now is there a way to stream outputs to the remote registry?
@lucasmaheo You mean kinda like publishing them there? Currently there is no way to push it back like that :slightly_frowning_face: But we've heard requests for that. In my mind it should be some special operation that does a "straight to remote" action. So... something like
dvc add data --from s3://bucket/path --push --no-download (sorry for the lame options, just talking off the top of my head right now)
that would create data.dvc as if you had downloaded it by hand and then `dvc add`ed it, but it wouldn't actually download anything to your disk; rather it would stream the data from s3://bucket/path, compute the needed hash on-the-fly and upload it to your remote on-the-fly. Clearly, in this approach, we would still use network traffic to stream the file, but at least we wouldn't use your local storage. That could also be avoided if the cloud could provide us with a real md5, but that is another topic for discussion.
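To illustrate just the on-the-fly hashing part, here is a hedged sketch using boto3 directly (this is not DVC code; bucket, key and endpoint are placeholders):

```python
# Hash an S3/Minio object while streaming it, so nothing has to be written to
# local disk first; the same chunks could be re-uploaded to the DVC remote as
# they are read. Bucket/key/endpoint are placeholder values.
import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="http://127.0.0.1:9000")  # e.g. a local Minio
obj = s3.get_object(Bucket="mybucket", Key="path/data")

md5 = hashlib.md5()
for chunk in obj["Body"].iter_chunks(chunk_size=8 * 1024 * 1024):
    md5.update(chunk)  # hash computed chunk by chunk, nothing stored locally

print(md5.hexdigest())  # roughly the checksum that would end up in data.dvc
```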
I feel like that ^ covers most of the misuses that we've seen. Maybe it is even worth doing that by default when someone tries to feed a URL to `dvc add`. E.g.
dvc add s3://bucket/path/data
would just create data.dvc and stream s3://bucket/path/data to compute the hash and push to the default remote. Not sure...
Bug Report
Please provide information about your setup
Output of dvc version:
Additional Information (if any):
I was trying out DVC and I cannot make it work with a local deployment of Minio. Minio is hosted at 127.0.0.1:9000 and works as expected; I tested it.
Contents of .dvc/config:
Logs:
After some investigation, dvc does seem to take the configuration and the endpointurl into account. However, on this specific boto3 request it does not. I did not dig much deeper into the code to find out why the two s3 clients are generated from different configurations.
Configuration for the failing request:
Configuration loaded by DVC at some point during the call:
Any idea as to why this behaviour is happening?
Thanks, Lucas