iterative / dvc.org

📖 DVC website and documentation
https://dvc.org
Apache License 2.0

Managing External Data / Microsoft Azure Blob Storage #1934

Closed eedokl closed 3 years ago

eedokl commented 3 years ago

UPDATE: Jump to https://github.com/iterative/dvc.org/issues/1934#issuecomment-725444300

Hi, I am currently evaluating DVC for use in my project. We are working with Microsoft Azure Blob Storage, and I was able to set up DVC to push my data to Azure and pull it from there. An import-url such as dvc import-url azure://playground/test21323_43.parquet also works fine for me.

However, I also wanted to try the methods

dvc remote add azurecache azure://playground/cache
dvc config cache.azure azurecache
dvc add --external azure://playground/test21323_42.parquet

but here I get stuck with this error message:

(mybuild) > dvc add --external azure://playground/test21323_43.parquet -v
2020-11-11 09:34:37,670 DEBUG: Check for update is enabled.
2020-11-11 09:34:37,945 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-11-11 09:34:38,096 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-11-11 09:34:38,101 DEBUG: fetched: [(3,)]
Adding...
2020-11-11 09:34:39,447 DEBUG: fetched: [(41,)]
2020-11-11 09:34:39,455 ERROR: output 'azure:\playground\test21323_43.parquet' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\command\add.py", line 22, in run
    external=self.args.external,
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\repo\__init__.py", line 54, in wrapper
    return f(repo, *args, **kwargs)
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\repo\scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\repo\add.py", line 90, in add
    stage.save()
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\stage\__init__.py", line 386, in save
    self.save_outs(allow_missing=allow_missing)
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\stage\__init__.py", line 398, in save_outs
    out.save()
  File "c:\users\llk7rt\.conda\envs\mybuild\lib\site-packages\dvc\output\base.py", line 255, in save
    raise self.DoesNotExistError(self)
dvc.output.base.OutputDoesNotExistError: output 'azure:\playground\test21323_43.parquet' does not exist
------------------------------------------------------------
2020-11-11 09:34:39,475 DEBUG: Analytics is disabled.

I get the same error with the following command (note that the documentation page omits -n, which is required):

dvc run -v -n ext_data -d test2123x.parquet --external -o azure://playground/test2123x.parquet az storage blob upload -f test2123x.parquet -c playground -n test2123x.parquet

2020-11-11 09:45:05,830 DEBUG: Check for update is enabled.
2020-11-11 09:45:06,096 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-11-11 09:45:06,245 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-11-11 09:45:06,252 DEBUG: fetched: [(3,)]
2020-11-11 09:45:07,709 DEBUG: Removing output 'azure:\playground\test2123x.parquet' of stage: 'ext_data'.
2020-11-11 09:45:07,744 DEBUG: Path 'C:\0_Repos\bst_dataengineering_ml\bst_dataengineering_ml\test2123x.parquet' inode '4765027818447448764'
2020-11-11 09:45:07,745 DEBUG: fetched: [('1605010673356305664', '207122', 'e2ce0f2f8c016fc40d7ff08dcfa0921f', '1605084273460453376')]
2020-11-11 09:45:07,747 DEBUG: {'test2123x.parquet': 'modified'}
2020-11-11 09:45:07,750 DEBUG: Path '...test2123x.parquet' inode '4765027818447448764'
2020-11-11 09:45:07,750 DEBUG: fetched: [('1605010673356305664', '207122', 'e2ce0f2f8c016fc40d7ff08dcfa0921f', '1605084307746182144')]
2020-11-11 09:45:07,753 DEBUG: Path '...test2123x.parquet' inode '4765027818447448764'
2020-11-11 09:45:07,754 DEBUG: fetched: [('1605010673356305664', '207122', 'e2ce0f2f8c016fc40d7ff08dcfa0921f', '1605084307751170304')]
Running stage 'ext_data' with command:
        az storage blob upload -f test2123x.parquet -c playground -n test2123x.parquet
2020-11-11 09:45:07,758 DEBUG: fetched: [(43,)]
Finished[#############################################################]  100.0000%
{
  "etag": "\"0x8D8861E1A3FD39F\"",
  "lastModified": "2020-11-11T08:45:12+00:00"
}
2020-11-11 09:45:12,117 DEBUG: fetched: [(3,)]
2020-11-11 09:45:12,123 DEBUG: Path 'C:\0_Repos\bst_dataengineering_ml\bst_dataengineering_ml\test2123x.parquet' inode '4765027818447448764'
2020-11-11 09:45:12,124 DEBUG: fetched: [('1605010673356305664', '207122', 'e2ce0f2f8c016fc40d7ff08dcfa0921f', '1605084307755159552')]
2020-11-11 09:45:12,127 DEBUG: fetched: [(43,)]
2020-11-11 09:45:12,134 ERROR: output 'azure:\playground\test2123x.parquet' does not exist
-

One difference between these logs and the ones I get with import-url is that I don't see any connection string debug output here, so maybe Azure is not recognized properly?
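
A possible reading of the backslashes in the error message ('azure:\playground\...'), sketched in Python. This is only an illustration under an assumption: that the URL ended up being normalized as a local Windows path somewhere along the way. It does not show DVC's actual code, and the helper names url_scheme and as_windows_path are hypothetical. The sketch contrasts parsing the string as a URL (where the azure scheme is recognized) with normalizing it as a Windows path (where the scheme is lost and the separators are rewritten).

```python
import ntpath
from urllib.parse import urlsplit


def url_scheme(url: str) -> str:
    """Return the scheme when the string is parsed as a URL."""
    return urlsplit(url).scheme


def as_windows_path(url: str) -> str:
    """Normalize the string as if it were a local Windows path.

    ntpath is the Windows flavor of os.path and can be imported on any OS.
    Assumption: this mimics what happens if the URL is mistaken for a path.
    """
    return ntpath.normpath(url)


if __name__ == "__main__":
    url = "azure://playground/test21323_43.parquet"
    print(url_scheme(url))       # azure
    print(as_windows_path(url))  # azure:\playground\test21323_43.parquet
```

The second result has the same shape as the path in the ERROR line above, which would be consistent with the output not being treated as an Azure URL at that point.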

pared commented 3 years ago

@eedokl thanks for reporting the issue! I am looking into it.

"note that in the documentation page -n is missing which is required"

Could you point us to the documentation page that is missing -n? It's a leftover from the pre-1.0 version of DVC that we must have overlooked.

EDIT: Found the command in question when fixing docs.

pared commented 3 years ago

@eedokl Unfortunately, we do not support external outputs for Azure Blob Storage. It seems we have a mistake in our docs, because they say that we do support it. I will fix that.

eedokl commented 3 years ago

@pared Thanks a lot for the clarification and the quick response. I suspected as much when I saw no file pointing to Azure in the respective file list.

jorgeorpinel commented 3 years ago

I guess we can move this to the docs repo?

pared commented 3 years ago

@jorgeorpinel you are right, closing.

jorgeorpinel commented 3 years ago

Well I had moved it to dvc.org and linked it to #1927 but either way, same result. Thanks @pared