mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Incorrect container name in download_from_azure #727

Closed: jaehwana2z closed this issue 1 month ago

jaehwana2z commented 1 month ago

Environment

To reproduce

Steps to reproduce the behavior:

  1. remote_dir = 'azure://account_name.blob.core.windows.net/container_name/path/to/blob'
  2. dataset = StreamingDataset(local=local_dir, remote=remote_dir, batch_size=1, split=None, shuffle=True)

Expected behavior

Line 295 of download_from_azure in download.py parses the URL as follows:

blob_client = service.get_blob_client(container=obj.netloc, blob=obj.path.lstrip('/'))

where, for the example path in step 1 above, obj.netloc == 'account_name.blob.core.windows.net' and obj.path.lstrip('/') == 'container_name/path/to/blob'.

However, this is incorrect: to download the blob, container should be container_name and blob should be path/to/blob.
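The mismatch can be checked with the standard library alone. This is a minimal sketch, assuming obj in download.py is the result of urllib.parse.urlparse on the remote URL; the variable names current_container and current_blob are illustrative and not taken from the library.

```python
from urllib.parse import urlparse

# Example URL from step 1 (placeholder account, container, and blob names).
remote = 'azure://account_name.blob.core.windows.net/container_name/path/to/blob'
obj = urlparse(remote)

# Values the current line would pass as container= and blob=.
# (Illustrative names; assumed to mirror the expression quoted above.)
current_container = obj.netloc
current_blob = obj.path.lstrip('/')

print(current_container)  # the storage account host, not the container
print(current_blob)       # still prefixed with the container name
```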

Fixing the line to

directories = obj.path.lstrip('/').split('/')
blob_client = service.get_blob_client(container=directories[0], blob='/'.join(directories[1:]))

solves the issue for me.
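The same split can be sketched as a small standalone helper. The function name split_container_and_blob is hypothetical and is not part of the streaming library; it assumes URLs of the form shown in step 1, where the first path segment is the container, and it adds a check for a missing container or blob that the one-line fix above does not have.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_container_and_blob(remote: str) -> Tuple[str, str]:
    """Split an azure:// URL into (container, blob).

    Hypothetical helper: assumes the first path segment is the container
    and the rest of the path is the blob name.
    """
    obj = urlparse(remote)
    # partition splits on the first '/' only, so the blob path keeps its
    # own slashes, like '/'.join(directories[1:]) in the fix above.
    container, _, blob = obj.path.lstrip('/').partition('/')
    if not container or not blob:
        raise ValueError(f'Expected a container and a blob path in: {remote}')
    return container, blob


container, blob = split_container_and_blob(
    'azure://account_name.blob.core.windows.net/container_name/path/to/blob')
print(container, blob)
```

The two returned values are what would be passed as container= and blob= to service.get_blob_client in place of obj.netloc and the full path.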

Additional context

snarayan21 commented 1 month ago

@jaehwana2z Hey, thanks for flagging! Mind submitting a PR? We always welcome community improvements, and happy to review.