pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

CSV Downloads Fail for ADLS Gen2 with Azure CLI Authentication #17711

Open VincentSaelzlerFRA opened 1 month ago

VincentSaelzlerFRA commented 1 month ago

Checks

Reproducible example

import polars as pl

# Azure Data Lake Storage Gen2
STORAGE_ACCOUNT = "myaccount"
CONTAINER = "mycontainer"
STORAGE_OPTIONS = {"use_azure_cli": "True"}

df = pl.read_csv(
    source=f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/example.csv",
    storage_options=STORAGE_OPTIONS,
)

Log output

Traceback (most recent call last):
[some stuff redacted for privacy, then read_csv()]
  File "/home/vscode/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/polars/io/csv/functions.py", line 422, in read_csv
    with prepare_file_arg(
  File "/home/vscode/.local/lib/python3.12/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/fsspec/spec.py", line 1303, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/adlfs/spec.py", line 1833, in _open
    return AzureBlobFile(
           ^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/adlfs/spec.py", line 1959, in __init__
    if not hasattr(self, "details"):
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/fsspec/spec.py", line 1755, in details
    self._details = self.fs.info(self.path)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/vscode/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/adlfs/spec.py", line 623, in _info
    props = await bc.get_blob_properties(version_id=version_id)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/core/tracing/decorator_async.py", line 94, in wrapper_use_tracer
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 783, in get_blob_properties
    process_storage_error(error)
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/storage/blob/_shared/response_handlers.py", line 182, in process_storage_error
    exec("raise error from None")   # pylint: disable=exec-used # nosec
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 773, in get_blob_properties
    blob_props = await self._client.blob.get_properties(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/core/tracing/decorator_async.py", line 94, in wrapper_use_tracer
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/storage/blob/_generated/aio/operations/_blob_operations.py", line 487, in get_properties
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/home/vscode/.local/lib/python3.12/site-packages/azure/core/exceptions.py", line 161, in map_error
    raise error
azure.core.exceptions.ClientAuthenticationError: Operation returned an invalid status 'Server failed to authenticate the request. Please refer to the information in the www-authenticate header.'
ErrorCode:NoAuthenticationInformation

Issue description

The failure happens because a call is made to get blob properties without passing any credentials.

Specifically, it is a HEAD request to https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER/example.csv

I am sure that the Azure CLI credentials are working in my environment, because replacing read_csv with read_parquet results in a successful file download.

df = pl.read_parquet(
    source=f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/example.csv",
    storage_options=STORAGE_OPTIONS,
)
# The error below is expected, because the CSV file is not in parquet format.
# Crucially, it shows the file was downloaded and its contents inspected:
# => polars.exceptions.ComputeError: parquet: File out of specification: incorrect magic in parquet footer

Also, I have been successfully using read_parquet on parquet files in the same storage container using the same credentials without issue.
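As an editorial aside (not part of the original report), one way to confirm the CLI credential independently of polars is to fetch the same blob properties that adlfs requests, using the Azure SDK directly. The account, container, and file names below are placeholders, and the SDK usage is a sketch, not something verified in this thread:

```python
# Hypothetical standalone check: confirm the Azure CLI credential can fetch
# the blob properties that adlfs requests. Names are placeholders.

def blob_https_url(account: str, container: str, path: str) -> str:
    # The HEAD request in the traceback goes to the blob endpoint, even
    # though the polars source path uses the abfss:// (dfs) endpoint.
    return f"https://{account}.blob.core.windows.net/{container}/{path}"

if __name__ == "__main__":
    # Requires `pip install azure-identity azure-storage-blob` and `az login`.
    from azure.identity import AzureCliCredential
    from azure.storage.blob import BlobClient

    client = BlobClient.from_blob_url(
        blob_https_url("myaccount", "mycontainer", "example.csv"),
        credential=AzureCliCredential(),
    )
    # Succeeds only if the CLI credential is accepted for this request.
    print(client.get_blob_properties().size)
```

If this call succeeds while `pl.read_csv` fails, the problem is in how the credential is forwarded, not in the credential itself.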

Expected behavior

The CSV file contents would be loaded into a dataframe.

Installed versions

```
--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:               3.12.3 (main, May 14 2024, 07:44:45) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:         1.6.0
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:             2.8.2
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

Also, adlfs==2024.4.1

nameexhaustion commented 1 month ago

We don't have support for inlining the authentication into the path. Please pass the authentication information in the storage options instead; see https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants for the supported keys.

ritchie46 commented 1 month ago

@nameexhaustion can we add a section on authentication to the Polars user guide? Better to save readers the indirection through the object-store docs and show it directly on our side.

VincentSaelzlerFRA commented 1 month ago

@nameexhaustion thanks for the prompt reply!

please pass the authentication information in the storage options

Unfortunately, none of those authentication details are available for me to pass.

That's because "use_azure_cli": "True" is the authentication information. Per the documentation link you sent, that key specifies that Polars should "Use azure cli for acquiring access token".

Passing extra parameters about the environment would be possible, if that helps. For example, things like

nameexhaustion commented 1 month ago

@VincentSaelzlerFRA, could you try using scan_csv(..).collect() instead of read_csv? I had a look, and it seems read_csv currently does not go through our native cloud downloading code path.

VincentSaelzlerFRA commented 1 month ago

@nameexhaustion using scan_csv(..).collect() succeeded. Thanks for the workaround!

Updated minimal working example:

import polars as pl

# Azure Data Lake Storage Gen2
STORAGE_ACCOUNT = "myaccount"
CONTAINER = "mycontainer"
STORAGE_OPTIONS = {"use_azure_cli": "True"}

lf = pl.scan_csv(
    source=f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/example.csv",
    storage_options=STORAGE_OPTIONS,
)
df = lf.collect()