ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.77k stars 5.75k forks source link

[Data] RayTaskError(AttributeError) Can't get attribute 'CacheOptions._reconstruct' on <module 'pyarrow.lib' #48592

Open francesco086 opened 2 hours ago

francesco086 commented 2 hours ago

What happened + What you expected to happen

The bug

Context:

What happens:

Expected behavior

Script should be successful using the kubernetes ray cluster

Useful information

Full log (I manually replaced STORAGE_ACCOUNT, STORAGE_CONTAINER, path-to-folder):

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1730874439.457085 28057169 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1730874442.352737 28055470 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
Metadata Fetch Progress 0:   0%|          | 0.00/5.00 [00:34<?, ? task/s]
Metadata Fetch Progress 0:   0%|          | 0.00/5.00 [14:50<?, ? task/s]
---------------------------------------------------------------------------
RayTaskError(AttributeError)              Traceback (most recent call last)
File /Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:3
      [1](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:1) # %%
----> [3](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:3) ds = ray.data.read_parquet(
      [4](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:4)     paths="az://STORAGE_CONTAINER/path-to-folder/",
      [5](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:5)     filesystem=adlfs.AzureBlobFileSystem(account_name="STORAGE_ACCOUNT", anon=False),
      [6](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:6) )
      [8](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:8) # ray.data.read_parquet()
     [10](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/explore.py:10) print(ds.schema())

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:743, in read_parquet(paths, filesystem, columns, parallelism, ray_remote_args, tensor_column_schema, meta_provider, partition_filter, partitioning, shuffle, include_paths, file_extensions, concurrency, override_num_blocks, **arrow_parquet_args)
    [741](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:741) _block_udf = arrow_parquet_args.pop("_block_udf", None)
    [742](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:742) schema = arrow_parquet_args.pop("schema", None)
--> [743](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:743) datasource = ParquetDatasource(
    [744](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:744)     paths,
    [745](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:745)     columns=columns,
    [746](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:746)     dataset_kwargs=dataset_kwargs,
    [747](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:747)     to_batch_kwargs=arrow_parquet_args,
    [748](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:748)     _block_udf=_block_udf,
    [749](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:749)     filesystem=filesystem,
    [750](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:750)     schema=schema,
    [751](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:751)     meta_provider=meta_provider,
    [752](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:752)     partition_filter=partition_filter,
    [753](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:753)     partitioning=partitioning,
    [754](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:754)     shuffle=shuffle,
    [755](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:755)     include_paths=include_paths,
    [756](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:756)     file_extensions=file_extensions,
    [757](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:757) )
    [758](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:758) return read_datasource(
    [759](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:759)     datasource,
    [760](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:760)     parallelism=parallelism,
   (...)
    [763](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:763)     override_num_blocks=override_num_blocks,
    [764](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/read_api.py:764) )

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:282, in ParquetDatasource.__init__(self, paths, columns, dataset_kwargs, to_batch_kwargs, _block_udf, filesystem, schema, meta_provider, partition_filter, partitioning, shuffle, include_paths, file_extensions)
    [272](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:272)     else:
    [273](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:273)         # Use the scheduling strategy ("SPREAD" by default) provided in
    [274](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:274)         # `DataContext``, to spread out prefetch tasks in cluster, avoid
    [275](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:275)         # AWS S3 throttling error.
    [276](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:276)         # Note: this is the same scheduling strategy used by read tasks.
    [277](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:277)         prefetch_remote_args[
    [278](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:278)             "scheduling_strategy"
    [279](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:279)         ] = DataContext.get_current().scheduling_strategy
    [281](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:281)     self._metadata = (
--> [282](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:282)         meta_provider.prefetch_file_metadata(
    [283](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:283)             pq_ds.fragments, **prefetch_remote_args
    [284](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:284)         )
    [285](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:285)         or []
    [286](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:286)     )
    [287](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:287) except OSError as e:
    [288](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py:288)     _handle_read_os_error(e, paths)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:150, in ParquetMetadataProvider.prefetch_file_metadata(self, fragments, **ray_remote_args)
    [141](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:141)     def fetch_func(fragments):
    [142](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:142)         return _fetch_metadata_serialization_wrapper(
    [143](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:143)             fragments,
    [144](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:144)             # Ensure that retry settings are propagated to remote tasks.
   (...)
    [147](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:147)             retry_max_interval=RETRY_MAX_BACKOFF_S_FOR_META_FETCH_TASK,
    [148](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:148)         )
--> [150](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:150)     raw_metadata = list(
    [151](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:151)         _fetch_metadata_parallel(
    [152](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:152)             fragments,
    [153](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:153)             fetch_func,
    [154](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:154)             FRAGMENTS_PER_META_FETCH,
    [155](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:155)             **ray_remote_args,
    [156](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:156)         )
    [157](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:157)     )
    [158](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:158) else:
    [159](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py:159)     raw_metadata = _fetch_metadata(fragments)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py:411, in _fetch_metadata_parallel(uris, fetch_func, desired_uris_per_task, **ray_remote_args)
    [409](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py:409)         continue
    [410](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py:410)     fetch_tasks.append(remote_fetch_func.remote(uri_chunk))
--> [411](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py:411) results = metadata_fetch_bar.fetch_until_complete(fetch_tasks)
    [412](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py:412) yield from itertools.chain.from_iterable(results)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/progress_bar.py:185, in ProgressBar.fetch_until_complete(self, refs)
    [183](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/progress_bar.py:183)     fetch_local = False
    [184](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/progress_bar.py:184) total_rows_processed = 0
--> [185](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/progress_bar.py:185) for ref, result in zip(done, ray.get(done)):
    [186](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/progress_bar.py:186)     ref_to_result[ref] = result
    [187](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/_internal/progress_bar.py:187)     num_rows = extract_num_rows(result)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py:21, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     [18](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py:18) @wraps(fn)
     [19](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py:19) def auto_init_wrapper(*args, **kwargs):
     [20](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py:20)     auto_init_ray()
---> [21](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py:21)     return fn(*args, **kwargs)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:102, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
     [98](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:98) if client_mode_should_convert():
     [99](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:99)     # Legacy code
    [100](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:100)     # we only convert init function if RAY_CLIENT_MODE=1
    [101](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:101)     if func.__name__ != "init" or is_client_mode_enabled_by_default:
--> [102](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:102)         return getattr(ray, func.__name__)(*args, **kwargs)
    [103](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py:103) return func(*args, **kwargs)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:42, in _ClientAPI.get(self, vals, timeout)
     [35](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:35) def get(self, vals, *, timeout=None):
     [36](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:36)     """get is the hook stub passed on to replace `ray.get`
     [37](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:37) 
     [38](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:38)     Args:
     [39](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:39)         vals: [Client]ObjectRef or list of these refs to retrieve.
     [40](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:40)         timeout: Optional timeout in milliseconds
     [41](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:41)     """
---> [42](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/api.py:42)     return self.worker.get(vals, timeout=timeout)

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:433, in Worker.get(self, vals, timeout)
    [431](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:431)     op_timeout = max_blocking_operation_time
    [432](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:432) try:
--> [433](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:433)     res = self._get(to_get, op_timeout)
    [434](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:434)     break
    [435](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:435) except GetTimeoutError:

File ~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:461, in Worker._get(self, ref, timeout)
    [459](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:459)         logger.exception("Failed to deserialize {}".format(chunk.error))
    [460](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:460)         raise
--> [461](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:461)     raise err
    [462](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:462) if chunk.total_size > OBJECT_TRANSFER_WARNING_SIZE and log_once(
    [463](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:463)     "client_object_transfer_size_warning"
    [464](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:464) ):
    [465](https://file+.vscode-resource.vscode-cdn.net/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/~/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/util/client/worker.py:465)     size_gb = chunk.total_size / 2**30

RayTaskError(AttributeError): ray::fetch_func() (pid=211460, ip=10.224.15.29)
  File "/Users/cesco/Code/bonus-optimization/source-data/signup-data/dac/venv/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py", line 142, in fetch_func
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/datasource/parquet_meta_provider.py", line 174, in _fetch_metadata_serialization_wrapper
    deserialized_fragments = _deserialize_fragments_with_retry(fragments)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 502, in _deserialize_fragments_with_retry
    return call_with_retry(
           ^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 986, in call_with_retry
    raise e from None
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 973, in call_with_retry
    return f()
           ^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 503, in <lambda>
    lambda: _deserialize_fragments(fragments),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 124, in _deserialize_fragments
    return [p.deserialize() for p in serialized_fragments]
            ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 114, in deserialize
    (file_format, path, filesystem, partition_expression) = cloudpickle.loads(
                                                            ^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'CacheOptions._reconstruct' on <module 'pyarrow.lib' from '/home/ray/anaconda3/lib/python3.12/site-packages/pyarrow/lib.cpython-312-x86_64-linux-gnu.so'>

Versions / Dependencies

Python version used by the script:

Python 3.12.7

Libraries used by the script:

adlfs==2024.7.0
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
annotated-types==0.7.0
appnope==0.1.4
asttokens==2.4.1
attrs==24.2.0
azure-core==1.32.0
azure-datalake-store==0.0.53
azure-identity==1.19.0
azure-storage-blob==12.23.1
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.4.0
click==8.1.7
comm==0.2.2
cryptography==43.0.3
debugpy==1.8.7
decorator==5.1.1
executing==2.1.0
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.10.0
grpcio==1.67.1
idna==3.10
ipykernel==6.29.5
ipython==8.29.0
isodate==0.7.2
jedi==0.19.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-client==8.6.3
jupyter-core==5.7.2
matplotlib-inline==0.1.7
modin==0.32.0
msal==1.31.0
msal-extensions==1.2.0
msgpack==1.1.0
multidict==6.1.0
multimethod==1.10
mypy-extensions==1.0.0
nest-asyncio==1.6.0
numpy==2.1.3
packaging==24.1
pandas==2.2.3
pandera==0.20.4
parso==0.8.4
pexpect==4.9.0
platformdirs==4.3.6
portalocker==2.10.1
prompt-toolkit==3.0.48
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
ptyprocess==0.7.0
pure-eval==0.2.3
pyarrow==18.0.0
pycparser==2.22
pydantic==2.9.2
pydantic-core==2.23.4
pygments==2.18.0
pyjwt==2.9.0
python-dateutil==2.9.0.post0
pytz==2024.2
pyyaml==6.0.2
pyzmq==26.2.0
ray==2.38.0
referencing==0.35.1
requests==2.32.3
rpds-py==0.20.1
six==1.16.0
stack-data==0.6.3
tornado==6.4.1
tqdm==4.66.6
traitlets==5.14.3
typeguard==4.4.1
typing-extensions==4.12.2
typing-inspect==0.9.0
tzdata==2024.2
urllib3==2.2.3
wcwidth==0.2.13
wrapt==1.16.0
yarl==1.17.1

OS used: MacOS 14.7

values.yaml used to deploy the ray cluster using the helm chart kuberay/ray-cluster --version 1.2.2:

image:
  tag: 2.38.0-py312
head:
  resources:
    requests:
      cpu: 1
      memory: 2G
    limits:
      cpu: 4
      memory: 4G
worker:
  replicas: 1
  minReplicas: 1
  maxReplicas: 32
  nodeSelector: {}
  resources:
    requests:
      cpu: 1
      memory: 2G
    limits:
      cpu: 1
      memory: 2G

Reproduction script

Script:

import adlfs
import ray

ray.init()

ds = ray.data.read_parquet(
    paths="az://STORAGE_CONTAINER/path-to-folder/",
    filesystem=adlfs.AzureBlobFileSystem(account_name="STORAGE_ACCOUNT", anon=False),
)

I use the env variable RAY_ADDRESS to control the cluster to use.

Issue Severity

High: It blocks me from completing my task.

francesco086 commented 2 hours ago

I tried the same, but loading a single parquet file instead of all the parquet in a folder, and it still fails.