Closed: skirui-source closed this pull request 10 months ago
Seeing the error below:
RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[20], line 2
1 tic = timer()
----> 2 X_train, y_train, X_infer, y_infer = taxi_data_loader(
3 client,
4 adlsaccount="azureopendatastorage",
5 adlspath="az://nyctlc/yellow/puYear=2014/puMonth=1*/*.parquet",
6 infer_frac=0.1,
7 random_state=42,
8 )
9 toc = timer()
10 print(f"Wall clock time taken for ETL and persisting : {toc-tic} s")
Cell In[19], line 95, in taxi_data_loader(client, adlsaccount, adlspath, response_dtype, infer_frac, random_state)
93 response_id = "fareAmount"
94 storage_options = {"account_name": adlsaccount}
---> 95 taxi_data = dask_cudf.read_parquet(
96 adlspath,
97 storage_options=storage_options,
98 chunksize=25e6,
99 npartitions=len(workers),
100 )
101 taxi_data = clean(taxi_data, must_haves)
102 taxi_data = taxi_data.map_partitions(add_features)
File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask_cudf/io/parquet.py:539, in read_parquet(path, columns, **kwargs)
536 kwargs["read"] = {}
537 kwargs["read"]["check_file_size"] = check_file_size
--> 539 return dd.read_parquet(path, columns=columns, engine=CudfEngine, **kwargs)
File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
136 return func(*args, **kwargs)
137 except Exception as e:
--> 138 raise type(e)(
139 f"An error occurred while calling the {funcname(func)} "
140 f"method registered to the {self.backend} backend.\n"
141 f"Original Message: {e}"
142 ) from e
RuntimeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments
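Since the message points at mismatched Scheduler and Client environments, one way to narrow it down is to compare the package versions each side reports. Below is a minimal sketch, not something from this PR: `find_version_mismatches` and `report_mismatches` are hypothetical helper names, and the code assumes the nested layout that `distributed.Client.get_versions()` returns (`"client"`, `"scheduler"`, and `"workers"` entries, each carrying a `"packages"` mapping).

```python
def find_version_mismatches(versions):
    """Return {package: {role: version}} for packages whose versions differ.

    `versions` is assumed to have the shape returned by
    distributed.Client.get_versions(): "client" and "scheduler" entries plus
    a "workers" mapping, each with a "packages" dict of name -> version.
    """
    # Collect the package table reported by each role.
    per_role = {
        "client": versions["client"]["packages"],
        "scheduler": versions["scheduler"]["packages"],
    }
    for address, info in versions.get("workers", {}).items():
        per_role[f"worker {address}"] = info["packages"]

    # Keep only the packages where at least two roles disagree.
    mismatches = {}
    all_packages = set().union(*(pkgs.keys() for pkgs in per_role.values()))
    for package in sorted(all_packages):
        seen = {role: pkgs.get(package) for role, pkgs in per_role.items()}
        if len(set(seen.values())) > 1:
            mismatches[package] = seen
    return mismatches


def report_mismatches(client):
    """Hypothetical convenience wrapper: pass in the notebook's `client`."""
    return find_version_mismatches(client.get_versions())
```

Calling `report_mismatches(client)` in the notebook before `taxi_data_loader` should return an empty dict when the environments agree. `client.get_versions(check=True)` is the built-in alternative; how it reports a mismatch (warning or exception) may depend on the installed `distributed` version.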
Resolved the ForestInference load issue; all cells are now working correctly! I will still need to clear all outputs, as they include some of my personal info, like my email.
@jacobtomlinson this PR is ready for another review/merge. Please take a look when you can.
Fixes #211 - migration of azure_mnmg_daskcloudprovider notebook
See #203 (comment) for detailed migration instructions.
Tasks