rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Migrate `azure/notebooks/Azure-MNMG-XGBoost.ipynb` to deployment docs #253

Closed skirui-source closed 10 months ago

skirui-source commented 1 year ago

Fixes #211 - migration of the `azure_mnmg_daskcloudprovider` notebook.

See #203 (comment) for detailed migration instructions.


review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

skirui-source commented 1 year ago

I'm seeing the error below when running the ETL cell:

RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[20], line 2
      1 tic = timer()
----> 2 X_train, y_train, X_infer, y_infer = taxi_data_loader(
      3     client,
      4     adlsaccount="azureopendatastorage",
      5     adlspath="az://nyctlc/yellow/puYear=2014/puMonth=1*/*.parquet",
      6     infer_frac=0.1,
      7     random_state=42,
      8 )
      9 toc = timer()
     10 print(f"Wall clock time taken for ETL and persisting : {toc-tic} s")

Cell In[19], line 95, in taxi_data_loader(client, adlsaccount, adlspath, response_dtype, infer_frac, random_state)
     93 response_id = "fareAmount"
     94 storage_options = {"account_name": adlsaccount}
---> 95 taxi_data = dask_cudf.read_parquet(
     96     adlspath,
     97     storage_options=storage_options,
     98     chunksize=25e6,
     99     npartitions=len(workers),
    100 )
    101 taxi_data = clean(taxi_data, must_haves)
    102 taxi_data = taxi_data.map_partitions(add_features)

File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask_cudf/io/parquet.py:539, in read_parquet(path, columns, **kwargs)
    536         kwargs["read"] = {}
    537     kwargs["read"]["check_file_size"] = check_file_size
--> 539 return dd.read_parquet(path, columns=columns, engine=CudfEngine, **kwargs)

File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    136     return func(*args, **kwargs)
    137 except Exception as e:
--> 138     raise type(e)(
    139         f"An error occurred while calling the {funcname(func)} "
    140         f"method registered to the {self.backend} backend.\n"
    141         f"Original Message: {e}"
    142     ) from e

RuntimeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments
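
As the message suggests, this error typically means the client's package versions don't match the scheduler/worker environment (e.g., a local rapids-23.08 conda env talking to cluster VMs built from a different image). A quick way to confirm is Dask's built-in version check; a minimal sketch, assuming a cluster is already running (the scheduler address below is a placeholder):

```python
from dask.distributed import Client

# Connect to the running cluster (address shown is illustrative).
client = Client("tcp://scheduler-address:8786")

# Compare Python and package versions across the client, scheduler,
# and workers; check=True raises an error on any mismatch.
client.get_versions(check=True)
```

With dask-cloudprovider, mismatches like this are usually avoided by launching the cluster from the same RAPIDS container image (or conda environment) the client notebook runs in.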
skirui-source commented 11 months ago

Resolved the ForestInference load issue; all cells are now working correctly! However, I will need to clear all outputs, since they include some of my personal info (e.g., my email).
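
For context, the ForestInference load in question uses cuML's FIL API. A minimal sketch along those lines (the file name and flags are illustrative, not the notebook's exact call, based on the cuml 23.08-era API):

```python
from cuml import ForestInference

# Load a trained XGBoost model into FIL for GPU-accelerated inference.
# The file name and parameters below are placeholders.
fil_model = ForestInference.load(
    filename="xgboost.model",
    model_type="xgboost",
    output_class=False,  # regression target (e.g. fareAmount), not classification
)

# X_infer is the held-out inference frame produced earlier by taxi_data_loader.
predictions = fil_model.predict(X_infer)
```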

skirui-source commented 11 months ago

@jacobtomlinson this PR is ready for another review/merge. Please take a look when you can.