rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Migrate `azure/notebooks/Azure-MNMG-XGBoost.ipynb` to deployment docs #253

Closed skirui-source closed 10 months ago

skirui-source commented 1 year ago

Fixes #211 - migration of the `azure_mnmg_daskcloudprovider` notebook.

See #203 (comment) for detailed migration instructions.


review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

skirui-source commented 1 year ago

I'm seeing the error below when running the ETL cell:

RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[20], line 2
      1 tic = timer()
----> 2 X_train, y_train, X_infer, y_infer = taxi_data_loader(
      3     client,
      4     adlsaccount="azureopendatastorage",
      5     adlspath="az://nyctlc/yellow/puYear=2014/puMonth=1*/*.parquet",
      6     infer_frac=0.1,
      7     random_state=42,
      8 )
      9 toc = timer()
     10 print(f"Wall clock time taken for ETL and persisting : {toc-tic} s")

Cell In[19], line 95, in taxi_data_loader(client, adlsaccount, adlspath, response_dtype, infer_frac, random_state)
     93 response_id = "fareAmount"
     94 storage_options = {"account_name": adlsaccount}
---> 95 taxi_data = dask_cudf.read_parquet(
     96     adlspath,
     97     storage_options=storage_options,
     98     chunksize=25e6,
     99     npartitions=len(workers),
    100 )
    101 taxi_data = clean(taxi_data, must_haves)
    102 taxi_data = taxi_data.map_partitions(add_features)

File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask_cudf/io/parquet.py:539, in read_parquet(path, columns, **kwargs)
    536         kwargs["read"] = {}
    537     kwargs["read"]["check_file_size"] = check_file_size
--> 539 return dd.read_parquet(path, columns=columns, engine=CudfEngine, **kwargs)

File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    136     return func(*args, **kwargs)
    137 except Exception as e:
--> 138     raise type(e)(
    139         f"An error occurred while calling the {funcname(func)} "
    140         f"method registered to the {self.backend} backend.\n"
    141         f"Original Message: {e}"
    142     ) from e

RuntimeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments
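
As the message suggests, this error typically means the client's package versions don't match the scheduler/worker environment (e.g., a local rapids-23.08 conda env talking to cluster VMs built from a different image). A quick way to confirm is Dask's built-in version check; a minimal sketch, assuming a cluster is already running (the scheduler address below is a placeholder):

```python
from dask.distributed import Client

# Connect to the running cluster (address shown is illustrative).
client = Client("tcp://scheduler-address:8786")

# Compare Python and package versions across the client, scheduler,
# and workers; check=True raises an error on any mismatch.
client.get_versions(check=True)
```

With dask-cloudprovider, mismatches like this are usually avoided by launching the cluster from the same RAPIDS container image (or conda environment) the client notebook runs in.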
skirui-source commented 11 months ago

Resolved the ForestInference load issue; all cells are now working correctly! However, I will need to clear all outputs, since they include some of my personal info (e.g., my email).
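
For context, the ForestInference load in question uses cuML's FIL API. A minimal sketch along those lines (the file name and flags are illustrative, not the notebook's exact call, based on the cuml 23.08-era API):

```python
from cuml import ForestInference

# Load a trained XGBoost model into FIL for GPU-accelerated inference.
# The file name and parameters below are placeholders.
fil_model = ForestInference.load(
    filename="xgboost.model",
    model_type="xgboost",
    output_class=False,  # regression target (e.g. fareAmount), not classification
)

# X_infer is the held-out inference frame produced earlier by taxi_data_loader.
predictions = fil_model.predict(X_infer)
```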

skirui-source commented 11 months ago

@jacobtomlinson this PR is ready for another review/merge. Please take a look when you can.