rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] install Rapid framework with AWS CLUSTER VIA DASK: Mismatched versions found #11766

Closed parkerzf closed 1 year ago

parkerzf commented 1 year ago

Describe the bug

Following the instructions at https://rapids.ai/cloud.html#aws, section AWS CLUSTER VIA DASK, I get a "Mismatched versions found" error. Is this deployment approach outdated?

Steps/Code to reproduce bug

Client node: p3.2xlarge, with RAPIDS installed using the latest docker command:

docker pull rapidsai/rapidsai-dev:22.08-cuda11.5-devel-ubuntu20.04-py3.9
docker run --gpus all --rm -it \
    --shm-size=1g --ulimit memlock=-1 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai-dev:22.08-cuda11.5-devel-ubuntu20.04-py3.9

ECS cluster: one p3.2xlarge instance. All the setup steps completed and the cluster was created successfully. However, the following code produces the error:

from dask.distributed import Client
client = Client(cluster)

import dask, cudf, dask_cudf
ddf = dask.datasets.timeseries()
gdf = ddf.map_partitions(cudf.from_pandas)
gdf.groupby('name').id.count().compute().head()

Error:

CancelledError: ('getitem-fe751d5251afed463bc49ee3f67eabe0', 0)
/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/client.py:1274: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+----------------+
| Package     | client         | scheduler      | workers        |
+-------------+----------------+----------------+----------------+
| blosc       | MISSING        | None           | None           |
| cloudpickle | 2.1.0          | 1.6.0          | 1.6.0          |
| dask        | 2022.7.1       | 2021.04.0      | 2021.04.0      |
| distributed | 2022.7.1       | 2021.04.0      | 2021.04.0      |
| lz4         | 4.0.0          | None           | None           |
| msgpack     | 1.0.4          | 1.0.2          | 1.0.2          |
| numpy       | 1.22.4         | 1.20.2         | 1.20.2         |
| pandas      | 1.4.3          | MISSING        | MISSING        |
| python      | 3.9.13.final.0 | 3.7.10.final.0 | 3.7.10.final.0 |
| toolz       | 0.12.0         | 0.11.1         | 0.11.1         |
+-------------+----------------+----------------+----------------+
Notes:
-  msgpack: Variation is ok, as long as everything is above 0.6
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))


beckernick commented 1 year ago

cc @mmccarty @jacobtomlinson, in case you've run into this during recent cloud deployment work

jacobtomlinson commented 1 year ago

Could you share how you are creating the cluster object?

parkerzf commented 1 year ago

Yes of course. I followed this blog post mostly:

  1. Create an ECS cluster with one p3.2xlarge instance.
  2. Create another p3.2xlarge instance in the same VPC as the ECS cluster, to act as the client.
  3. Use the docker command to install RAPIDS on the client.
  4. Install the dask_cloudprovider library and run the Python script to start a Dask cluster from the client. Everything is fine so far.
  5. Test RAPIDS on the client using the Python script. This is where I get the version mismatch error.

I guess it is because the dask_cloudprovider is outdated, so it uses much older versions of the Python packages than the client I installed using the docker command.

jacobtomlinson commented 1 year ago

Here's my comment on the duplicate over on https://github.com/rapidsai/cugraph/issues/2734

We are in the process of migrating the https://rapids.ai/cloud.html#aws page to https://docs.rapids.ai/deployment, which will contain richer and more up-to-date documentation. But AFAIK the instructions you mentioned should be more or less up to date.

If you are following the dask-cloudprovider instructions, I expect you need to set the Docker image option in ECSCluster to match the Docker image you are using locally. That mismatch will be why you're getting a version mismatch error in your Python environment.

The LocalCUDACluster version that @taureandyernv shared works because everything happens on your p3.2xlarge instance inside the RAPIDS container, so all versions are consistent. The dask-cloudprovider instructions let you burst beyond the p3.2xlarge instance and provision more GPU nodes on ECS, so you need to ensure the environment provisioned on those additional nodes matches the one on your instance.
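One way to see exactly which packages differ is to compare the version report that the Dask client collects. The sketch below is a minimal, hypothetical helper (`find_version_mismatches` is not part of Dask). It assumes the dictionary layout returned by `Client.get_versions()`, with `client`, `scheduler`, and `workers` entries that each carry a `packages` mapping. The sample values are copied from the table in the report above, and the worker address is a placeholder.

```python
def find_version_mismatches(versions):
    """Return the packages whose version differs between client, scheduler, and workers.

    `versions` is assumed to have the layout of `Client.get_versions()`:
    {"client": {"packages": {...}}, "scheduler": {"packages": {...}},
     "workers": {address: {"packages": {...}}, ...}}
    """
    client_pkgs = versions["client"]["packages"]
    scheduler_pkgs = versions["scheduler"]["packages"]
    worker_pkgs = [info["packages"] for info in versions["workers"].values()]

    # Collect every package name reported by any node.
    names = set(client_pkgs) | set(scheduler_pkgs)
    for pkgs in worker_pkgs:
        names |= set(pkgs)

    mismatches = {}
    for name in sorted(names):
        # A package missing on a node shows up as None in the comparison.
        seen = {client_pkgs.get(name), scheduler_pkgs.get(name)}
        seen |= {pkgs.get(name) for pkgs in worker_pkgs}
        if len(seen) > 1:
            mismatches[name] = {
                "client": client_pkgs.get(name),
                "scheduler": scheduler_pkgs.get(name),
                "workers": sorted({str(pkgs.get(name)) for pkgs in worker_pkgs}),
            }
    return mismatches


# A few rows from the mismatch table in this issue; the worker address is hypothetical.
old_env = {"dask": "2021.04.0", "python": "3.7.10.final.0", "toolz": "0.11.1"}
sample = {
    "client": {"packages": {"dask": "2022.7.1", "python": "3.9.13.final.0", "toolz": "0.12.0"}},
    "scheduler": {"packages": dict(old_env)},
    "workers": {"tcp://worker-1:8786": {"packages": dict(old_env)}},
}

for package, found in find_version_mismatches(sample).items():
    print(package, found)
```

With a live connection, the same helper could be called as `find_version_mismatches(client.get_versions())`. An empty result would indicate that the client, scheduler, and workers report the same package versions.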

A quick question before we dig too deep into your deployment problem: do you need to burst beyond your EC2 instance onto ECS, or would choosing a larger EC2 instance and sticking with LocalCUDACluster be suitable and cause less friction?

jacobtomlinson commented 1 year ago

I guess it is because the dask_cloudprovider is outdated

I don't think this is because it is outdated. I think it's because the defaults use the ghcr.io/dask/dask:latest Docker image, and you need to specify the RAPIDS one like you did when starting up the p3.2xlarge instance.

parkerzf commented 1 year ago

Hey @jacobtomlinson, thanks for the reply! Actually, I am struggling to load large data into RAPIDS. Here are the tickets: https://github.com/rapidsai/cudf/issues/11796 and https://github.com/rapidsai/cugraph/issues/2694. That's why I am trying to use a GPU cluster with more GPU memory. We will also run on a much bigger graph later, so setting up a GPU cluster is necessary.

Back to this question: according to the code, the default image is rapidsai/rapidsai:latest, which should be the same as what I installed on the client. However, it seems that an older version is installed on the scheduler and workers.

jacobtomlinson commented 1 year ago

You should be using the same Docker image that you are using locally, so if your local docker run uses rapidsai/rapidsai-dev:22.08-cuda11.5-devel-ubuntu20.04-py3.9 then your cloudprovider call should use the same.
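As a rough illustration of that advice, the sketch below pins the image passed to the cloudprovider call to the tag used in the local `docker run` command from this issue. It assumes the `cluster_arn`, `n_workers`, `worker_gpu`, and `image` keyword arguments of `dask_cloudprovider.aws.ECSCluster`. The helper names `ecs_cluster_options` and `start_cluster` are hypothetical, and the cluster ARN must be supplied by the caller.

```python
# Image tag taken from the docker run command used on the client in this thread.
RAPIDS_IMAGE = "rapidsai/rapidsai-dev:22.08-cuda11.5-devel-ubuntu20.04-py3.9"


def ecs_cluster_options(cluster_arn, n_workers=1, worker_gpu=1, image=RAPIDS_IMAGE):
    """Collect ECSCluster keyword arguments, pinning the scheduler/worker image."""
    return {
        "cluster_arn": cluster_arn,
        "n_workers": n_workers,
        "worker_gpu": worker_gpu,
        "image": image,
    }


def start_cluster(cluster_arn):
    """Create the ECS-backed Dask cluster and connect a client.

    Needs dask-cloudprovider installed and valid AWS credentials, so the
    imports are deferred until the function is actually called.
    """
    from dask.distributed import Client
    from dask_cloudprovider.aws import ECSCluster

    cluster = ECSCluster(**ecs_cluster_options(cluster_arn))
    return cluster, Client(cluster)
```

Keeping the image tag in a single constant makes it easier to update the client container and the ECS tasks together when moving to a newer RAPIDS release.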

GregoryKimball commented 1 year ago

Please feel free to create a new issue if you run into more trouble!