Closed: parkerzf closed this issue 1 year ago
cc @mmccarty @jacobtomlinson, in case you've run into this during recent cloud deployment work
Could you share how you are creating the cluster object?
Yes of course. I followed this blog post mostly: I created a `p3.2xlarge` instance for the ECS cluster, and another `p3.2xlarge` instance in the same VPC as the client. I installed the `dask_cloudprovider` lib on the client and ran the python script to start a dask cluster. Everything is fine so far. Then I connect to the cluster from the client using the python script, and I get the version mismatch issue. I guess it is because `dask_cloudprovider` is outdated and uses much older versions of the python packages than the client environment I installed using the docker command.
Here's my comment on the duplicate over on https://github.com/rapidsai/cugraph/issues/2734
We are in the process of migrating the https://rapids.ai/cloud.html#aws page to https://docs.rapids.ai/deployment, which will contain more up-to-date and richer documentation. But AFAIK the instructions you mentioned should be more or less current.
If you are following the dask-cloudprovider instructions, I expect you need to set the Docker image option in `ECSCluster` to match the Docker image you are using locally; that mismatch is why you're getting a version error in your Python environment.
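As an illustrative sketch (the image tag and worker counts below are assumptions, not values from this thread), pinning the scheduler/worker image to the same one used locally might look like:

```python
# Sketch: pass the same RAPIDS image you run locally to ECSCluster, so the
# scheduler/worker containers get the same Python environment as the client.
# The tag below is illustrative -- use whatever your local `docker run` uses.
RAPIDS_IMAGE = "rapidsai/rapidsai:latest"

cluster_kwargs = {
    "image": RAPIDS_IMAGE,  # container image option for ECSCluster
    "n_workers": 2,         # example values, not a recommendation
    "worker_gpu": 1,
}

# Actually creating the cluster needs AWS credentials and a configured
# ECS environment, so the call is left commented out:
# from dask_cloudprovider.aws import ECSCluster
# cluster = ECSCluster(**cluster_kwargs)

print(cluster_kwargs["image"])
```

The key point is that the `image` value and the image used for the local `docker run` come from one shared constant, so they cannot drift apart.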
The `LocalCUDACluster` version that @taureandyernv shared works because everything happens on your `p3.2xlarge` instance inside the RAPIDS container, so all versions are consistent. The dask-cloudprovider instructions let you burst beyond the `p3.2xlarge` instance and provision more GPU nodes on ECS, so you need to ensure the environment provisioned on those additional nodes matches the one on your instance.
A quick question before we dig too deep into your deployment problem: do you need to burst beyond your EC2 instance onto ECS, or would choosing a larger EC2 instance and sticking with `LocalCUDACluster` be suitable and cause less friction?
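For reference, the single-node alternative mentioned above can be sketched as follows. The instance type is just an example of a larger GPU instance, and the cluster lines are commented out because they require `dask_cuda` and a GPU:

```python
# Hypothetical single-node setup: keep everything inside one RAPIDS
# container on a larger EC2 instance, so cross-node version skew
# cannot occur in the first place.
INSTANCE_TYPE = "p3.8xlarge"  # example of a larger instance than p3.2xlarge

# from dask_cuda import LocalCUDACluster
# from dask.distributed import Client
#
# cluster = LocalCUDACluster()  # one worker per GPU visible on this instance
# client = Client(cluster)

print(INSTANCE_TYPE)
```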
> I guess it is because the dask_cloudprovider is outdated

I don't think this is because it is outdated; I think it's because the defaults use the `ghcr.io/dask/dask:latest` Docker image, and you need to specify the RAPIDS one like you did when starting up the `p3.2xlarge` instance.
Hey @jacobtomlinson, thanks for the reply! Actually I am struggling to load large data into the RAPIDS frameworks. Here are the tickets: https://github.com/rapidsai/cudf/issues/11796 and https://github.com/rapidsai/cugraph/issues/2694. That's why I am trying to use a GPU cluster with more GPU memory. Also, we will run on much bigger graphs later, so setting up a GPU cluster is necessary.
Back to this question: according to the code, the default image is `rapidsai/rapidsai:latest`, which should be the same as what I installed on the client. However, it seems that an older version is installed on the scheduler/workers.
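The mismatch Dask reports comes down to comparing the package versions seen by the client against those on the scheduler/workers. A small stdlib sketch of that kind of comparison (the version dicts are made-up examples, not the actual versions from this issue):

```python
def mismatched_packages(client_versions, worker_versions):
    """Return {package: (client_version, worker_version)} for packages
    present on both sides whose versions differ."""
    mismatches = {}
    for pkg, client_ver in client_versions.items():
        worker_ver = worker_versions.get(pkg)
        if worker_ver is not None and worker_ver != client_ver:
            mismatches[pkg] = (client_ver, worker_ver)
    return mismatches

# Made-up versions illustrating the skew you see when the ECS containers
# run an older image than the client:
client_side = {"dask": "2022.9.1", "distributed": "2022.9.1", "cudf": "22.08"}
worker_side = {"dask": "2022.1.0", "distributed": "2022.1.0", "cudf": "22.08"}

print(mismatched_packages(client_side, worker_side))
# -> {'dask': ('2022.9.1', '2022.1.0'), 'distributed': ('2022.9.1', '2022.1.0')}
```

When both sides run the same container image, this dict is empty and the warning disappears.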
You should be using the same Docker image that you are using locally, so if your local `docker run` uses `rapidsai/rapidsai-dev:22.08-cuda11.5-devel-ubuntu20.04-py3.9` then your cloudprovider call should use the same.
Please feel free to create a new issue if you run into more trouble!
Describe the bug: Following the instructions on https://rapids.ai/cloud.html#aws, section AWS CLUSTER VIA DASK, I get the "Mismatched versions found" error. Is this deployment approach outdated?
Steps/Code to reproduce bug:
Client node: `p3.2xlarge`, installed the RAPIDS framework using the latest docker command.
ECS cluster: 1 `p3.2xlarge` instance; all the steps are correct and the cluster is created successfully. However, running the following code shows the error:
Error:
Expected behavior: A clear and concise description of what you expected to happen.
Environment overview (please complete the following information): `docker pull` & `docker run` commands used
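The actual docker commands were elided in this thread; for reference, a typical RAPIDS container launch has this shape (the tag and port are illustrative — check the RAPIDS install selector for the current command, and reuse the same tag for the `ECSCluster` image option):

```
# Illustrative only -- substitute the tag your deployment actually uses
docker pull rapidsai/rapidsai:latest
docker run --gpus all -it -p 8888:8888 rapidsai/rapidsai:latest
```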