Closed: scottyhq closed this issue 2 years ago
Can you confirm whether you're using the default Python environment or the gpu-pytorch environment?
pytorch did just have a release yesterday, which might be causing issues.
> Can you confirm whether you're using the default Python environment or the gpu-pytorch environment?
JUPYTER_IMAGE_SPEC=pcccr.azurecr.io/public/planetary-computer/gpu-pytorch:2022.02.14.0
(notebook) jovyan@jupyter-scottyh-40uw-2eedu:~$ conda list | grep dask
dask 2021.11.2 pyhd8ed1ab_0 conda-forge
dask-core 2021.11.2 pyhd8ed1ab_0 conda-forge
dask-cuda 21.10.0a210813 pypi_0 pypi
dask-gateway 0.9.0 py38h578d9bd_2 conda-forge
dask-geopandas 0.1.0a5 pypi_0 pypi
dask-glm 0.2.0 py_1 conda-forge
dask-image 2021.12.0 pyhd8ed1ab_0 conda-forge
dask-kubernetes 2021.10.0 pyhd8ed1ab_0 conda-forge
dask-labextension 5.1.0 pyhd8ed1ab_1 conda-forge
dask-ml 2022.1.22 pyhd8ed1ab_0 conda-forge
pangeo-dask 2021.11.22 hd8ed1ab_0 conda-forge
(notebook) jovyan@jupyter-scottyh-40uw-2eedu:~$ conda list | grep torch
efficientnet-pytorch 0.6.3 pyh9f0ad1d_0 conda-forge
pytorch 1.10.2 cuda102py38h9fb240c_0 conda-forge
pytorch-gpu 1.10.2 cuda102py38hf05f184_0 conda-forge
pytorch-lightning 1.5.9 pyhd8ed1ab_0 conda-forge
segmentation-models-pytorch 0.2.1 pyhd8ed1ab_0 conda-forge
torchgeo 0.2.0 pyhd8ed1ab_0 conda-forge
torchmetrics 0.7.2 pyhd8ed1ab_0 conda-forge
torchvision 0.10.1 py38cuda102h1e64cea_0_cuda conda-forge
I ran into the same issue. I disabled the scatter, but later on I ran out of memory at dask.compute(*parts).
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 15.75 GiB total capacity; 11.81 GiB already allocated; 918.62 MiB free; 13.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
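For anyone hitting the same wall, the allocator hint at the end of that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation. A minimal sketch, where the 128 MiB split size is just an illustrative value and, on a Dask cluster, the variable would need to be set in the worker processes rather than only on the client:

import os

# Must be set before the first CUDA allocation in the process (simplest: before
# importing torch). The 128 MiB value is an arbitrary example, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

if torch.cuda.is_available():
    # Compare reserved vs. allocated memory to gauge fragmentation of the caching allocator.
    print(torch.cuda.memory_reserved(0), torch.cuda.memory_allocated(0))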
OK I had a chance to look at this today.
Most likely, something changed in how PyTorch serializes Tensors with pickle. I'm not sure whether that was intentional or not.
I'll update the notebook to load the model into the workers directly, rather than going through a client first.
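A minimal sketch of loading the model on the workers directly, where load_model, the checkpoint path, and the per-chunk call are assumptions rather than the notebook's actual code:

import dask
import torch

def load_model(checkpoint_path, device="cuda"):
    # Hypothetical helper: rebuild the model inside the worker process,
    # so no Tensor-bearing object has to be pickled on the client.
    model = torch.load(checkpoint_path, map_location=device)
    model.eval()
    return model

@dask.delayed
def predict_chunk(chunk, checkpoint_path):
    # Each task loads the model on whichever worker runs it, instead of
    # receiving it through client.scatter.
    model = load_model(checkpoint_path)
    with torch.no_grad():
        x = torch.as_tensor(chunk, device="cuda", dtype=torch.float32)
        return model(x).cpu().numpy()

# Usage sketch: "parts" stands in for the per-chunk inputs used in the notebook.
# predictions = dask.compute(*[predict_chunk(p, "model.pt") for p in parts])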
I also noticed that stac-vrt was missing from the environment. Not sure how that happened, but I'll need to update that too.
Thank you @TomAugspurger; yes, I had added a !pip install stac_vrt and forgot to mention it. I also worked around the memory issue by decreasing the size of the output image.
Glad to hear it.
The landcover.ipynb notebook example is amazing. Thanks @TomAugspurger for putting it together!
I'm fairly new to pytorch and GPUs and am encountering tracebacks in the default environment, perhaps related to version changes.
(abbreviated traceback):
Naively I tried
remote_model = client.scatter(model.cpu(), broadcast=True)
which runs (but would that not take advantage of the GPU?), but then I ran into the following with predictions[:, :200, :200].compute()
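For reference, one way a scattered CPU copy of the model can still use the GPU is to move it onto the worker's GPU inside the function that is mapped over the chunks; the predict signature and the map_blocks call below are assumptions about how the notebook applies the model, not its actual code:

import torch

def predict(chunk, model):
    # `model` is the scattered CPU copy; move it (and the data) onto the GPU
    # of whichever worker runs this block, so inference still runs on the GPU.
    model = model.cuda().eval()
    with torch.no_grad():
        x = torch.as_tensor(chunk, device="cuda", dtype=torch.float32)
        return model(x).cpu().numpy()

# Usage sketch: pass the scattered future so each worker receives the model once.
# predictions = data.map_blocks(predict, remote_model, dtype="float32")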