microsoft / PlanetaryComputerExamples

Examples of using the Planetary Computer
MIT License

Landcover.ipynb Exception: TypeError('Could not serialize object of type Tensor...') in #131

Closed: scottyhq closed this issue 2 years ago

scottyhq commented 2 years ago

The landcover.ipynb notebook example is amazing. Thanks @TomAugspurger for putting it together!

I'm fairly new to PyTorch and GPUs and am encountering tracebacks in the default environment, perhaps related to version changes, when running:

remote_model = client.scatter(model, broadcast=True)

(abbreviated traceback):

TypeError: Could not serialize object of type Tensor.
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 340, in serialize
    header, frames = dumps(x, context=context) if wants_context else dumps(x)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 43, in dask_dumps
    sub_header, frames = dumps(x)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/torch.py", line 19, in serialize_torch_Tensor
    sub_header, frames = serialize(t.numpy())
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Naively I tried remote_model = client.scatter(model.cpu(), broadcast=True), which runs (but wouldn't that forgo the GPU?), but then I run into the following with predictions[:, :200, :200].compute():

distributed.worker - WARNING - Compute Failed
Function:  execute_task
kwargs:    {}
Exception: "RuntimeError('Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same')"
TomAugspurger commented 2 years ago

Can you confirm whether you're using the default Python environment or the gpu-pytorch environment?

PyTorch just had a release yesterday, which might be causing issues.

scottyhq commented 2 years ago

Can you confirm whether you're using the default Python environment or the gpu-pytorch environment?

JUPYTER_IMAGE_SPEC=pcccr.azurecr.io/public/planetary-computer/gpu-pytorch:2022.02.14.0

(notebook) jovyan@jupyter-scottyh-40uw-2eedu:~$ conda list | grep dask
dask                      2021.11.2          pyhd8ed1ab_0    conda-forge
dask-core                 2021.11.2          pyhd8ed1ab_0    conda-forge
dask-cuda                 21.10.0a210813           pypi_0    pypi
dask-gateway              0.9.0            py38h578d9bd_2    conda-forge
dask-geopandas            0.1.0a5                  pypi_0    pypi
dask-glm                  0.2.0                      py_1    conda-forge
dask-image                2021.12.0          pyhd8ed1ab_0    conda-forge
dask-kubernetes           2021.10.0          pyhd8ed1ab_0    conda-forge
dask-labextension         5.1.0              pyhd8ed1ab_1    conda-forge
dask-ml                   2022.1.22          pyhd8ed1ab_0    conda-forge
pangeo-dask               2021.11.22           hd8ed1ab_0    conda-forge
(notebook) jovyan@jupyter-scottyh-40uw-2eedu:~$ conda list | grep torch
efficientnet-pytorch      0.6.3              pyh9f0ad1d_0    conda-forge
pytorch                   1.10.2          cuda102py38h9fb240c_0    conda-forge
pytorch-gpu               1.10.2          cuda102py38hf05f184_0    conda-forge
pytorch-lightning         1.5.9              pyhd8ed1ab_0    conda-forge
segmentation-models-pytorch 0.2.1              pyhd8ed1ab_0    conda-forge
torchgeo                  0.2.0              pyhd8ed1ab_0    conda-forge
torchmetrics              0.7.2              pyhd8ed1ab_0    conda-forge
torchvision               0.10.1          py38cuda102h1e64cea_0_cuda    conda-forge
mjigmond commented 2 years ago

I ran into the same issue. I disabled the scatter, but later on I ran out of memory at dask.compute(*parts):

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 15.75 GiB total capacity; 11.81 GiB already allocated; 918.62 MiB free; 13.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
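
For what it's worth, the allocator hint in that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized; the 128 MiB value below is only a guess, I haven't confirmed it avoids the OOM here:

import os

# Must be set before the first CUDA allocation (easiest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Optionally release cached blocks between large computations.
torch.cuda.empty_cache()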

TomAugspurger commented 2 years ago

OK I had a chance to look at this today.

Most likely, something changed in how PyTorch serializes Tensors with pickle. I'm not sure whether that was intentional.

I'll update the notebook to load the model into the workers directly, rather than going through a client first.
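
Roughly, the idea is for each task to build the model on the worker itself instead of pickling a CUDA model through the client; a minimal sketch of that pattern, with placeholder names (predict_chunk, model.pt) rather than the actual change going into the notebook:

import dask
import torch

@dask.delayed
def predict_chunk(chunk, model_path="model.pt"):
    # Load the model inside the task, on whatever device the worker has,
    # so the model object never needs to be serialized by the client.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.load(model_path, map_location=device)
    model.eval()
    with torch.no_grad():
        x = torch.as_tensor(chunk, device=device)
        return model(x).cpu().numpy()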

I also noticed that stac-vrt was missing from the environment. Not sure how that happened, but I'll need to update that too.

mjigmond commented 2 years ago

Thank you @TomAugspurger. Yes, I had added a !pip install stac_vrt and forgot to mention it. I also worked around the memory issue by decreasing the size of the output image.

TomAugspurger commented 2 years ago

Glad to hear it.

#135 fixed the serialization issue by updating the notebook. I'll get the images updated in https://github.com/microsoft/planetary-computer-containers, but that'll take a bit longer.