microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Broken tutorial: Land Cover Classification #142

Open m-cappi opened 1 year ago

m-cappi commented 1 year ago

Hi!

While working through the Land Cover Classification tutorial, I've hit a recurring "EOFError('Ran out of input')" when the "Aligning images" cell (cell 6) runs. This happens on the "GPU - PyTorch" VM profile, using the Run All command.

2022-12-01 18:32:48,396 - distributed.worker - WARNING - Compute Failed
Key:       load_model-7b38892c0930e4df99b829abc6110f2b
Function:  load_model
args:      ()
kwargs:    {}
Exception: "EOFError('Ran out of input')"

Prior to this cell there are no errors, aside from the following info log, which shows up in an error box after cell 1.

2022-12-01 18:50:27,341 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
TomAugspurger commented 1 year ago

Thanks for the report! I believe this is fixed now, but I'll confirm that when I have a chance to spin up a GPU instance later.

For reference, the storage container holding the model weights was flipped from public to private. The snippet

remote_model = client.submit(load_model)
print(remote_model)

submits that function to run on the worker, but doesn't actually wait for it to finish / verify that it doesn't error. We should probably check remote_model.result() to make sure there isn't an error.
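The fire-and-forget behavior described above can be illustrated with the standard library's `concurrent.futures`, whose API Dask's futures mirror: `submit` returns immediately and the worker-side exception only surfaces when `.result()` is called. This is a stdlib sketch, not Dask itself, and the `load_model` stand-in simulates the failed download.

```python
from concurrent.futures import ThreadPoolExecutor

def load_model():
    # Stand-in for the tutorial's load_model: simulate the failed download.
    raise EOFError("Ran out of input")

with ThreadPoolExecutor() as pool:
    future = pool.submit(load_model)  # returns immediately; no exception yet
    print(future)                     # the future prints fine despite the error
    caught = None
    try:
        future.result()               # only here does the worker error surface
    except EOFError as exc:
        caught = exc
```

The same is true of a Dask future: printing it or ignoring it hides the error, which is why the notebook appeared to succeed until a later cell needed the model.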

m-cappi commented 1 year ago

Hi Tom! Thanks for the feedback!

I just wanted to share that the issue persists. Upon checking remote_model.result() as you suggested, it throws the same EOFError('Ran out of input'), but the error now comes with a full traceback.

2022-12-02 15:17:46,804 - distributed.worker - WARNING - Compute Failed
Key:       load_model-ef8130bf370f8d62e6938783f914325c
Function:  load_model
args:      ()
kwargs:    {}
Exception: "EOFError('Ran out of input')"

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In [3], line 2
      1 remote_model = client.submit(load_model)
----> 2 res = remote_model.result()
      3 print(remote_model)
      4 print(res)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:279, in Future.result(self, timeout)
    277 if self.status == "error":
    278     typ, exc, tb = result
--> 279     raise exc.with_traceback(tb)
    280 elif self.status == "cancelled":
    281     raise result

Cell In [2], line 31, in load_model()
     21         f.write(blob_client.download_blob().readall())
     23 model = segmentation_models_pytorch.Unet(
     24     encoder_name="resnet18",
     25     encoder_depth=3,
   (...)
     29     classes=13,
     30 )
---> 31 model.load_state_dict(torch.load("unet_both_lc.pt", map_location="cuda:0"))
     33 device = torch.device("cuda")
     34 model = model.to(device)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:713, in load()
    711             return torch.jit.load(opened_file)
    712         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:920, in _legacy_load()
    914 if not hasattr(f, 'readinto') and (3, 8, 0) <= sys.version_info < (3, 8, 2):
    915     raise RuntimeError(
    916         "torch.load does not work with file-like objects that do not implement readinto on Python 3.8.0 and 3.8.1. "
    917         f"Received object of type \"{type(f)}\". Please update to Python 3.8.2 or newer to restore this "
    918         "functionality.")
--> 920 magic_number = pickle_module.load(f, **pickle_load_args)
    921 if magic_number != MAGIC_NUMBER:
    922     raise RuntimeError("Invalid magic number; corrupt file?")

EOFError: Ran out of input
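An `EOFError: Ran out of input` from `pickle_module.load` at this point in `torch.load` almost always means the file being unpickled is empty, which is consistent with the blob download from the now-private container writing zero bytes. This can be reproduced with the stdlib alone (no torch or Azure involved); the temp file here is just a stand-in for `unet_both_lc.pt`.

```python
import os
import pickle
import tempfile

# A zero-byte file - like a failed or unauthorized blob download -
# reproduces the exact error, because unpickling begins by reading
# a magic number that isn't there.
with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
    path = f.name

msg = None
try:
    with open(path, "rb") as fh:
        pickle.load(fh)
except EOFError as exc:
    msg = str(exc)

print(msg)  # Ran out of input
assert os.path.getsize(path) == 0  # the file really is empty
os.unlink(path)
```

A cheap guard inside `load_model` would be to check `os.path.getsize(...)` on the downloaded weights before calling `torch.load`, so an empty download fails with a clearer message.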
TomAugspurger commented 1 year ago

@m-cappi thanks for following up. Can you confirm that you're using the "GPU - PyTorch" profile on the Hub? I think you are, otherwise the torch import would have failed...

My earlier suggestion to use remote_model.result() wasn't quite right, since that tries to ship the model from the worker process to the client process, which might fail if something isn't serializable with pickle. It's better to just check that the future is finished.

import dask

remote_model = client.submit(load_model)
dask.distributed.wait(remote_model)
assert remote_model.status == "finished"

When I do that, I am able to run the notebook successfully.
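For diagnosing a failed future without transferring the (possibly unpicklable) model back, Dask futures also expose `future.exception()` and `future.traceback()` on the client side. The stdlib analogue of the wait-then-inspect pattern above looks like this; `load_model` is again a stand-in for the failing worker function.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def load_model():
    # Stand-in for the tutorial's load_model, which fails on the worker.
    raise EOFError("Ran out of input")

with ThreadPoolExecutor() as pool:
    future = pool.submit(load_model)
    wait([future])                     # block until the future settles
    err = future.exception()           # inspect without re-raising
    status = "finished" if err is None else "error"
    print(status, err)
```

With Dask, `remote_model.exception()` after `dask.distributed.wait(remote_model)` gives the worker-side exception directly, which is more informative than a bare `AssertionError` on the status check.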

m-cappi commented 1 year ago

Yes, I can confirm I'm using the "GPU - PyTorch" profile.

I've tried your assertion, but it keeps coming back to the same error:

import dask

remote_model = client.submit(load_model)
print(remote_model)
dask.distributed.wait(remote_model)
print(remote_model)
print(remote_model.status)
assert remote_model.status == "finished"

# Which prints out:
<Future: pending, key: load_model-9ecf5d1788c7271c00f47b87328ef71f>
---------------------------------------------------------------------------
2022-12-02 18:21:14,703 - distributed.worker - WARNING - Compute Failed
Key:       load_model-9ecf5d1788c7271c00f47b87328ef71f
Function:  load_model
args:      ()
kwargs:    {}
Exception: "EOFError('Ran out of input')"
---------------------------------------------------------------------------
<Future: error, key: load_model-9ecf5d1788c7271c00f47b87328ef71f>
---------------------------------------------------------------------------
error
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In [3], line 8
      6 print(remote_model)
      7 print(remote_model.status)
----> 8 assert remote_model.status == "finished"

AssertionError: 

And with the assertion commented out, the notebook continues (with errors), but the original error from cell 6 persists; the run eventually crashes on cell 13, where the compute call invokes the model on the workers via a future's result():

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In [13], line 1
----> 1 predictions[:, :200, :200].compute()

File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataarray.py:993, in DataArray.compute(self, **kwargs)
    974 """Manually trigger loading of this array's data from disk or a
    975 remote source into memory and return a new array. The original is
    976 left unaltered.
   (...)
    990 dask.compute
    991 """
    992 new = self.copy(deep=False)
--> 993 return new.load(**kwargs)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataarray.py:967, in DataArray.load(self, **kwargs)
    949 def load(self: T_DataArray, **kwargs) -> T_DataArray:
    950     """Manually trigger loading of this array's data from disk or a
    951     remote source into memory and return this array.
    952 
   (...)
    965     dask.compute
    966     """
--> 967     ds = self._to_temp_dataset().load(**kwargs)
    968     new = self._from_temp_dataset(ds)
    969     self._variable = new._variable

File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataset.py:733, in Dataset.load(self, **kwargs)
    730 import dask.array as da
    732 # evaluate all the dask arrays simultaneously
--> 733 evaluated_data = da.compute(*lazy_data.values(), **kwargs)
    735 for k, data in zip(lazy_data, evaluated_data):
    736     self.variables[k].data = data

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/base.py:575, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    572     keys.append(x.__dask_keys__())
    573     postcomputes.append(x.__dask_postcompute__())
--> 575 results = schedule(dsk, keys, **kwargs)
    576 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:3015, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   3013         should_rejoin = False
   3014 try:
-> 3015     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3016 finally:
   3017     for f in futures.values():

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:2167, in Client.gather(self, futures, errors, direct, asynchronous)
   2165 else:
   2166     local_worker = None
-> 2167 return self.sync(
   2168     self._gather,
   2169     futures,
   2170     errors=errors,
   2171     direct=direct,
   2172     local_worker=local_worker,
   2173     asynchronous=asynchronous,
   2174 )

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py:309, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    307     return future
    308 else:
--> 309     return sync(
    310         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    311     )

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py:376, in sync(loop, func, callback_timeout, *args, **kwargs)
    374 if error:
    375     typ, exc, tb = error
--> 376     raise exc.with_traceback(tb)
    377 else:
    378     return result

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py:349, in sync.<locals>.f()
    347         future = asyncio.wait_for(future, callback_timeout)
    348     future = asyncio.ensure_future(future)
--> 349     result = yield future
    350 except Exception:
    351     error = sys.exc_info()

File /srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:2030, in Client._gather(self, futures, errors, direct, local_worker)
   2028         exc = CancelledError(key)
   2029     else:
-> 2030         raise exception.with_traceback(traceback)
   2031     raise exc
   2032 if errors == "skip":

Cell In [2], line 31, in load_model()
     21         f.write(blob_client.download_blob().readall())
     23 model = segmentation_models_pytorch.Unet(
     24     encoder_name="resnet18",
     25     encoder_depth=3,
   (...)
     29     classes=13,
     30 )
---> 31 model.load_state_dict(torch.load("unet_both_lc.pt", map_location="cuda:0"))
     33 device = torch.device("cuda")
     34 model = model.to(device)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:713, in load()
    711             return torch.jit.load(opened_file)
    712         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:920, in _legacy_load()
    914 if not hasattr(f, 'readinto') and (3, 8, 0) <= sys.version_info < (3, 8, 2):
    915     raise RuntimeError(
    916         "torch.load does not work with file-like objects that do not implement readinto on Python 3.8.0 and 3.8.1. "
    917         f"Received object of type \"{type(f)}\". Please update to Python 3.8.2 or newer to restore this "
    918         "functionality.")
--> 920 magic_number = pickle_module.load(f, **pickle_load_args)
    921 if magic_number != MAGIC_NUMBER:
    922     raise RuntimeError("Invalid magic number; corrupt file?")

EOFError: Ran out of input
heksam commented 5 months ago

I'm trying the same tutorial, but can't figure out how to install dask: [screenshot]