Closed m-cappi closed 2 weeks ago
Thanks for the report! I believe this is fixed now, but I'll confirm that when I have a chance to spin up a GPU instance later.
For reference, the storage container holding the model weights was flipped from public to private. The snippet
remote_model = client.submit(load_model)
print(remote_model)
submits that function to run on the worker, but doesn't actually wait for it to finish or verify that it doesn't error. We should probably check remote_model.result() to make sure there isn't an error.
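A minimal sketch of that pattern, using the stdlib concurrent.futures API (which Dask's Future deliberately mirrors); the load_model body here is a stand-in that simulates the worker-side failure, not the tutorial's real loader:

```python
from concurrent.futures import ThreadPoolExecutor

def load_model():
    # Stand-in for the real loader; raising here simulates the worker-side failure.
    raise EOFError("Ran out of input")

with ThreadPoolExecutor() as client:
    remote_model = client.submit(load_model)
    try:
        # .result() blocks until the task finishes and re-raises any
        # exception that was raised on the worker.
        remote_model.result()
    except EOFError as exc:
        print(f"load_model failed: {exc!r}")
```

With Dask, the same try/except around `remote_model.result()` surfaces the worker-side exception on the client.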
Hi Tom! Thanks for the feedback!
I just wanted to share that the issue persists. Upon trying to check remote_model.result() as you suggested, it throws EOFError('Ran out of input'). The error does provide a traceback now.
2022-12-02 15:17:46,804 - distributed.worker - WARNING - Compute Failed
Key: load_model-ef8130bf370f8d62e6938783f914325c
Function: load_model
args: ()
kwargs: {}
Exception: "EOFError('Ran out of input')"
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
Cell In [3], line 2
1 remote_model = client.submit(load_model)
----> 2 res = remote_model.result()
3 print(remote_model)
4 print(res)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:279, in Future.result(self, timeout)
277 if self.status == "error":
278 typ, exc, tb = result
--> 279 raise exc.with_traceback(tb)
280 elif self.status == "cancelled":
281 raise result
Cell In [2], line 31, in load_model()
21 f.write(blob_client.download_blob().readall())
23 model = segmentation_models_pytorch.Unet(
24 encoder_name="resnet18",
25 encoder_depth=3,
(...)
29 classes=13,
30 )
---> 31 model.load_state_dict(torch.load("unet_both_lc.pt", map_location="cuda:0"))
33 device = torch.device("cuda")
34 model = model.to(device)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:713, in load()
711 return torch.jit.load(opened_file)
712 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:920, in _legacy_load()
914 if not hasattr(f, 'readinto') and (3, 8, 0) <= sys.version_info < (3, 8, 2):
915 raise RuntimeError(
916 "torch.load does not work with file-like objects that do not implement readinto on Python 3.8.0 and 3.8.1. "
917 f"Received object of type \"{type(f)}\". Please update to Python 3.8.2 or newer to restore this "
918 "functionality.")
--> 920 magic_number = pickle_module.load(f, **pickle_load_args)
921 if magic_number != MAGIC_NUMBER:
922 raise RuntimeError("Invalid magic number; corrupt file?")
EOFError: Ran out of input
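One plausible explanation for this EOFError, given that the container was flipped from public to private: an unauthorized blob download can fail silently or return nothing, leaving an empty or truncated unet_both_lc.pt on disk, and torch.load then runs out of input on the first pickle read. A sketch of a guard that would make that failure explicit (check_weights_file is an illustrative helper, not part of the tutorial):

```python
import os

def check_weights_file(path):
    """Fail loudly if a download produced a missing or empty weights file."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} was never written")
    size = os.path.getsize(path)
    if size == 0:
        raise RuntimeError(
            f"{path} is empty -- the blob download likely failed "
            "(e.g. the container is private and the request was unauthorized)"
        )
    return size

# Inside load_model, after writing the blob to disk, something like:
# check_weights_file("unet_both_lc.pt")
# model.load_state_dict(torch.load("unet_both_lc.pt", map_location="cuda:0"))
```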
@m-cappi thanks for following up. Can you confirm that you're using the "GPU - PyTorch" profile on the Hub? I think you are, otherwise the torch import would have failed...
My earlier suggestion to use remote_model.result() wasn't quite right, since that tries to ship the model from the worker process to the client process, which might fail if something isn't serializable with pickle. It's better to just check that the future is finished.
from dask.distributed import wait
remote_model = client.submit(load_model)
wait(remote_model)
assert remote_model.status == "finished"
When I do that, I am able to run the notebook successfully.
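If the future has errored, you can also inspect the failure without shipping the (possibly unpicklable) result back to the client: both Dask's distributed.Future and the stdlib concurrent.futures.Future expose an exception() method (Dask additionally has traceback()). A sketch with the stdlib API, again with a stand-in load_model that simulates the worker failure:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def load_model():
    # Stand-in that simulates the worker-side failure.
    raise EOFError("Ran out of input")

with ThreadPoolExecutor() as client:
    remote_model = client.submit(load_model)
    wait([remote_model])             # block until finished or errored
    err = remote_model.exception()   # None on success, the exception otherwise
    if err is not None:
        print(f"worker raised: {err!r}")
```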
Yes, I can confirm I'm using the "GPU - PyTorch" profile.
I've tried your assertion, but it keeps circling back to the same error:
from dask.distributed import wait
remote_model = client.submit(load_model)
print(remote_model)
wait(remote_model)
print(remote_model)
print(remote_model.status)
assert remote_model.status == "finished"
# Which prints out:
<Future: pending, key: load_model-9ecf5d1788c7271c00f47b87328ef71f>
---------------------------------------------------------------------------
2022-12-02 18:21:14,703 - distributed.worker - WARNING - Compute Failed
Key: load_model-9ecf5d1788c7271c00f47b87328ef71f
Function: load_model
args: ()
kwargs: {}
Exception: "EOFError('Ran out of input')"
---------------------------------------------------------------------------
<Future: error, key: load_model-9ecf5d1788c7271c00f47b87328ef71f>
---------------------------------------------------------------------------
error
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In [3], line 8
6 print(remote_model)
7 print(remote_model.status)
----> 8 assert remote_model.status == "finished"
AssertionError:
And commenting out the assertion allows the notebook to continue (with errors), but the original error from cell 6 persists and it eventually crashes at cell 13, where the compute calls the model with a future.result():
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
Cell In [13], line 1
----> 1 predictions[:, :200, :200].compute()
File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataarray.py:993, in DataArray.compute(self, **kwargs)
974 """Manually trigger loading of this array's data from disk or a
975 remote source into memory and return a new array. The original is
976 left unaltered.
(...)
990 dask.compute
991 """
992 new = self.copy(deep=False)
--> 993 return new.load(**kwargs)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataarray.py:967, in DataArray.load(self, **kwargs)
949 def load(self: T_DataArray, **kwargs) -> T_DataArray:
950 """Manually trigger loading of this array's data from disk or a
951 remote source into memory and return this array.
952
(...)
965 dask.compute
966 """
--> 967 ds = self._to_temp_dataset().load(**kwargs)
968 new = self._from_temp_dataset(ds)
969 self._variable = new._variable
File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataset.py:733, in Dataset.load(self, **kwargs)
730 import dask.array as da
732 # evaluate all the dask arrays simultaneously
--> 733 evaluated_data = da.compute(*lazy_data.values(), **kwargs)
735 for k, data in zip(lazy_data, evaluated_data):
736 self.variables[k].data = data
File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/base.py:575, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
572 keys.append(x.__dask_keys__())
573 postcomputes.append(x.__dask_postcompute__())
--> 575 results = schedule(dsk, keys, **kwargs)
576 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:3015, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
3013 should_rejoin = False
3014 try:
-> 3015 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
3016 finally:
3017 for f in futures.values():
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:2167, in Client.gather(self, futures, errors, direct, asynchronous)
2165 else:
2166 local_worker = None
-> 2167 return self.sync(
2168 self._gather,
2169 futures,
2170 errors=errors,
2171 direct=direct,
2172 local_worker=local_worker,
2173 asynchronous=asynchronous,
2174 )
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py:309, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
307 return future
308 else:
--> 309 return sync(
310 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
311 )
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py:376, in sync(loop, func, callback_timeout, *args, **kwargs)
374 if error:
375 typ, exc, tb = error
--> 376 raise exc.with_traceback(tb)
377 else:
378 return result
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py:349, in sync.<locals>.f()
347 future = asyncio.wait_for(future, callback_timeout)
348 future = asyncio.ensure_future(future)
--> 349 result = yield future
350 except Exception:
351 error = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
759 exc_info = None
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:2030, in Client._gather(self, futures, errors, direct, local_worker)
2028 exc = CancelledError(key)
2029 else:
-> 2030 raise exception.with_traceback(traceback)
2031 raise exc
2032 if errors == "skip":
Cell In [2], line 31, in load_model()
21 f.write(blob_client.download_blob().readall())
23 model = segmentation_models_pytorch.Unet(
24 encoder_name="resnet18",
25 encoder_depth=3,
(...)
29 classes=13,
30 )
---> 31 model.load_state_dict(torch.load("unet_both_lc.pt", map_location="cuda:0"))
33 device = torch.device("cuda")
34 model = model.to(device)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:713, in load()
711 return torch.jit.load(opened_file)
712 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/torch/serialization.py:920, in _legacy_load()
914 if not hasattr(f, 'readinto') and (3, 8, 0) <= sys.version_info < (3, 8, 2):
915 raise RuntimeError(
916 "torch.load does not work with file-like objects that do not implement readinto on Python 3.8.0 and 3.8.1. "
917 f"Received object of type \"{type(f)}\". Please update to Python 3.8.2 or newer to restore this "
918 "functionality.")
--> 920 magic_number = pickle_module.load(f, **pickle_load_args)
921 if magic_number != MAGIC_NUMBER:
922 raise RuntimeError("Invalid magic number; corrupt file?")
EOFError: Ran out of input
I'm trying the same tutorial, but can't figure out how to install dask:
The hub is no longer available.
Hi!
While working with the Land Cover Classification tutorial, I've found a recurring EOFError('Ran out of input') when running the "Aligning images" block (cell 6). This happens while running in the "GPU - PyTorch" VM profile with the "run all" command. Prior to this block there are no errors, aside from the following info log that shows up in an error box after block 1.