qubvel-org / segmentation_models.pytorch

Semantic segmentation models with 500+ pretrained convolutional and transformer-based backbones.
https://smp.readthedocs.io/
MIT License
9.76k stars 1.68k forks

Error running binary_segmentation_intro.ipynb #633

Closed robmarkcole closed 2 years ago

robmarkcole commented 2 years ago

On SageMaker Studio (ml.g4dn.xlarge instance, PyTorch 1.10 kernel), the notebook raises an error at trainer.fit:

/opt/conda/lib/python3.8/site-packages/torch/_utils.py in reraise(self)
    432             # instantiate since we don't know how to
    433             raise RuntimeError(msg) from None
--> 434         raise exception
    435 
    436 

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 295, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/auto_restart.py", line 474, in _capture_metadata_collate
    data = default_collate(samples)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 64, in default_collate
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 54, in default_collate
    storage = elem.storage()._new_shared(numel)
  File "/opt/conda/lib/python3.8/site-packages/torch/storage.py", line 157, in _new_shared
    return cls._new_using_fd(size)
RuntimeError: falseINTERNAL ASSERT FAILED at "/codebuild/output/src741569495/src/aten/src/ATen/MapAllocator.cpp":300, please report a bug to PyTorch. unable to write to file </torch_1110_0>

robmarkcole commented 2 years ago

This appears to be https://github.com/pytorch/pytorch/issues/68501
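
For anyone hitting the same traceback: the failure happens while a DataLoader worker allocates shared memory for a collated batch (`_new_shared` in the trace above). Below is a minimal sketch of one possible workaround, assuming the cause is limited shared memory on the instance, which this thread does not confirm. The helper `pick_num_workers`, the `/dev/shm` path, and the 1 GiB threshold are hypothetical choices for illustration, not part of the notebook or of PyTorch.

```python
import shutil


def pick_num_workers(requested, shm_path="/dev/shm", min_free_bytes=1 << 30):
    """Return `requested` if shared memory looks roomy enough, otherwise 0.

    Hypothetical helper: the path and the threshold are assumptions.
    With num_workers=0 the DataLoader collates batches in the main process,
    so the worker-side shared-memory allocation is not used.
    """
    try:
        free = shutil.disk_usage(shm_path).free
    except OSError:
        # Path missing or unreadable: fall back to single-process loading.
        return 0
    return requested if free >= min_free_bytes else 0
```

The result could then be passed wherever the DataLoader objects are built, for example `DataLoader(dataset, batch_size=16, num_workers=pick_num_workers(4))`. Single-process loading is slower, so this is a way to get the notebook running rather than a fix for the underlying PyTorch issue.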

I am now forcing an upgrade from 1.10.2+cu113, but there appears to be a conflict:

HorovodVersionMismatchError: Framework pytorch installed with version 1.10.2+cu113 but found version 1.12.1+cu102.
             This can result in unexpected behavior including runtime errors.
             Reinstall Horovod using `pip install --no-cache-dir` to build with the new version.

I still get this Horovod error even with the following install recipe:

!pip install torch --upgrade
!pip install segmentation-models-pytorch
!pip install pytorch-lightning==1.5.4
!pip install --no-cache-dir horovod[pytorch]
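
To see which versions the kernel actually has after running a recipe like this, a small check along the following lines may help. It is only a sketch: `installed_versions` is a hypothetical helper, the package names are taken from the pip commands above, and it reports what is installed, not which PyTorch version Horovod was compiled against.

```python
from importlib import metadata


def installed_versions(packages=("torch", "horovod", "pytorch-lightning")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```

If the reported torch version differs from the one named in the Horovod error, restarting the kernel before rebuilding Horovod is worth trying, since an already imported torch stays loaded for the life of the kernel.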

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 7 days with no activity.