pytorch / tnt

A lightweight library for PyTorch training tools and utilities
https://pytorch.org/tnt/

_validate_snapshot_available() failing although torchsnapshot is available #876

Open nubertj opened 2 months ago

nubertj commented 2 months ago

🐛 Describe the bug

When running my code with torchtnt and the TorchSnapshotSaver (torchsnapshot_saver.py), I get the following error after construction of the class:

RuntimeError: TorchSnapshotSaver support requires torchsnapshot. Please make sure ``torchsnapshot`` is installed. Installation: https://github.com/pytorch/torchsnapshot#install

This error is raised by _validate_snapshot_available() in torchsnapshot_saver.py. However, torchsnapshot itself can be imported without error.

Versions

I tried installing torchsnapshot and torchtnt from conda, from PyPI, and directly from the GitHub repos. I get the same error every time.
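
For anyone reproducing this, it may help to record which distributions are actually installed in the active environment. Below is a minimal sketch using only the standard library. The helper name `installed_versions` and the list of distribution names are assumptions for illustration; adjust the names to whatever you installed.

```python
from importlib import metadata

# Distribution names are assumptions; edit this tuple to match your install.
CANDIDATES = (
    "torchtnt",
    "torchtnt-nightly",
    "torchsnapshot",
    "torchsnapshot-nightly",
)


def installed_versions(names=CANDIDATES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into the report makes it easier to tell a release build from a nightly or source install.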

elrnv commented 3 weeks ago

I also ran into this. It seems that torchsnapshot_saver.py imports override_max_per_rank_io_concurrency from torchsnapshot.knobs, which is only available on the main branch and not in the 0.1.0 release. Perhaps the simplest solution is to release another version of torchsnapshot and constrain torchtnt to depend on that.

Edit: In the short term, installing torchsnapshot with pip install --pre torchsnapshot-nightly worked for me.
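
To check whether an environment is affected, a small sketch like the following can separate "the module cannot be imported" from "the module imports but lacks the symbol". The helper `check_symbol` is hypothetical and not part of torchtnt or torchsnapshot. The module and attribute names passed to it are the ones mentioned in this thread.

```python
import importlib


def check_symbol(module_name, attr):
    """Report whether `attr` can be found on the importable module `module_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # The module (or one of its own dependencies) is not importable.
        return f"cannot import {module_name}: {exc}"
    if not hasattr(module, attr):
        # The package is installed, but this build does not define the symbol.
        return f"{module_name} imports, but has no attribute {attr!r}"
    return "ok"


if __name__ == "__main__":
    # Names taken from the discussion above; the result depends on the
    # torchsnapshot build installed in your environment.
    print(check_symbol("torchsnapshot.knobs", "override_max_per_rank_io_concurrency"))
```

If this reports a missing attribute while `import torchsnapshot` works, that would be consistent with the explanation above: the package is present, but the installed build predates the symbol that torchsnapshot_saver.py imports.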