Open fabiogeraci opened 3 days ago
Hey @fabiogeraci - taking a look at this. Are you using a distributed or single device recipe?
Single at the moment
Got it! It looks like you might be hitting an import error, which causes the TensorBoardLogger to shut down incorrectly because it never gets to set up the actual writer (self._writer) on the class. The same thing happened in our WandBLogger, see #1322.
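To illustrate the failure mode described above, here is a minimal sketch. `SketchLogger` and its `fail_import` flag are hypothetical stand-ins, not torchtune code; the idea is only that `__del__` still runs on a partially constructed object, so cleanup should not assume `_writer` exists.

```python
class SketchLogger:
    """Hypothetical logger, used only to illustrate a defensive __del__."""

    def __init__(self, fail_import: bool = False):
        # Define the attribute before doing anything that can raise.
        self._writer = None
        if fail_import:
            # Stands in for a backend import that fails inside __init__.
            raise ImportError("simulated backend import failure")
        self._writer = object()  # placeholder for a real writer object

    def close(self) -> None:
        # getattr with a default also covers objects whose __init__
        # never reached the attribute assignment at all.
        writer = getattr(self, "_writer", None)
        if writer is not None:
            self._writer = None

    def __del__(self):
        # Runs even when __init__ raised, so it must not assume state.
        self.close()


if __name__ == "__main__":
    try:
        SketchLogger(fail_import=True)
    except ImportError as err:
        print(f"only the original error surfaces: {err}")
```

With this shape, the original import error is the only one reported, instead of being followed by a second AttributeError from `__del__`.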
I've created a PR that might fix this (#2092); however, there's a good chance that some earlier error will help you resolve this faster. Perhaps look into AttributeError: module 'tensorflow' has no attribute 'io'? When I Google that, I get taken to this StackOverflow page that has some suggestions.
In theory it should use from torch.utils.tensorboard import SummaryWriter. Could it be an issue with the torch version?
I have the following dependency versions:
(test) [jrcummings@devvm4767.pnb0 ~/projects]$ pip list
Package Version
------------------------ ----------
absl-py 2.1.0
filelock 3.16.1
fsspec 2024.10.0
grpcio 1.68.0
Jinja2 3.1.4
Markdown 3.7
MarkupSafe 3.0.2
mpmath 1.3.0
networkx 3.4.2
numpy 2.1.3
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
packaging 24.2
pip 24.2
protobuf 5.29.0
setuptools 75.1.0
six 1.16.0
sympy 1.13.1
tensorboard 2.18.0
tensorboard-data-server 0.7.2
torch 2.5.1
triton 3.1.0
typing_extensions 4.12.2
Werkzeug 3.1.3
wheel 0.44.0
And I am able to run the following with no problems:
(test) [jrcummings@devvm4767.pnb0 ~/projects]$ python
Python 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.tensorboard import SummaryWriter
>>>
Maybe a conflicting dependency with tensorflow as pointed out in the above StackOverflow post?
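One way to check for that kind of conflict is to ask Python what it would actually load for the name `tensorflow`. The helper below is a sketch using only the standard library; `describe_module` is a name made up for this example, and whether a stray `tensorflow` module is really the cause here is only a guess.

```python
import importlib.util


def describe_module(name: str) -> dict:
    """Report whether a top-level module is importable and where it lives."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # find_spec can raise for broken or half-initialised modules.
        spec = None
    if spec is None:
        return {"name": name, "found": False, "origin": None, "locations": []}
    return {
        "name": name,
        "found": True,
        # File the module would be loaded from, if it has one.
        "origin": spec.origin,
        # Package directories, if the module is a package.
        "locations": list(spec.submodule_search_locations or []),
    }


if __name__ == "__main__":
    for mod in ("tensorflow", "tensorboard", "torch"):
        print(describe_module(mod))
```

If `tensorflow` is reported as found even though it is not in the package list, the `origin` and `locations` fields show which file or directory is shadowing it. A found module with `origin` set to None is typically a namespace package, for example a bare directory named `tensorflow` on the import path.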
Here are mine:
poetry show
absl-py 2.1.0 Abseil Python Common Libraries, see https://github.com/abseil/abseil-py.
aiohappyeyeballs 2.4.3 Happy Eyeballs for asyncio
aiohttp 3.11.8 Async http client/server framework (asyncio)
aiosignal 1.3.1 aiosignal: a list of registered asynchronous callbacks
alembic 1.14.0 A database migration tool for SQLAlchemy.
antlr4-python3-runtime 4.9.3 ANTLR 4.9.3 runtime for Python 3.7
attrs 24.2.0 Classes Without Boilerplate
bitsandbytes 0.44.1 k-bit optimizers and matrix multiplication routines.
blinker 1.9.0 Fast, simple object-to-object and broadcast signaling
blobfile 3.0.0 Read GCS, ABS and local paths with the same interface, clone of tensorflow.io.gfile
cachetools 5.5.0 Extensible memoizing collections and decorators
certifi 2024.8.30 Python package for providing Mozilla's CA Bundle.
charset-normalizer 3.4.0 The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
click 8.1.7 Composable command line interface toolkit
cloudpickle 3.1.0 Pickler class to extend the standard pickle.Pickler functionality
contourpy 1.3.1 Python library for calculating contours of 2D quadrilateral grids
cycler 0.12.1 Composable style cycles
databricks-sdk 0.38.0 Databricks SDK for Python (Beta)
datasets 3.1.0 HuggingFace community-driven open-source library of datasets
deprecated 1.2.15 Python @deprecated decorator to deprecate old python classes, functions or methods.
dill 0.3.8 serialize all of Python
docker 7.1.0 A Python library for the Docker Engine API.
filelock 3.16.1 A platform independent file lock.
flask 3.1.0 A simple framework for building complex web applications.
fonttools 4.55.0 Tools to manipulate font files
frozenlist 1.5.0 A list-like structure which implements collections.abc.MutableSequence
fsspec 2024.9.0 File-system specification
gitdb 4.0.11 Git Object Database
gitpython 3.1.43 GitPython is a Python library used to interact with Git repositories
google-auth 2.36.0 Google Authentication Library
graphene 3.4.3 GraphQL Framework for Python
graphql-core 3.2.5 GraphQL implementation for Python, a port of GraphQL.js, the JavaScript reference implementation for GraphQL.
graphql-relay 3.2.0 Relay library for graphql-core
greenlet 3.1.1 Lightweight in-process concurrent programming
grpcio 1.68.0 HTTP/2-based RPC framework
gunicorn 23.0.0 WSGI HTTP Server for UNIX
huggingface-hub 0.26.3 Client library to download and publish models, datasets and other repos on the huggingface.co hub
idna 3.10 Internationalized Domain Names in Applications (IDNA)
importlib-metadata 8.5.0 Read metadata from Python packages
iniconfig 2.0.0 brain-dead simple config-ini parsing
itsdangerous 2.2.0 Safely pass data to untrusted environments and back.
jinja2 3.1.4 A very fast and expressive template engine.
joblib 1.4.2 Lightweight pipelining with Python functions
kiwisolver 1.4.7 A fast implementation of the Cassowary constraint solver
loguru 0.7.2 Python logging made (stupidly) simple
lxml 5.3.0 Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.
mako 1.3.6 A super-fast templating language that borrows the best ideas from the existing templating languages.
markdown 3.7 Python implementation of John Gruber's Markdown.
markupsafe 3.0.2 Safely add untrusted strings to HTML/XML markup.
matplotlib 3.9.2 Python plotting package
mlflow 2.18.0 MLflow is an open source platform for the complete machine learning lifecycle
mlflow-skinny 2.18.0 MLflow is an open source platform for the complete machine learning lifecycle
mpmath 1.3.0 Python library for arbitrary-precision floating-point arithmetic
multidict 6.1.0 multidict implementation
multiprocess 0.70.16 better multiprocessing and multithreading in Python
networkx 3.4.2 Python package for creating and manipulating graphs and networks
numpy 2.1.3 Fundamental package for array computing in Python
nvidia-cublas-cu12 12.1.3.1 CUBLAS native runtime libraries
nvidia-cuda-cupti-cu12 12.1.105 CUDA profiling tools runtime libs.
nvidia-cuda-nvrtc-cu12 12.1.105 NVRTC native runtime libraries
nvidia-cuda-runtime-cu12 12.1.105 CUDA Runtime native Libraries
nvidia-cudnn-cu12 9.1.0.70 cuDNN runtime libraries
nvidia-cufft-cu12 11.0.2.54 CUFFT native runtime libraries
nvidia-curand-cu12 10.3.2.106 CURAND native runtime libraries
nvidia-cusolver-cu12 11.4.5.107 CUDA solver native runtime libraries
nvidia-cusparse-cu12 12.1.0.106 CUSPARSE native runtime libraries
nvidia-nccl-cu12 2.21.5 NVIDIA Collective Communication Library (NCCL) Runtime
nvidia-nvjitlink-cu12 12.6.85 Nvidia JIT LTO Library
nvidia-nvtx-cu12 12.1.105 NVIDIA Tools Extension
omegaconf 2.3.0 A flexible configuration library
opentelemetry-api 1.28.2 OpenTelemetry Python API
opentelemetry-sdk 1.28.2 OpenTelemetry Python SDK
opentelemetry-semantic-conventions 0.49b2 OpenTelemetry Semantic Conventions
packaging 24.2 Core utilities for Python packages
pandas 2.2.3 Powerful data structures for data analysis, time series, and statistics
pillow 11.0.0 Python Imaging Library (Fork)
pluggy 1.5.0 plugin and hook calling mechanisms for python
propcache 0.2.0 Accelerated property cache
protobuf 5.29.0
psutil 6.1.0 Cross-platform lib for process and system monitoring in Python.
pyarrow 18.1.0 Python library for Apache Arrow
pyasn1 0.6.1 Pure-Python implementation of ASN.1 types and DER/BER/CER codecs (X.208)
pyasn1-modules 0.4.1 A collection of ASN.1-based protocols modules
pycryptodomex 3.21.0 Cryptographic library for Python
pyparsing 3.2.0 pyparsing module - Classes and methods to define and execute parsing grammars
pytest 8.3.3 pytest: simple powerful testing with Python
python-dateutil 2.9.0.post0 Extensions to the standard Python datetime module
python-dotenv 1.0.1 Read key-value pairs from a .env file and set them as environment variables
pytz 2024.2 World timezone definitions, modern and historical
pyyaml 6.0.2 YAML parser and emitter for Python
regex 2024.11.6 Alternative regular expression module, to replace re.
requests 2.32.3 Python HTTP for Humans.
rsa 4.9 Pure-Python RSA implementation
safetensors 0.4.5
scikit-learn 1.5.2 A set of python modules for machine learning and data mining
scipy 1.14.1 Fundamental algorithms for scientific computing in Python
sentencepiece 0.2.0 SentencePiece python wrapper
setuptools 75.6.0 Easily download, build, install, upgrade, and uninstall Python packages
six 1.16.0 Python 2 and 3 compatibility utilities
smmap 5.0.1 A pure Python implementation of a sliding window memory map manager
sqlalchemy 2.0.36 Database Abstraction Library
sqlparse 0.5.2 A non-validating SQL parser.
sympy 1.13.1 Computer algebra system (CAS) in Python
tensorboard 2.18.0 TensorBoard lets you watch Tensors Flow
tensorboard-data-server 0.7.2 Fast data loading for TensorBoard
threadpoolctl 3.5.0 threadpoolctl
tiktoken 0.8.0 tiktoken is a fast BPE tokeniser for use with OpenAI's models
torch 2.5.1+cu121 Tensors and Dynamic neural networks in Python with strong GPU acceleration
torchao 0.6.1+cu121 Package for applying ao techniques to GPU models
torchtune 0.4.0 A native-PyTorch library for LLM fine-tuning
torchvision 0.20.1+cu121 image and video datasets and models for torch deep learning
tqdm 4.67.1 Fast, Extensible Progress Meter
triton 3.1.0 A language and compiler for custom Deep Learning operations
typing-extensions 4.12.2 Backported and Experimental Type Hints for Python 3.8+
tzdata 2024.2 Provider of IANA time zone data
urllib3 2.2.3 HTTP library with thread-safe connection pooling, file post, and more.
werkzeug 3.1.3 The comprehensive WSGI web application library.
wrapt 1.17.0 Module for decorators, wrappers and monkey patching.
xxhash 3.5.0 Python binding for xxHash
yarl 1.18.0 Yet another URL library
zipp 3.21.0 Backport of pathlib-compatible object wrapper for zip files
The same import works for me too:
(mlflow-torchtune-py3.11) fg12@farm22-head1:~/repos/mlflow-tutorial/torchtune$ python3
Python 3.11.6 (main, Nov 16 2023, 10:12:38) [GCC 13.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.tensorboard import SummaryWriter
>>>
Package                   YOURS        MINE
nvidia-cublas-cu12        12.4.5.8     12.1.3.1
nvidia-cuda-cupti-cu12    12.4.127     12.1.105
nvidia-cuda-nvrtc-cu12    12.4.127     12.1.105
nvidia-cuda-runtime-cu12  12.4.127     12.1.105
nvidia-cudnn-cu12         9.1.0.70     9.1.0.70
nvidia-cufft-cu12         11.2.1.3     11.0.2.54
nvidia-curand-cu12        10.3.5.147   10.3.2.106
nvidia-cusolver-cu12      11.6.1.9     11.4.5.107
nvidia-cusparse-cu12      12.3.1.170   12.1.0.106
nvidia-nccl-cu12          2.21.5       2.21.5
nvidia-nvjitlink-cu12     12.4.127     12.6.85
nvidia-nvtx-cu12          12.4.127     12.1.105
Here is the full stack trace:
Traceback (most recent call last):
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/bin/tune", line 8, in <module>
sys.exit(main())
^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/run.py", line 208, in _run_cmd
self._run_single_device(args, is_builtin=is_builtin)
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/run.py", line 105, in _run_single_device
runpy.run_module(str(args.recipe), run_name="__main__")
File "<frozen runpy>", line 229, in run_module
File "<frozen runpy>", line 88, in _run_code
File "/nfs/users/nfs_f/fg12/repos/mlflow-tutorial/torchtune/src/full_finetune_single_device.py", line 810, in <module>
sys.exit(recipe_main())
^^^^^^^^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_parse.py", line 99, in wrapper
sys.exit(recipe_main(conf))
^^^^^^^^^^^^^^^^^
File "/nfs/users/nfs_f/fg12/repos/mlflow-tutorial/torchtune/src/full_finetune_single_device.py", line 804, in recipe_main
recipe.setup(cfg=cfg)
File "/nfs/users/nfs_f/fg12/repos/mlflow-tutorial/torchtune/src/full_finetune_single_device.py", line 258, in setup
self._metric_logger = config.instantiate(cfg.metric_logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_instantiate.py", line 112, in instantiate
return _instantiate_node(OmegaConf.to_object(config), *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_instantiate.py", line 33, in _instantiate_node
return _create_component(_component_, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_instantiate.py", line 22, in _create_component
return _component_(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/training/metric_logging.py", line 288, in __init__
from torch.utils.tensorboard import SummaryWriter
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torch/utils/tensorboard/__init__.py", line 12, in <module>
from .writer import FileWriter, SummaryWriter # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 19, in <module>
from ._embedding import get_embedding_info, make_mat, make_sprite, make_tsv, write_pbtxt
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torch/utils/tensorboard/_embedding.py", line 10, in <module>
_HAS_GFILE_JOIN = hasattr(tf.io.gfile, "join")
^^^^^
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/tensorboard/lazy.py", line 65, in __getattr__
return getattr(load_once(self), attr_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'tensorflow' has no attribute 'io'
Exception ignored in: <function TensorBoardLogger.__del__ at 0x151d69d445e0>
Traceback (most recent call last):
File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/training/metric_logging.py", line 314, in __del__
if self._writer:
^^^^^^^^^^^^
AttributeError: 'TensorBoardLogger' object has no attribute '_writer'
Thanks for the full trace. As of now, I am still unable to reproduce your error, either via the direct import of SummaryWriter or by actually running a recipe (I'm using LoRA single device w/ Llama3.1 8B as an example and it works fine).
Next I'll try installing your entire list of packages and kicking off a run, b/c my only guess at this point is that there's some dependency issue.
I want to mention that I am running the job via LSF and OpenMPI. MLflow is there because the idea is to let our users use the company tracking server, which is based on MLflow.
Maybe this can help:
[tool.poetry]
name = "mlflow-torchtune"
version = "0.1.0"
description = "Torchtune example"
authors = ["Fabio Geraci <fg12@sanger.ac.uk>"]
license = "MIT"
readme = "README.md"
[tool.poetry.dependencies]
python = "^3.11"
torch = { version = "^2.0", source = "custom-pytorch" }
torchvision = { version = "^0.20", source = "custom-pytorch" }
torchao = { version = "^0.6", source = "custom-pytorch" }
torchtune = "^0.4.0"
bitsandbytes = "^0.44.1"
tensorboard = "^2.18.0"
matplotlib = "^3.9.2"
mlflow = "^2.18.0"
python-dotenv = "^1.0.1"
[[tool.poetry.source]]
name = "custom-pytorch"
url = "https://download.pytorch.org/whl/cu121"
priority = "explicit"
[tool.poetry.group.dev.dependencies]
pytest = "^8.3.3"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
DiskLogger works correctly.