pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

TensorBoardLogger: AttributeError: module 'tensorflow' has no attribute 'io' #2090

Open fabiogeraci opened 3 days ago

fabiogeraci commented 3 days ago

DiskLogger works correctly

tensorboard 2.18.0
tensorboard-data-server 0.7.2
torch 2.5.1+cu121
torchao 0.6.1+cu121
torchtune 0.4.0
torchvision 0.20.1+cu121

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.TensorBoardLogger
  log_dir: ${output_dir}
output_dir: ${env:ARTIFACT_LOCATION}/full-llama3.1-finetune/
log_every_n_steps: 10
log_peak_memory_stats: True

# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: True

  #Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  #`torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  #trace options passed to `torch.profiler.profile`
  profile_memory: True
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
AttributeError: module 'tensorflow' has no attribute 'io'
Exception ignored in: <function TensorBoardLogger.__del__ at 0x1524dcb2b2e0>
Traceback (most recent call last):
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/training/metric_logging.py", line 314, in __del__
    if self._writer:
       ^^^^^^^^^^^^
AttributeError: 'TensorBoardLogger' object has no attribute '_writer'
joecummings commented 3 days ago

Hey @fabiogeraci - taking a look at this. Are you using a distributed or single device recipe?

fabiogeraci commented 3 days ago

Single at the moment

joecummings commented 3 days ago

Got it! It looks like you might be hitting an import error, which causes the TensorBoardLogger to shut down incorrectly b/c __init__ fails before it can set up the actual writer (self._writer) on the class. This happened in our WandBLogger too (#1322).

I've created a PR that might fix this #2092; however, there's a good chance of some earlier error that might help you resolve this faster. Perhaps looking into AttributeError: module 'tensorflow' has no attribute 'io'? When I Google that, I get taken to this StackOverflow page that has some suggestions.
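To illustrate the failure mode described above, here is a minimal sketch. It is not the torchtune source and not the contents of #2092; `FragileLogger`, `close`, and `try_build` are hypothetical names. It assumes only standard Python behaviour: when `__init__` raises, the object already exists, so `__del__` still runs on it. Guarding the attribute access with `getattr` keeps cleanup from raising a second, misleading error.

```python
class FragileLogger:
    """Hypothetical stand-in for a logger whose __init__ can fail early."""

    def __init__(self, fail: bool = False):
        if fail:
            # Simulates the import-time error raised before the writer exists.
            raise AttributeError("module 'tensorflow' has no attribute 'io'")
        self._writer = object()  # placeholder for a real SummaryWriter

    def close(self) -> bool:
        # getattr with a default tolerates a half-constructed instance.
        writer = getattr(self, "_writer", None)
        if writer is not None:
            self._writer = None
            return True
        return False

    def __del__(self):
        # Safe even if __init__ never reached the _writer assignment.
        self.close()


def try_build(fail: bool) -> str:
    """Construct the logger and report only the original error, if any."""
    try:
        FragileLogger(fail=fail)
    except AttributeError as exc:
        return f"init failed: {exc}"
    return "ok"
```

With this guard, only the original error from `__init__` surfaces; the secondary "object has no attribute '_writer'" message from `__del__` no longer appears.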

fabiogeraci commented 3 days ago

In theory it should use from torch.utils.tensorboard import SummaryWriter. Could it be an issue with the torch version?

joecummings commented 3 days ago

I have the following dependency versions:

(test) [jrcummings@devvm4767.pnb0 ~/projects]$ pip list
Package                  Version
------------------------ ----------
absl-py                  2.1.0
filelock                 3.16.1
fsspec                   2024.10.0
grpcio                   1.68.0
Jinja2                   3.1.4
Markdown                 3.7
MarkupSafe               3.0.2
mpmath                   1.3.0
networkx                 3.4.2
numpy                    2.1.3
nvidia-cublas-cu12       12.4.5.8
nvidia-cuda-cupti-cu12   12.4.127
nvidia-cuda-nvrtc-cu12   12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.1.3
nvidia-curand-cu12       10.3.5.147
nvidia-cusolver-cu12     11.6.1.9
nvidia-cusparse-cu12     12.3.1.170
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.4.127
packaging                24.2
pip                      24.2
protobuf                 5.29.0
setuptools               75.1.0
six                      1.16.0
sympy                    1.13.1
tensorboard              2.18.0
tensorboard-data-server  0.7.2
torch                    2.5.1
triton                   3.1.0
typing_extensions        4.12.2
Werkzeug                 3.1.3
wheel                    0.44.0

And I am able to run the following with no problems:

(test) [jrcummings@devvm4767.pnb0 ~/projects]$ python
Python 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.tensorboard import SummaryWriter
>>>

Maybe a conflicting dependency with tensorflow as pointed out in the above StackOverflow post?
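One way to check for such a conflict is to ask Python where the name `tensorflow` would resolve from, without importing it. The sketch below uses only the standard library; `describe_module` is a hypothetical helper name. It assumes the check is run in the same environment and working directory as the failing job, since `sys.path` can differ between them.

```python
import importlib.util


def describe_module(name: str) -> dict:
    """Report where a top-level module would be imported from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"name": name, "found": False}
    return {
        "name": name,
        "found": True,
        # File the module would load from; typically None for namespace packages.
        "origin": spec.origin,
        # Directories searched for submodules (empty for plain modules).
        "search_locations": list(spec.submodule_search_locations or []),
    }


if __name__ == "__main__":
    print(describe_module("tensorflow"))
```

If this reports `found: True` even though no tensorflow package is installed, the reported locations point at whatever is shadowing the name, for example a stray directory called `tensorflow` somewhere on `sys.path`. That is only one possible cause, not a confirmed diagnosis for this issue.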

fabiogeraci commented 3 days ago

mine

poetry show
absl-py                            2.1.0        Abseil Python Common Libraries, see https://github.com/abseil/abseil-py.
aiohappyeyeballs                   2.4.3        Happy Eyeballs for asyncio
aiohttp                            3.11.8       Async http client/server framework (asyncio)
aiosignal                          1.3.1        aiosignal: a list of registered asynchronous callbacks
alembic                            1.14.0       A database migration tool for SQLAlchemy.
antlr4-python3-runtime             4.9.3        ANTLR 4.9.3 runtime for Python 3.7
attrs                              24.2.0       Classes Without Boilerplate
bitsandbytes                       0.44.1       k-bit optimizers and matrix multiplication routines.
blinker                            1.9.0        Fast, simple object-to-object and broadcast signaling
blobfile                           3.0.0        Read GCS, ABS and local paths with the same interface, clone of tensorflow.io.gfile
cachetools                         5.5.0        Extensible memoizing collections and decorators
certifi                            2024.8.30    Python package for providing Mozilla's CA Bundle.
charset-normalizer                 3.4.0        The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
click                              8.1.7        Composable command line interface toolkit
cloudpickle                        3.1.0        Pickler class to extend the standard pickle.Pickler functionality
contourpy                          1.3.1        Python library for calculating contours of 2D quadrilateral grids
cycler                             0.12.1       Composable style cycles
databricks-sdk                     0.38.0       Databricks SDK for Python (Beta)
datasets                           3.1.0        HuggingFace community-driven open-source library of datasets
deprecated                         1.2.15       Python @deprecated decorator to deprecate old python classes, functions or methods.
dill                               0.3.8        serialize all of Python
docker                             7.1.0        A Python library for the Docker Engine API.
filelock                           3.16.1       A platform independent file lock.
flask                              3.1.0        A simple framework for building complex web applications.
fonttools                          4.55.0       Tools to manipulate font files
frozenlist                         1.5.0        A list-like structure which implements collections.abc.MutableSequence
fsspec                             2024.9.0     File-system specification
gitdb                              4.0.11       Git Object Database
gitpython                          3.1.43       GitPython is a Python library used to interact with Git repositories
google-auth                        2.36.0       Google Authentication Library
graphene                           3.4.3        GraphQL Framework for Python
graphql-core                       3.2.5        GraphQL implementation for Python, a port of GraphQL.js, the JavaScript reference implementation for GraphQL.
graphql-relay                      3.2.0        Relay library for graphql-core
greenlet                           3.1.1        Lightweight in-process concurrent programming
grpcio                             1.68.0       HTTP/2-based RPC framework
gunicorn                           23.0.0       WSGI HTTP Server for UNIX
huggingface-hub                    0.26.3       Client library to download and publish models, datasets and other repos on the huggingface.co hub
idna                               3.10         Internationalized Domain Names in Applications (IDNA)
importlib-metadata                 8.5.0        Read metadata from Python packages
iniconfig                          2.0.0        brain-dead simple config-ini parsing
itsdangerous                       2.2.0        Safely pass data to untrusted environments and back.
jinja2                             3.1.4        A very fast and expressive template engine.
joblib                             1.4.2        Lightweight pipelining with Python functions
kiwisolver                         1.4.7        A fast implementation of the Cassowary constraint solver
loguru                             0.7.2        Python logging made (stupidly) simple
lxml                               5.3.0        Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.
mako                               1.3.6        A super-fast templating language that borrows the best ideas from the existing templating languages.
markdown                           3.7          Python implementation of John Gruber's Markdown.
markupsafe                         3.0.2        Safely add untrusted strings to HTML/XML markup.
matplotlib                         3.9.2        Python plotting package
mlflow                             2.18.0       MLflow is an open source platform for the complete machine learning lifecycle
mlflow-skinny                      2.18.0       MLflow is an open source platform for the complete machine learning lifecycle
mpmath                             1.3.0        Python library for arbitrary-precision floating-point arithmetic
multidict                          6.1.0        multidict implementation
multiprocess                       0.70.16      better multiprocessing and multithreading in Python
networkx                           3.4.2        Python package for creating and manipulating graphs and networks
numpy                              2.1.3        Fundamental package for array computing in Python
nvidia-cublas-cu12                 12.1.3.1     CUBLAS native runtime libraries
nvidia-cuda-cupti-cu12             12.1.105     CUDA profiling tools runtime libs.
nvidia-cuda-nvrtc-cu12             12.1.105     NVRTC native runtime libraries
nvidia-cuda-runtime-cu12           12.1.105     CUDA Runtime native Libraries
nvidia-cudnn-cu12                  9.1.0.70     cuDNN runtime libraries
nvidia-cufft-cu12                  11.0.2.54    CUFFT native runtime libraries
nvidia-curand-cu12                 10.3.2.106   CURAND native runtime libraries
nvidia-cusolver-cu12               11.4.5.107   CUDA solver native runtime libraries
nvidia-cusparse-cu12               12.1.0.106   CUSPARSE native runtime libraries
nvidia-nccl-cu12                   2.21.5       NVIDIA Collective Communication Library (NCCL) Runtime
nvidia-nvjitlink-cu12              12.6.85      Nvidia JIT LTO Library
nvidia-nvtx-cu12                   12.1.105     NVIDIA Tools Extension
omegaconf                          2.3.0        A flexible configuration library
opentelemetry-api                  1.28.2       OpenTelemetry Python API
opentelemetry-sdk                  1.28.2       OpenTelemetry Python SDK
opentelemetry-semantic-conventions 0.49b2       OpenTelemetry Semantic Conventions
packaging                          24.2         Core utilities for Python packages
pandas                             2.2.3        Powerful data structures for data analysis, time series, and statistics
pillow                             11.0.0       Python Imaging Library (Fork)
pluggy                             1.5.0        plugin and hook calling mechanisms for python
propcache                          0.2.0        Accelerated property cache
protobuf                           5.29.0       
psutil                             6.1.0        Cross-platform lib for process and system monitoring in Python.
pyarrow                            18.1.0       Python library for Apache Arrow
pyasn1                             0.6.1        Pure-Python implementation of ASN.1 types and DER/BER/CER codecs (X.208)
pyasn1-modules                     0.4.1        A collection of ASN.1-based protocols modules
pycryptodomex                      3.21.0       Cryptographic library for Python
pyparsing                          3.2.0        pyparsing module - Classes and methods to define and execute parsing grammars
pytest                             8.3.3        pytest: simple powerful testing with Python
python-dateutil                    2.9.0.post0  Extensions to the standard Python datetime module
python-dotenv                      1.0.1        Read key-value pairs from a .env file and set them as environment variables
pytz                               2024.2       World timezone definitions, modern and historical
pyyaml                             6.0.2        YAML parser and emitter for Python
regex                              2024.11.6    Alternative regular expression module, to replace re.
requests                           2.32.3       Python HTTP for Humans.
rsa                                4.9          Pure-Python RSA implementation
safetensors                        0.4.5        
scikit-learn                       1.5.2        A set of python modules for machine learning and data mining
scipy                              1.14.1       Fundamental algorithms for scientific computing in Python
sentencepiece                      0.2.0        SentencePiece python wrapper
setuptools                         75.6.0       Easily download, build, install, upgrade, and uninstall Python packages
six                                1.16.0       Python 2 and 3 compatibility utilities
smmap                              5.0.1        A pure Python implementation of a sliding window memory map manager
sqlalchemy                         2.0.36       Database Abstraction Library
sqlparse                           0.5.2        A non-validating SQL parser.
sympy                              1.13.1       Computer algebra system (CAS) in Python
tensorboard                        2.18.0       TensorBoard lets you watch Tensors Flow
tensorboard-data-server            0.7.2        Fast data loading for TensorBoard
threadpoolctl                      3.5.0        threadpoolctl
tiktoken                           0.8.0        tiktoken is a fast BPE tokeniser for use with OpenAI's models
torch                              2.5.1+cu121  Tensors and Dynamic neural networks in Python with strong GPU acceleration
torchao                            0.6.1+cu121  Package for applying ao techniques to GPU models
torchtune                          0.4.0        A native-PyTorch library for LLM fine-tuning
torchvision                        0.20.1+cu121 image and video datasets and models for torch deep learning
tqdm                               4.67.1       Fast, Extensible Progress Meter
triton                             3.1.0        A language and compiler for custom Deep Learning operations
typing-extensions                  4.12.2       Backported and Experimental Type Hints for Python 3.8+
tzdata                             2024.2       Provider of IANA time zone data
urllib3                            2.2.3        HTTP library with thread-safe connection pooling, file post, and more.
werkzeug                           3.1.3        The comprehensive WSGI web application library.
wrapt                              1.17.0       Module for decorators, wrappers and monkey patching.
xxhash                             3.5.0        Python binding for xxHash
yarl                               1.18.0       Yet another URL library
zipp                               3.21.0       Backport of pathlib-compatible object wrapper for zip files

Same result for me:

(mlflow-torchtune-py3.11) fg12@farm22-head1:~/repos/mlflow-tutorial/torchtune$ python3
Python 3.11.6 (main, Nov 16 2023, 10:12:38) [GCC 13.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.tensorboard import SummaryWriter
>>> 
fabiogeraci commented 3 days ago
Package                   YOURS       MINE
nvidia-cublas-cu12        12.4.5.8    12.1.3.1
nvidia-cuda-cupti-cu12    12.4.127    12.1.105
nvidia-cuda-nvrtc-cu12    12.4.127    12.1.105
nvidia-cuda-runtime-cu12  12.4.127    12.1.105
nvidia-cudnn-cu12         9.1.0.70    9.1.0.70
nvidia-cufft-cu12         11.2.1.3    11.0.2.54
nvidia-curand-cu12        10.3.5.147  10.3.2.106
nvidia-cusolver-cu12      11.6.1.9    11.4.5.107
nvidia-cusparse-cu12      12.3.1.170  12.1.0.106
nvidia-nccl-cu12          2.21.5      2.21.5
nvidia-nvjitlink-cu12     12.4.127    12.6.85
nvidia-nvtx-cu12          12.4.127    12.1.105
fabiogeraci commented 3 days ago

the whole error stack

Traceback (most recent call last):
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/run.py", line 208, in _run_cmd
    self._run_single_device(args, is_builtin=is_builtin)
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/_cli/run.py", line 105, in _run_single_device
    runpy.run_module(str(args.recipe), run_name="__main__")
  File "<frozen runpy>", line 229, in run_module
  File "<frozen runpy>", line 88, in _run_code
  File "/nfs/users/nfs_f/fg12/repos/mlflow-tutorial/torchtune/src/full_finetune_single_device.py", line 810, in <module>
    sys.exit(recipe_main())
             ^^^^^^^^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/nfs/users/nfs_f/fg12/repos/mlflow-tutorial/torchtune/src/full_finetune_single_device.py", line 804, in recipe_main
    recipe.setup(cfg=cfg)
  File "/nfs/users/nfs_f/fg12/repos/mlflow-tutorial/torchtune/src/full_finetune_single_device.py", line 258, in setup
    self._metric_logger = config.instantiate(cfg.metric_logger)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_instantiate.py", line 112, in instantiate
    return _instantiate_node(OmegaConf.to_object(config), *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_instantiate.py", line 33, in _instantiate_node
    return _create_component(_component_, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/config/_instantiate.py", line 22, in _create_component
    return _component_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/training/metric_logging.py", line 288, in __init__
    from torch.utils.tensorboard import SummaryWriter
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torch/utils/tensorboard/__init__.py", line 12, in <module>
    from .writer import FileWriter, SummaryWriter  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 19, in <module>
    from ._embedding import get_embedding_info, make_mat, make_sprite, make_tsv, write_pbtxt
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torch/utils/tensorboard/_embedding.py", line 10, in <module>
    _HAS_GFILE_JOIN = hasattr(tf.io.gfile, "join")
                              ^^^^^
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/tensorboard/lazy.py", line 65, in __getattr__
    return getattr(load_once(self), attr_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'tensorflow' has no attribute 'io'
Exception ignored in: <function TensorBoardLogger.__del__ at 0x151d69d445e0>
Traceback (most recent call last):
  File "/software/isg/users/fg12/envs/virtualenvs/mlflow-torchtune-31tjdLhK-py3.11/lib/python3.11/site-packages/torchtune/training/metric_logging.py", line 314, in __del__
    if self._writer:
       ^^^^^^^^^^^^
AttributeError: 'TensorBoardLogger' object has no attribute '_writer'
joecummings commented 2 days ago

Thanks for the full trace. As of now, I am still unable to reproduce your error, either via the direct import of SummaryWriter or by actually running a recipe (I'm using LoRA single device w/ Llama3.1 8B as an example and it works fine).

Next I'll try downloading your entire list of packages and kick off a run b/c my only guess at this point is there's some dependency issue.

fabiogeraci commented 2 days ago

I want to mention that I am running the job via LSF and OpenMPI. MLflow is there because the idea is to allow our users to use the company tracking server, which is based on MLflow.

maybe this can help

[tool.poetry]
name = "mlflow-torchtune"
version = "0.1.0"
description = "Torchtune example"
authors = ["Fabio Geraci <fg12@sanger.ac.uk>"]
license = "MIT"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.11"
torch = { version = "^2.0", source = "custom-pytorch" }
torchvision = { version = "^0.20", source = "custom-pytorch" }
torchao = { version = "^0.6", source = "custom-pytorch" }
torchtune = "^0.4.0"
bitsandbytes = "^0.44.1"
tensorboard = "^2.18.0"
matplotlib = "^3.9.2"
mlflow = "^2.18.0"
python-dotenv = "^1.0.1"

[[tool.poetry.source]]
name = "custom-pytorch"
url = "https://download.pytorch.org/whl/cu121"
priority = "explicit"

[tool.poetry.group.dev.dependencies]
pytest = "^8.3.3"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"