TensorBoard logging fails after a few epochs of training

SM1991CODES commented 1 year ago

Environment information (required)

Diagnostics

Diagnostics output

``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=13, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='xxxxxx', release='3.10.0-1160.el7.x86_64', version='#1 SMP Mon Oct 19 16:18:59 UTC 2020', machine='x86_64')
INFO: sys.getwindowsversion(): N/A
--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
INFO: installed: tensorboard==2.10.0
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.6.1
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.10.0'
--- check: tensorflow_python_version
Traceback (most recent call last):
  File "diagnose_tensorboard.py", line 528, in main
    suggestions.extend(check())
  File "diagnose_tensorboard.py", line 81, in wrapper
    result = fn()
  File "diagnose_tensorboard.py", line 284, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
--- check: tensorboard_data_server_version
INFO: data server binary: '/xxxx/miniconda3/envs/torchcon38/lib/python3.8/site-packages/tensorboard_data_server/bin/server'
INFO: failed to check binary version: Command '['/xxxx/miniconda3/envs/torchcon38/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--version']' returned non-zero exit status 1.
--- check: tensorboard_binary_path
INFO: which tensorboard: b'/xxxx/miniconda3/envs/torchcon38/bin/tensorboard\n'
--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
socket.SOCK_STREAM =
socket.AI_ADDRCONFIG =
socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(, , 6, '', ('::1', 0, 0, 0)), (, , 6, '', ('127.0.0.1', 0))]
Wildcard flags:
Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))]
--- check: readable_fqdn
INFO: socket.getfqdn(): 'xxxxxxx'
--- check: stat_tensorboardinfo
INFO: directory: /usr/tmp/.tensorboard-info
INFO: .tensorboard-info directory does not exist
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/xxxx/miniconda3/envs/torchcon38/lib/python3.8/site-packages']; bad_roots (0): []
--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
apptools @ file:///home/conda/feedstock_root/build_artifacts/apptools_1610582543268/work
argcomplete==1.12.3
async-timeout==4.0.2
attrs==21.4.0
av==9.2.0
av2==0.2.1
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1648854175163/work
cachetools==5.2.0
certifi==2022.6.15
cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1655906222726/work
click==8.1.3
colorlog==6.6.0
commonmark==0.9.1
configobj==5.0.6
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1657174007680/work
cycler==0.11.0
distlib==0.3.5
emoji==2.0.0
envisage @ file:///home/conda/feedstock_root/build_artifacts/envisage_1623953337124/work
filelock==3.7.1
fire==0.4.0
fonttools==4.34.4
frozenlist==1.3.0
google-auth==2.11.0
google-auth-oauthlib==0.4.6
grpcio==1.49.0
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1642433548627/work
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1653252793585/work
importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1655356668708/work
joblib==1.1.0
kdutils @ file:///usr/scratch4/samo4615/Documents/codeworks/perception_dar2/kitti_utils_pack
kiwisolver==1.4.4
llvmlite==0.38.1
loguru @ file:///home/conda/feedstock_root/build_artifacts/loguru_1649442969129/work
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.4.0
mayavi @ file:///home/conda/feedstock_root/build_artifacts/mayavi_1657592491811/work
mkl-fft==1.3.1
mkl-random==1.2.2
mkl-service==2.4.0
multidict==6.0.2
nox==2022.1.7
numba==0.55.2
numpy @ file:///opt/conda/conda-bld/numpy_and_numpy_base_1652801679809/work
oauthlib==3.2.1
olefile @ file:///home/conda/feedstock_root/build_artifacts/olefile_1602866521163/work
opencv-python==4.6.0.66
packaging==21.3
pandas==1.4.3
Pillow==6.2.1
pip==22.1.2
platformdirs==2.5.2
prettytable==3.4.1
protobuf==3.19.5
psutil==5.9.1
py==1.11.0
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pyface @ file:///home/conda/feedstock_root/build_artifacts/pyface_1647442605190/work
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1650904496387/work
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1643496850550/work
pyparsing==3.0.9
pyproj==3.3.1
pyquaternion==0.9.9
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1648857275402/work
python-dateutil==2.8.2
pytorch-quantization==2.1.2
pytz==2022.1
PyYAML==6.0
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1656534056640/work
requests-oauthlib==1.3.1
rich==12.5.1
rsa==4.9
scipy==1.8.1
setuptools==61.2.0
sip==4.19.13
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sphinx-glpi-theme==0.3
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
termcolor==1.1.0
torch==1.12.1
torch-summary==1.4.5
torchaudio==0.12.1
torchmetrics==0.9.3
torchvision==0.13.1
tqdm==4.64.1
traits @ file:///home/conda/feedstock_root/build_artifacts/traits_1649412915995/work
traitsui @ file:///home/conda/feedstock_root/build_artifacts/traitsui_1656509629024/work
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1656706066251/work
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1658789158161/work
virtualenv==20.16.2
vtk==9.1.0
wcwidth==0.2.5
Werkzeug==2.2.2
wheel==0.37.1
wslink==1.6.6
yarl==1.7.2
zipp==3.8.1
``````

Issue description

I am using TensorBoard from PyTorch to log my training. Training works fine for some epochs. I log the loss for every batch, and I also log per-epoch losses and similar metrics. At seemingly random points, such as epochs 16, 18, and 28 (in this case), training gets stuck: it does not crash, but it also does not progress. The error messages seem to point to something going wrong with TensorBoard logging.

I also get a Remote I/O error, OSError(121), when writing to the events.out.tfevents file.

Please help.

bileschi commented 1 year ago

Is your log directory on a local disk or are you logging to a remote file system? Can you include the full error log? Thanks.
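If it is unclear whether the log directory is local or remote, something like the following rough sketch may help on Linux. It looks up the mount point that contains a path in mount-table text such as the contents of `/proc/self/mounts`. The helper names (`filesystem_for`, `is_probably_remote`) and the `REMOTE_FS_TYPES` set are made up for illustration, and the set is an assumption that is certainly incomplete.

```python
import os

# Assumption: an illustrative, incomplete set of network filesystem type names.
REMOTE_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "fuse.sshfs", "lustre", "gpfs", "ceph"}


def filesystem_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text is expected in /proc/self/mounts format:
    "<device> <mount_point> <fs_type> <options> ...", one mount per line.
    Symlinks are not resolved, and escaped characters in mount points
    (for example "\\040" for a space) are not decoded.
    """
    path = os.path.abspath(path).rstrip("/") + "/"
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Compare with trailing slashes so "/mnt/share" does not match "/mnt/shared".
        if path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_probably_remote(path, mounts_text):
    """Heuristic: True if path lives on a filesystem type listed in REMOTE_FS_TYPES."""
    found = filesystem_for(path, mounts_text)
    return found is not None and found[1] in REMOTE_FS_TYPES
```

On a Linux machine this could be called as `is_probably_remote(logdir, open("/proc/self/mounts").read())`, where `logdir` is whatever directory the writer logs to. A result of `False` only means the type is not in the illustrative set.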

groszewn commented 1 year ago

It's possible this is related to #6167 (writer hanging on Exception in background thread), which should be fixed by #6168.
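To illustrate the general failure mode (not TensorBoard's actual code, and not a description of what #6168 changes), here is a minimal toy sketch of an asynchronous writer. A worker thread drains a bounded queue; if the write raises, for example an `OSError`, the worker stops. A producer that then calls a plain blocking `put()` on the full queue would wait forever, which looks like training that is stuck without crashing. In this sketch the producer instead polls for a recorded worker failure and re-raises it. The class and parameter names (`BackgroundWriter`, `write_fn`, `max_queue`, `poll_seconds`) are hypothetical.

```python
import queue
import threading


class BackgroundWriter:
    """Toy async writer: a worker thread drains a bounded queue of items."""

    def __init__(self, write_fn, max_queue=2):
        self._write_fn = write_fn
        self._queue = queue.Queue(maxsize=max_queue)
        self._error = None  # set by the worker if a write fails
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop the worker
                return
            try:
                self._write_fn(item)
            except Exception as exc:
                # Record the failure so the producer can see it, then stop.
                # Without this, nothing drains the queue and nobody is told.
                self._error = exc
                return

    def add(self, item, poll_seconds=0.05):
        """Enqueue item, surfacing a worker failure instead of blocking forever."""
        while True:
            if self._error is not None:
                raise RuntimeError("background writer failed") from self._error
            try:
                self._queue.put(item, timeout=poll_seconds)
                return
            except queue.Full:
                continue  # re-check for a worker failure, then retry

    def close(self):
        """Ask the worker to stop (surfacing any failure) and wait for it."""
        self.add(None)
        self._worker.join()
```

With a `write_fn` that raises `OSError(121, "Remote I/O error")`, the first few `add()` calls may still succeed while the queue has room, and a later call raises `RuntimeError` with the original `OSError` attached as its cause.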

groszewn commented 1 year ago

Hi @SM1991CODES, I believe #6168 should have resolved this. Feel free to reopen if this is not the case.
