Closed SM1991CODES closed 1 year ago
Is your log directory on a local disk or are you logging to a remote file system? Can you include the full error log? Thanks.
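If it helps to answer the first question, here is a minimal sketch for checking on Linux which file system a log directory sits on. It assumes the mount table is available in `/proc/mounts` format. The helper names `fs_type_for` and `is_remote` and the `REMOTE_FS_TYPES` set are made up for illustration and are not part of TensorBoard or PyTorch, and the list of remote file system types is not exhaustive.

```python
import os

# Hypothetical, non-exhaustive set of file system types that are usually network-backed.
REMOTE_FS_TYPES = {
    "nfs", "nfs4", "cifs", "smb3", "lustre", "ceph",
    "glusterfs", "fuse.sshfs", "9p", "beegfs", "gpfs",
}


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text is text in /proc/mounts format:
    "<device> <mount_point> <fs_type> <options> <dump> <pass>" per line.
    Symlinks are not resolved and escaped spaces (\\040) are not decoded.
    """
    path = os.path.abspath(path)
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so "/data" does not match "/database".
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best[0]):
            best = (mount_point, fs_type)
    return best


def is_remote(path, mounts_text):
    """Guess whether path lives on a network file system."""
    _, fs_type = fs_type_for(path, mounts_text)
    return fs_type in REMOTE_FS_TYPES


# Example usage on a Linux machine (the log directory path is a placeholder):
#   with open("/proc/mounts") as f:
#       print(fs_type_for("/path/to/your/logdir", f.read()))
```

If the reported type is something like `nfs4` or `cifs`, the event files are being written over the network, which would be useful to mention along with the full error log.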
It's possible this is related to #6167 (writer hanging on Exception in background thread), which should be fixed by #6168.
Hi @SM1991CODES, I believe #6168 should have resolved this. Feel free to reopen if this is not the case.
Environment information (required)
Diagnostics
Diagnostics output
```
--- check: autoidentify
INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=13, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='xxxxxx', release='3.10.0-1160.el7.x86_64', version='#1 SMP Mon Oct 19 16:18:59 UTC 2020', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.10.0
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.10.0'

--- check: tensorflow_python_version
Traceback (most recent call last):
  File "diagnose_tensorboard.py", line 528, in main
    suggestions.extend(check())
  File "diagnose_tensorboard.py", line 81, in wrapper
    result = fn()
  File "diagnose_tensorboard.py", line 284, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

--- check: tensorboard_data_server_version
INFO: data server binary: '/xxxx/miniconda3/envs/torchcon38/lib/python3.8/site-packages/tensorboard_data_server/bin/server'
INFO: failed to check binary version: Command '['/xxxx/miniconda3/envs/torchcon38/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--version']' returned non-zero exit status 1.

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/xxxx/miniconda3/envs/torchcon38/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
```
Issue description
I am using TensorBoard from PyTorch to log my training. Training runs fine for some epochs. I log the loss for every batch and also log per-epoch losses. At seemingly random points, such as epochs 16, 18, and 28 in this case, training gets stuck: it does not crash, but it also does not progress. The error messages seem to point to something going wrong with TensorBoard logging.
I also get a Remote I/O error, OSError(121), when the events.out.tfevents file is written.
Please help.