tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

Read record error #5116

Closed: tradingjunkie closed this issue 3 years ago

tradingjunkie commented 3 years ago

Environment TensorBoard 2.5.0; Tensorflow 2.5.0

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

Diagnostics

Diagnostics output `````` --- check: autoidentify INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da --- check: general INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0) INFO: os.name: posix INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='3e52c33c1851', release='5.4.0-74-generic', version='#83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021', machine='x86_64') INFO: sys.getwindowsversion(): N/A --- check: package_management INFO: has conda-meta: False INFO: $VIRTUAL_ENV: None WARNING: The directory '/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. --- check: installed_packages INFO: installed: tensorboard==2.5.0 INFO: installed: tensorflow==2.5.0 INFO: installed: tensorflow-estimator==2.5.0rc0 INFO: installed: tensorboard-data-server==0.6.1 --- check: tensorboard_python_version INFO: tensorboard.version.VERSION: '2.5.0' 2021-07-08 20:57:09.014915: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 --- check: tensorflow_python_version INFO: tensorflow.__version__: '2.5.0' INFO: tensorflow.__git_version__: 'v2.5.0-rc3-213-ga4dfb8d1a71' --- check: tensorboard_data_server_version INFO: data server binary: '/usr/local/lib/python3.6/dist-packages/tensorboard_data_server/bin/server' Traceback (most recent call last): File "/workspace/diagnose_tensorboard.py", line 522, in main suggestions.extend(check()) File "/workspace/diagnose_tensorboard.py", line 75, in wrapper result = fn() File "/workspace/diagnose_tensorboard.py", line 301, in tensorboard_data_server_version check=True, File "/usr/lib/python3.6/subprocess.py", line 423, in run with Popen(*popenargs, **kwargs) as process: TypeError: __init__() got an unexpected keyword argument 
'capture_output' --- check: tensorboard_binary_path INFO: which tensorboard: b'/usr/local/bin/tensorboard\n' --- check: addrinfos socket.has_ipv6 = True socket.AF_UNSPEC = socket.SOCK_STREAM = socket.AI_ADDRCONFIG = socket.AI_PASSIVE = Loopback flags: Loopback infos: [(, , 6, '', ('127.0.0.1', 0))] Wildcard flags: Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))] --- check: readable_fqdn INFO: socket.getfqdn(): '3e52c33c1851' --- check: stat_tensorboardinfo INFO: directory: /tmp/.tensorboard-info INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=107217744, st_dev=53, st_nlink=2, st_uid=3003, st_gid=3003, st_size=4096, st_atime=1624626790, st_mtime=1625777517, st_ctime=1625777517) INFO: mode: 0o40777 --- check: source_trees_without_genfiles INFO: tensorboard_roots (1): ['/usr/local/lib/python3.6/dist-packages']; bad_roots (0): [] WARNING: The directory '/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
--- check: full_pip_freeze INFO: pip freeze --all: absl-py==0.12.0 anyio==3.2.1 appdirs==1.4.4 argon2-cffi==20.1.0 asn1crypto==0.24.0 astunparse==1.6.3 async-generator==1.10 attrs==21.2.0 audioread==2.1.9 Babel==2.9.1 backcall==0.2.0 bleach==3.3.0 cached-property==1.5.2 cachetools==4.2.2 certifi==2020.12.5 cffi==1.14.5 chardet==4.0.0 cloudpickle==1.6.0 contextvars==2.4 cryptography==2.1.4 cycler==0.10.0 dataclasses==0.8 decorator==5.0.9 defusedxml==0.7.1 dill==0.3.3 dm-tree==0.1.6 entrypoints==0.3 flatbuffers==1.12 future==0.18.2 gast==0.4.0 google-auth==1.30.0 google-auth-oauthlib==0.4.4 google-pasta==0.2.0 googleapis-common-protos==1.53.0 graphviz==0.16 grpcio==1.34.1 h5py==3.1.0 horovod==0.22.0 idna==2.6 immutables==0.15 importlib-metadata==4.0.1 importlib-resources==5.1.3 ipykernel==5.5.5 ipython==7.16.1 ipython-genutils==0.2.0 jedi==0.18.0 Jinja2==3.0.1 joblib==1.0.1 json5==0.9.6 jsonschema==3.2.0 jupyter-client==6.1.12 jupyter-core==4.7.1 jupyter-server==1.9.0 jupyterlab==3.0.16 jupyterlab-pygments==0.1.2 jupyterlab-server==2.6.0 keras-nightly==2.5.0.dev2021032900 Keras-Preprocessing==1.1.2 keyring==10.6.0 keyrings.alt==3.0 kiwisolver==1.3.1 librosa==0.8.1 llvmlite==0.36.0 Markdown==3.3.4 MarkupSafe==2.0.1 matplotlib==3.3.4 mistune==0.8.4 nbclassic==0.3.1 nbclient==0.5.3 nbconvert==6.0.7 nbformat==5.1.3 nest-asyncio==1.5.1 notebook==6.4.0 numba==0.53.1 numpy==1.19.5 oauthlib==3.1.0 opt-einsum==3.3.0 packaging==20.9 pandocfilters==1.4.3 parso==0.8.2 pexpect==4.8.0 pickleshare==0.7.5 Pillow==8.2.0 pip==20.2.4 pooch==1.3.0 prometheus-client==0.11.0 promise==2.3 prompt-toolkit==3.0.19 protobuf==3.17.0 psutil==5.8.0 ptyprocess==0.7.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.20 pycrypto==2.6.1 pygit2==1.6.1 Pygments==2.9.0 pygobject==3.26.1 pyparsing==2.4.7 pyrsistent==0.17.3 python-apt==1.6.5+ubuntu0.5 python-dateutil==2.8.1 pytz==2021.1 pyxdg==0.25 PyYAML==5.4.1 pyzmq==22.1.0 requests==2.25.1 requests-oauthlib==1.3.0 requests-unixsocket==0.2.0 
resampy==0.2.2 rsa==4.7.2 scikit-learn==0.24.2 scipy==1.5.4 SecretStorage==2.3.1 Send2Trash==1.7.1 setuptools==56.2.0 six==1.15.0 sniffio==1.2.0 SoundFile==0.10.3.post1 ssh-import-id==5.7 tensorboard==2.5.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.0 tensorflow==2.5.0 tensorflow-addons==0.13.0 tensorflow-datasets==4.3.0 tensorflow-estimator==2.5.0rc0 tensorflow-metadata==0.30.0 tensorflow-probability==0.12.2 termcolor==1.1.0 terminado==0.10.1 testpath==0.5.0 threadpoolctl==2.1.0 tornado==6.1 tqdm==4.60.0 traitlets==4.3.3 typeguard==2.12.0 typing-extensions==3.7.4.3 urllib3==1.26.4 wcwidth==0.2.5 webencodings==0.5.1 websocket-client==1.1.0 Werkzeug==2.0.0 wheel==0.36.2 wrapt==1.12.1 zipp==3.4.1 ``````

Next steps

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.

Issue I see the following error when trying to run Tensorboard to visualize logs:

[ WARN rustboard_core::run] Read error in /workspace/log/events.out.tfevents.1624569739.8a1c50246dae.30295.276717.v2: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x07980329), want: MaskedCrc(0x00000000) }))

For some context, I am using a shared file system between several machines and this error comes up when trying to run Tensorboard on machine A, pointing to a log generated by machine B.
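For context on what that checksum protects: event files use TFRecord framing, where each record is a little-endian u64 payload length, a masked CRC-32C of those 8 length bytes, the payload itself, and a masked CRC-32C of the payload. Below is a rough, pure-Python sketch of that framing check, independent of TensorBoard's own readers; the function names are made up for illustration.

```python
import struct

def _build_crc32c_table():
    """Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC_TABLE = _build_crc32c_table()

def crc32c(data: bytes) -> int:
    """Plain CRC-32C with the usual init/final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc(data: bytes) -> int:
    """TFRecord 'masked' CRC: rotate the CRC right by 15 bits, add a constant."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF

def first_bad_record(path: str):
    """Return the byte offset of the first corrupt record, or None if clean.

    Each record is framed as:
      u64 length | u32 masked_crc(length bytes) | payload | u32 masked_crc(payload)
    """
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(12)
            if len(header) < 12:
                return None  # clean EOF (a truncated header is tolerated here)
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc(header[:8]) != length_crc:
                return offset  # the BadLengthCrc case from the warning above
            payload = f.read(length)
            footer = f.read(4)
            if len(payload) < length or len(footer) < 4:
                return offset  # truncated record
            (payload_crc,) = struct.unpack("<I", footer)
            if masked_crc(payload) != payload_crc:
                return offset
            offset += 12 + length + 4
```

Running `first_bad_record` on a file that triggers the warning should point at the offending offset; if the header bytes at that offset read back as all zeros, the corruption matches the shared-filesystem symptoms discussed later in this thread.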

psybuzz commented 3 years ago

Thanks for the report. Summaries written by TensorFlow's tf.summary.* APIs are supposed to produce event files with proper checksums. The error could indicate that either the event file in question has been modified, or perhaps it was written in an unexpected way. Could you share how machine B is logging events?

If the machine in question is not using something like:

import tensorflow as tf
writer = tf.summary.create_file_writer('test/logdir')
with writer.as_default():
    tf.summary.scalar('loss', 0.345, step=1)

would it be possible to share the summary writing code for us to investigate?

If you trust the source that logs your summary data, and do not care about this checksum warning, it is also possible to suppress the check by passing extra flags to tensorboard: tensorboard --logdir my_logdir --extra_data_server_flags=--no-checksum

arghyaganguly commented 3 years ago

@tradingjunkie , please share the information requested by @psybuzz in his last comment (how machine B logs events, and the code used to write summaries).

arghyaganguly commented 3 years ago

@tradingjunkie , closing this, as the ticket has been inactive awaiting your response for some time. Please feel free to reopen if required. Thanks.

vadimcn commented 2 years ago

Had this exact problem when TB was reading logs exported from another machine via an NFS share.

I've instrumented RustBoard to print out the contents of the last read block on CRC errors, and sure enough, it was all zeros!
Probably not RustBoard's fault though; I would be inclined to blame this on a faulty NFS driver, especially since similar bugs have happened in the past.

Interestingly enough, this problem does not occur when the log files are read from Python (which is how I tried to create a repro case). Not sure why; maybe it is just due to the difference in speed. But this allowed me to work around it by writing a Python script that replicates the logs to a local file system, from which they are read by TB.
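A minimal sketch of such a replication script might look like the following; the paths, the filename glob, and the sync interval here are illustrative assumptions, not the exact script described above.

```python
import pathlib
import shutil
import time

def sync_once(src: pathlib.Path, dst: pathlib.Path) -> None:
    """Copy any new or grown event files from the shared mount to local disk."""
    for event_file in src.rglob("events.out.tfevents.*"):
        target = dst / event_file.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Re-copy when the source has grown, so TB picks up new records.
        if not target.exists() or target.stat().st_size < event_file.stat().st_size:
            shutil.copy2(event_file, target)

def mirror_logs(src: str, dst: str, interval_s: float = 30.0) -> None:
    """Run forever, syncing periodically; point TensorBoard at `dst`."""
    while True:
        sync_once(pathlib.Path(src), pathlib.Path(dst))
        time.sleep(interval_s)
```

For example, run `mirror_logs("/mnt/shared/logs", "/tmp/tb-local")` in one process and `tensorboard --logdir /tmp/tb-local` in another (both paths are made up), so TensorBoard only ever reads from the local filesystem.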

svobora commented 2 years ago

This is still an issue. When multiple machines are writing logs to shared Samba storage, the logs are not processed correctly. Sometimes tensorboard reads several epochs before the failure. Perhaps the read operation is performed while the log file is being written, and on the first sign of an error, tensorboard blocks the offending file and does not try to re-read it later.

[2022-05-23T09:08:56Z WARN rustboard_core::run] Read error in ./logs/20220520-110941/events.out.tfevents.1653045009.g12.3001.0.v2: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x07980329), want: MaskedCrc(0x00000000) }))

To get around the issue, I need to restart tensorboard every time I want to look at the results.

HannesStark commented 2 years ago

@svobora did you find a solution to this issue?

vadimcn commented 2 years ago

> Perhaps the read operation is performed during writing of the log file...

Log records are length-prefixed, and the TB log reader correctly waits until it has read the requisite number of bytes before processing the record (I've studied the log reader source). I can't imagine how this error could occur without a screw-up somewhere in the filesystem stack. So technically TB is not at fault here.

Still, it would be nice if, on a checksum error, it tried to recover by re-reading all bytes after the last successfully processed record. Unfortunately, I could not find an easy way to do this: the TB reader is built around non-seekable streams, so it would require far too much change for a drive-by pull request.

@arghyaganguly Any chance you could re-open this issue and have TB devs look into implementing a fix?

zhixuanli commented 1 year ago

> This is still an issue. When multiple machines are writing logs to a shared samba storage, the logs are not processes correctly. Sometimes tensorboard reads several epochs before the failure. Perhaps the read operation is performed during writing of the log file, and on first sign of error, tensorboard blocks the offending file and is not trying to re-read the log file later.
>
> [2022-05-23T09:08:56Z WARN rustboard_core::run] Read error in ./logs/20220520-110941/events.out.tfevents.1653045009.g12.3001.0.v2: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x07980329), want: MaskedCrc(0x00000000) }))
>
> To get around the issue, I need to restart tensorboard every time I want to look at the results.

Still having this problem.

svobora commented 1 year ago

> > Perhaps the read operation is performed during writing of the log file...
>
> Log records are length-prefixed and TB log reader correctly waits till it has read the requisite number of bytes before processing the record (I've studied log reader source). I can't imagine how this error can occur without a screw-up somewhere in the filesystem stack. So technically TB is not at fault here.
>
> Still, it would be nice if on a checksum error, it would try to recover by re-reading all bytes after the last successfully processed record. Unfortunately, I could not find an easy way to do this, since TB reader is built around non seek-able streams, so it would require way too much change for a drive-by pull request.
>
> @arghyaganguly Any chance you could re-open this issue and have TB devs look into implementing a fix?

Sorry, but it is typical for bugs that the devs "can't imagine" how they can happen... The obvious fix is to not add CRC-failed log files to the ignore list, and possibly to retry reading them later, with errors silenced. Another solution is to write log files to temporary files and only move them to their final location (an atomic operation) once they are fully written to disk.
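The second suggestion, write-then-rename, could be sketched roughly as below. Note it only fits files that are no longer being appended to (e.g. after a run finishes), since tf.summary writers append to event files throughout training; the function name and paths are illustrative.

```python
import os
import shutil
import tempfile

def publish_atomically(src_path: str, dst_dir: str) -> str:
    """Copy a finished event file into dst_dir under a hidden temp name,
    then rename it into place. The rename is atomic within one POSIX
    filesystem, so a reader never observes a half-written file."""
    os.makedirs(dst_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, prefix=".incoming-")
    os.close(fd)
    shutil.copy2(src_path, tmp_path)
    final_path = os.path.join(dst_dir, os.path.basename(src_path))
    os.replace(tmp_path, final_path)  # atomic rename, no partial reads
    return final_path
```

The key design point is that the temp file and the final location live in the same directory, so `os.replace` never has to copy across filesystems (which would not be atomic).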

siyi-wind commented 1 year ago

Using fast data loading tensorboard --logdir /path/to/logs --load_fast true solved this issue. https://github.com/tensorflow/tensorboard/issues/4784

zhixuanli commented 1 year ago

> Using fast data loading tensorboard --logdir /path/to/logs --load_fast true solved this issue. #4784

Thanks for your suggestion!

svobora commented 1 year ago

> Using fast data loading tensorboard --logdir /path/to/logs --load_fast true solved this issue. #4784

The problem persists.

rkechols commented 1 year ago

I'm getting the same problem after copying the log files from a remote machine (Linux) to my local machine (MacOS). Does this mean that TB logs can only be read on the machine where they were created? Copying the file somehow corrupts it?

nfelt commented 1 year ago

@rkechols There shouldn't be any constraint about TB logs only being readable on certain machines; the file format itself should be portable. Most likely the file itself is being corrupted during the copy somehow. I'd suggest taking a checksum of the file contents before and after copying it (e.g. the SHA-256 hash or something) to confirm if that's happening.
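That before/after comparison can be done with standard tools; the paths below are illustrative, and `shasum -a 256` is the macOS equivalent of `sha256sum`.

```shell
# On the source machine, before copying:
sha256sum /path/to/logdir/events.out.tfevents.*

# On the destination machine, after copying:
sha256sum /path/to/local-copy/events.out.tfevents.*

# The hashes should match line for line; any mismatch means the copy
# itself corrupted the file before TensorBoard ever read it.
```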

rkechols commented 1 year ago

@nfelt it turns out you were right that the file was corrupted "in transit". I re-copied the file a different way and had no issue.

svobora commented 1 year ago

@rkechols Would you mind not spamming a topic irrelevant to your issue?

nfelt commented 1 year ago

@svobora That kind of remark is uncalled for, please be polite if you're going to participate in the issue thread. I agree it's not ultimately the exact same issue, but it's reasonable that @rkechols was confused.

If your concern is that this issue should be re-opened, I can re-open and re-title it accordingly, but there's no guarantee that we'll address it right away, especially since this issue only affects cases where the file transiently appears to be corrupted.

B217040 commented 1 year ago

> @nfelt it turns out you were right that the file was corrupted "in transit". I re-copied the file a different way and had no issue.

I'm trying to do the same thing and having the same issue: I'm transferring the log file through git. Which way did you get it to work?

svobora commented 1 year ago

Still having the issue, I was drinking tea when the issue happened. Do you know how high water temperature is optimal for green tea?

Just kidding, I solved the issue by switching from EXT4 to BTRFS on the target network storage (Synology device).

qazi0 commented 6 months ago

> Still having the issue, I was drinking tea when the issue happened. Do you know how high water temperature is optimal for green tea?
>
> Just kidding, I solved the issue by switching from EXT4 to BTRFS on the target network storage (Synology device).

@svobora can you please explain? I'm running into this error while running parallel training trials with Ray Tune in an Azure Linux amd64 VM (ext4). Are you suggesting there's no way to get it to work on ext4?

svobora commented 5 months ago

> @svobora can you please explain? While running in an Azure Linux amd64 VM (ext4) I'm running into this error while running parallel training trials with Ray Tune. Are you suggesting there's no way to get to work in ext4?

As I said, I never managed to fix it until I switched the filesystem. Then it finally started working.