tensorflow / tensorboard

TensorFlow's Visualization Toolkit

Tensorboard Summary Scalars Don't Write Until Training is Over #6139

Closed andrejfd closed 1 year ago

andrejfd commented 1 year ago

Environment information (required)

Diagnostics

Diagnostics output `````` --- check: autoidentify INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1 --- check: general INFO: sys.version_info: sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0) INFO: os.name: posix INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='1122-034244-vmhw0z4b-10-28-75-209', release='5.4.0-1088-aws', version='#96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022', machine='x86_64') INFO: sys.getwindowsversion(): N/A --- check: package_management INFO: has conda-meta: False INFO: $VIRTUAL_ENV: '/local_disk0/.ephemeral_nfs/envs/pythonEnv-a4b38f71-a5b9-4070-9845-3788f835b006' --- check: installed_packages INFO: installed: tensorboard==2.9.1 INFO: installed: tensorflow==2.9.1 INFO: installed: tensorflow-estimator==2.9.0 INFO: installed: tensorboard-data-server==0.6.1 --- check: tensorboard_python_version INFO: tensorboard.version.VERSION: '2.9.1' --- check: tensorflow_python_version 2023-01-11 15:33:15.361331: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. INFO: tensorflow.__version__: '2.9.1' INFO: tensorflow.__git_version__: 'v2.9.0-18-gd8ce9f9c301' --- check: tensorboard_data_server_version INFO: data server binary: '/databricks/python/lib/python3.9/site-packages/tensorboard_data_server/bin/server' INFO: data server binary version: b'rustboard 0.6.1' --- check: tensorboard_binary_path INFO: which tensorboard: b'/databricks/python3/bin/tensorboard\n' --- check: addrinfos socket.has_ipv6 = True socket.AF_UNSPEC = socket.SOCK_STREAM = socket.AI_ADDRCONFIG = socket.AI_PASSIVE = Loopback flags: Loopback infos: [(, , 6, '', ('::1', 0, 0, 0)), (, , 6, '', ('127.0.0.1', 0))] Wildcard flags: Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))] --- check: readable_fqdn INFO: socket.getfqdn(): '1122-034244-vmhw0z4b-10-28-75-209' --- check: stat_tensorboardinfo INFO: directory: /tmp/.tensorboard-info INFO: .tensorboard-info directory does not exist --- check: source_trees_without_genfiles INFO: tensorboard_roots (1): ['/databricks/python/lib/python3.9/site-packages']; bad_roots (0): [] --- check: full_pip_freeze INFO: pip freeze --all: absl-py==1.0.0 argon2-cffi==20.1.0 astor==0.8.1 astunparse==1.6.3 async-generator==1.10 attrs==21.2.0 azure-core==1.22.1 azure-cosmos==4.2.0 backcall==0.2.0 backports.entry-points-selectable==1.1.1 bcrypt==4.0.0 black==22.3.0 bleach==4.0.0 blis==0.7.8 boto3==1.21.18 botocore==1.24.18 cachetools==5.2.0 catalogue==2.0.8 certifi==2021.10.8 cffi==1.14.6 chardet==4.0.0 charset-normalizer==2.0.4 click==8.0.3 cloudpickle==2.0.0 cmdstanpy==0.9.68 confection==0.0.1 configparser==5.2.0 convertdate==2.4.0 cryptography==3.4.8 cycler==0.10.0 cymem==2.0.6 Cython==0.29.24 databricks-automl-runtime==0.2.11 databricks-cli==0.17.3 dbl-tempo==0.1.12 dbus-python==1.2.16 debugpy==1.4.1 decorator==5.1.0 defusedxml==0.7.1 dill==0.3.4 diskcache==5.4.0 distlib==0.3.6 distro==1.4.0 distro-info===0.23ubuntu1 entrypoints==0.3 ephem==4.1.3 facets-overview==1.0.0 fasttext==0.9.2 filelock==3.3.1 Flask==1.1.2 flatbuffers==1.12 fsspec==2021.8.1 future==0.18.2 gast==0.4.0 gitdb==4.0.9 GitPython==3.1.27 google-auth==2.6.0 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 grpcio==1.44.0 gunicorn==20.1.0 gviz-api==1.10.0 h5py==3.3.0 hijri-converter==2.2.4 holidays==0.15 
horovod==0.25.0 htmlmin==0.1.12 huggingface-hub==0.9.1 idna==3.2 ImageHash==4.3.0 imbalanced-learn==0.8.1 importlib-metadata==4.8.1 ipykernel==6.12.1 ipython==7.32.0 ipython-genutils==0.2.0 ipywidgets==7.7.0 isodate==0.6.1 itsdangerous==2.0.1 jedi==0.18.0 Jinja2==2.11.3 jmespath==0.10.0 joblib==1.0.1 joblibspark==0.5.0 jsonschema==3.2.0 jupyter-client==6.1.12 jupyter-core==4.8.1 jupyterlab-pygments==0.1.2 jupyterlab-widgets==1.0.0 keras==2.9.0 Keras-Preprocessing==1.1.2 kiwisolver==1.3.1 korean-lunar-calendar==0.3.1 langcodes==3.3.0 libclang==14.0.6 lightgbm==3.3.2 llvmlite==0.37.0 LunarCalendar==0.0.9 Mako==1.2.0 Markdown==3.3.6 MarkupSafe==2.0.1 matplotlib==3.4.3 matplotlib-inline==0.1.2 missingno==0.5.1 mistune==0.8.4 mleap==0.20.0 mlflow-databricks-artifacts==2.0.0 mlflow-skinny==1.29.0 multimethod==1.9 murmurhash==1.0.8 mypy-extensions==0.4.3 nbclient==0.5.3 nbconvert==6.1.0 nbformat==5.1.3 nest-asyncio==1.5.1 networkx==2.6.3 nltk==3.6.5 notebook==6.4.5 numba==0.54.1 numpy==1.20.3 oauthlib==3.2.0 opt-einsum==3.3.0 packaging==21.0 pandas==1.3.4 pandas-profiling==3.1.0 pandocfilters==1.4.3 paramiko==2.9.2 parso==0.8.2 pathspec==0.9.0 pathy==0.6.2 patsy==0.5.2 petastorm==0.12.0 pexpect==4.8.0 phik==0.12.2 pickleshare==0.7.5 Pillow==8.4.0 pip==21.2.4 platformdirs==2.5.2 plotly==5.9.0 pmdarima==1.8.5 preshed==3.0.7 prompt-toolkit==3.0.20 prophet==1.0.1 protobuf==3.19.4 psutil==5.8.0 psycopg2==2.9.3 ptyprocess==0.7.0 py4j==0.10.9.5 pyarrow==7.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pybind11==2.10.0 pycparser==2.20 pydantic==1.9.2 Pygments==2.10.0 PyGObject==3.36.0 PyJWT==2.5.0 PyMeeus==0.5.11 PyNaCl==1.5.0 pyodbc==4.0.31 pyparsing==3.0.4 pyrsistent==0.18.0 pyspark==3.3.1 pystan==2.19.1.1 python-apt==2.0.0+ubuntu0.20.4.8 python-dateutil==2.8.2 python-editor==1.0.4 pytz==2021.3 PyWavelets==1.1.1 PyYAML==6.0 pyzmq==22.2.1 regex==2021.8.3 requests==2.26.0 requests-oauthlib==1.3.1 requests-unixsocket==0.2.0 rsa==4.9 s3transfer==0.5.2 scikit-learn==0.24.2 scipy==1.7.1 seaborn==0.11.2 Send2Trash==1.8.0 setuptools==58.0.4 setuptools-git==1.2 shap==0.41.0 simplejson==3.17.6 six==1.16.0 slicer==0.0.7 smart-open==5.2.1 smmap==5.0.0 spacy==3.4.1 spacy-legacy==3.0.10 spacy-loggers==1.0.3 spark-tensorflow-distributor==1.0.0 sqlparse==0.4.2 srsly==2.4.4 ssh-import-id==5.10 statsmodels==0.12.2 tabulate==0.8.9 tangled-up-in-unicode==0.1.0 tenacity==8.0.1 tensorboard==2.9.1 tensorboard-data-server==0.6.1 tensorboard-plugin-profile==2.8.0 tensorboard-plugin-wit==1.8.1 tensorflow==2.9.1 tensorflow-estimator==2.9.0 tensorflow-io-gcs-filesystem==0.27.0 termcolor==2.0.1 terminado==0.9.4 testpath==0.5.0 thinc==8.1.2 threadpoolctl==2.2.0 tokenize-rt==4.2.1 tokenizers==0.12.1 tomli==2.0.1 torch==1.12.1+cu113 torchvision==0.13.1+cu113 tornado==6.1 tqdm==4.62.3 traitlets==5.1.0 transformers==4.21.2 typer==0.4.2 typing-extensions==3.10.0.2 ujson==4.0.2 unattended-upgrades==0.1 urllib3==1.26.7 virtualenv==20.8.0 visions==0.7.4 wasabi==0.10.1 wcwidth==0.2.5 webencodings==0.5.1 websocket-client==1.3.1 Werkzeug==2.0.2 wheel==0.37.0 widgetsnbextension==3.6.0 wrapt==1.12.1 xgboost==1.6.2 zipp==3.6.0 ``````

Next steps

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.

Issue description

I am working on Databricks, which uses AWS EC2 instances, to train a TF model with a custom training loop.

I am also using horovod for distributed training.

When I run a training job, the tf.summary.scalar values I write do not show up until after training is completed, at which point they all appear in the tf.events file as expected.

In my fit method I run:


for epoch in range(self.parameterManager.epochs):
    print(f"Epoch {epoch+1}")

    train_start_time = perf_counter()
    # train_loss = self.run_train(model, train_batch, train_steps=train_steps)
    train_metrics = self.run_train(model, train_batch, train_steps=train_steps)
    train_metric_results = [tm.result().numpy() for tm in train_metrics]
    train_loss = train_metrics[0].result()
    train_end_time = perf_counter()

    validation_metrics = self.run_validation(model, validation_batch, val_steps=val_steps)
    validation_metric_results = [vm.result().numpy() for vm in validation_metrics]
    validation_loss = validation_metrics[0].result()

    train_loss = hvd.allreduce(train_loss).numpy()
    validation_loss = hvd.allreduce(validation_loss).numpy()

    if hvd.rank() == 0:
        train_time = tf.constant(timedelta(seconds=train_end_time - train_start_time).total_seconds(), dtype=tf.float32)
        val_time = tf.constant(timedelta(seconds=perf_counter() - train_end_time).total_seconds(), dtype=tf.float32)

        self.log_summary(train_loss=train_loss,
                         train_time=train_time,
                         train_metrics=train_metric_results,
                         val_loss=validation_loss,
                         val_time=val_time,
                         val_metrics=validation_metric_results,
                         step=epoch)

        self.train_writer.flush()
        self.test_writer.flush()

def log_summary(
    self, train_loss, train_time, train_metrics,
    val_loss, val_time, val_metrics,
    step
):
    if hvd.rank() == 0:
        print("Setting up tf summary writers...")
        print("Directory: " + self.log_dir + 'train')
        print("Directory: " + self.log_dir + 'test')

        if not self.train_writer:
            self.train_writer = tf.summary.create_file_writer(self.log_dir + 'train', max_queue=1)
        if not self.test_writer:
            self.test_writer = tf.summary.create_file_writer(self.log_dir + 'test', max_queue=1)

    with self.train_writer.as_default():
        tf.summary.scalar('loss', train_loss, step=step)
        tf.summary.scalar('time [s]', train_time, step=step)
        for m in range(len(self.metric_names)):
            _ = tf.summary.scalar(self.metric_names[m], train_metrics[m], step=step)

    with self.test_writer.as_default():
        tf.summary.scalar('loss', val_loss, step=step)
        tf.summary.scalar('time [s]', val_time, step=step)
        for m in range(len(self.metric_names)):
            tf.summary.scalar(self.metric_names[m], val_metrics[m], step=step)

So the function is executing eagerly.

Any idea why the tf.summary file writer is not writing until after training is completed? Could it be due to a lack of CPU access?

Thanks, Andrej

bileschi commented 1 year ago

Hi @andrejfd, sorry you are having an issue. Can you clarify how you know that the file isn't being written until training terminates? Cloud filesystems can be tricky and often use layers of caching, so while the program thinks it is writing, the reader may not see the complete file until the file handle is closed.

andrejfd commented 1 year ago

Hi @bileschi. Yes, I run ls on my log directory in a web terminal and don't see anything there. The train and test directories are created, but within them there is no summary events file until the job (training) is complete.

I have seen that calling self.train_writer.close() immediately makes the file appear in the directory, but then I obviously can't write to it anymore.

Is there a way to reopen the file and resume writing on the following epoch?

bileschi commented 1 year ago

I suspect what's going on is that the system is not making the files available until it is done writing them. To test this suspicion, can you try the following: create self.train_writer and self.test_writer inside the for loop over epochs, and then hand them to the log_summary method? This should result in separate runs per epoch, but it would at least test the suspicion that it is the file close operation that triggers availability, rather than the end of the process that created the files.
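For reference, a minimal, self-contained sketch of that experiment could look like the following. The `log_dir` path, epoch count, and dummy loss values here are illustrative assumptions, not code from this thread, and the Horovod rank guard from the original loop is omitted for brevity:

```python
import tensorflow as tf

log_dir = '/tmp/tb_demo/'   # assumed path; substitute the real log_dir
num_epochs = 3              # assumed epoch count

for epoch in range(num_epochs):
    # Create fresh writers inside the epoch loop ...
    train_writer = tf.summary.create_file_writer(log_dir + 'train', max_queue=1)
    test_writer = tf.summary.create_file_writer(log_dir + 'test', max_queue=1)

    # ... hand them to the logging code (inlined here with dummy values) ...
    with train_writer.as_default():
        tf.summary.scalar('loss', 1.0 / (epoch + 1), step=epoch)
    with test_writer.as_default():
        tf.summary.scalar('loss', 1.2 / (epoch + 1), step=epoch)

    # ... and close them so this epoch's events file is finalized on disk.
    # Closing (rather than only flushing) is what we expect to make the file
    # visible right away on a file system that only exposes completed files.
    train_writer.close()
    test_writer.close()
```

Note that because every epoch's events file lands in the same train and test directories, TensorBoard should still merge them into one run each; pointing each epoch at its own subdirectory would instead give literally one run per epoch.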

andrejfd commented 1 year ago

You are correct. Here is the directory after 2 epochs. @bileschi

[Screenshot: log directory listing showing per-epoch event files after 2 epochs]

bileschi commented 1 year ago

OK. So this is not an issue with TensorBoard so much as an issue with the file system. Is the workaround sufficient for your use case? If not, you should reach out to the experts on Databricks and the Amazon file system. You could also have the system write more often than once per epoch, all the way to the extreme of writing one file per step (though you will get a whole lot of files, which may lead to other sorts of issues if you are creating files faster than roughly one every 30 s).
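As a middle ground between once per epoch and once per step, one option (not discussed further in this thread) is to close and recreate the writer every N steps, so completed event files appear during training without producing one file per step. A rough sketch under that assumption, with made-up names such as `flush_every_n_steps` and a dummy loop standing in for real training:

```python
import tensorflow as tf

log_dir = '/tmp/tb_demo/train'   # assumed run directory
flush_every_n_steps = 100        # assumed cadence; tune so files appear no faster than ~1 per 30 s
total_steps = 1000               # dummy stand-in for the real number of training steps

writer = tf.summary.create_file_writer(log_dir, max_queue=1)
for step in range(total_steps):
    loss = 1.0 / (step + 1)      # dummy value standing in for the real loss
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)

    if (step + 1) % flush_every_n_steps == 0:
        # close() finalizes the current events file so the file system exposes it;
        # the next writer simply starts another events file in the same run directory,
        # and TensorBoard shows all of them as a single run.
        writer.close()
        writer = tf.summary.create_file_writer(log_dir, max_queue=1)

writer.close()
```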

andrejfd commented 1 year ago

Yes this will work for now. Thanks for the quick help.