Hi @andrejfd, sorry you are having an issue. Can you clarify how you know that the file isn't being written until training terminates? Cloud filesystems can be tricky and often use layers of caching, so while the program thinks it's writing, the reader doesn't see the complete file until the file handle is closed.
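For a quick check (a minimal sketch, not your actual code; the `logdir` path and scalar name here are placeholders), you could write a single scalar and then list the directory after `flush()` and again after `close()`:

```python
import os
import tensorflow as tf

# Placeholder log directory; substitute the mount/path your job writes to.
logdir = "/tmp/tb_visibility_check"

writer = tf.summary.create_file_writer(logdir)
with writer.as_default():
    tf.summary.scalar("probe", 0.5, step=0)

# flush() pushes buffered events into the events file...
writer.flush()
print("after flush:", os.listdir(logdir))

# ...while close() also releases the file handle, which on some remote
# filesystems is what finally makes the file visible to other readers.
writer.close()
print("after close:", os.listdir(logdir))
```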
Hi @bileschi. Yes, I run `ls` in my log directory in a web terminal and don't see anything there. However, the directories `train` and `test` are created, but within them there is no `tf.summary` file until the job (training) is complete.
I have seen that running `self.train_writer.close()` immediately populates the file in the directory, but then I obviously can't write to it anymore. Is there a way to reopen the file and resume writing on the following epoch?
I suspect what's going on is that the system is not making the files available until it's done writing them. To test this suspicion, can you try the following? Create `self.train_writer` and `self.test_writer` inside the for loop over epochs, and then hand them to the `log_summary` method. This should result in separate runs per epoch, but would at least test the suspicion that it is the file close operation which triggers availability, and not the end of the process creating the files.
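Roughly like this (a sketch only; `log_summary`, the directory layout, and the stubbed-out training step are assumptions based on this thread, not real code):

```python
import tensorflow as tf

def log_summary(train_writer, test_writer, epoch, train_loss, test_loss):
    # Stand-in for the log_summary method mentioned above.
    with train_writer.as_default():
        tf.summary.scalar("loss", train_loss, step=epoch)
    with test_writer.as_default():
        tf.summary.scalar("loss", test_loss, step=epoch)

def fit(logdir, epochs):
    for epoch in range(epochs):
        # Fresh writers each epoch; closing them below is what should make
        # each epoch's events file visible on the remote filesystem.
        train_writer = tf.summary.create_file_writer(f"{logdir}/train/epoch_{epoch}")
        test_writer = tf.summary.create_file_writer(f"{logdir}/test/epoch_{epoch}")

        train_loss, test_loss = 0.0, 0.0  # stand-ins for the real train/eval steps

        log_summary(train_writer, test_writer, epoch, train_loss, test_loss)

        train_writer.close()
        test_writer.close()

fit("/tmp/logs", epochs=2)
```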
You are correct. Here is the directory after 2 epochs. @bileschi
Ok. So then this is not an issue with TensorBoard so much as an issue with the file system. Is the workaround sufficient for your use case? If not, you should reach out to the experts in Databricks and the Amazon file system. You could also have the system write more often than once per epoch, all the way to the extreme of writing one file per step (though you will get a whole lot of files, which may lead to other sorts of issues if you are creating files faster than one every 30s).
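If you do go that route, one way to sketch it (hypothetical paths, a stubbed training step, and the same file-count caveat applies) is to close and recreate the writer on the same run directory every N steps; TensorBoard reads the multiple events files in one directory as a single run:

```python
import tensorflow as tf

logdir = "/tmp/logs/train"  # hypothetical path
flush_every = 100           # close/reopen every N steps

writer = tf.summary.create_file_writer(logdir)
for step in range(1000):
    loss = 0.0  # stand-in for the real training step
    with writer.as_default():
        tf.summary.scalar("loss", loss, step=step)
    if (step + 1) % flush_every == 0:
        # Closing releases the handle so the current events file shows up;
        # recreating the writer starts a new events file in the same run dir.
        writer.close()
        writer = tf.summary.create_file_writer(logdir)
writer.close()
```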
Yes this will work for now. Thanks for the quick help.
Environment information (required)
Diagnostics
Diagnostics output
``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='1122-034244-vmhw0z4b-10-28-75-209', release='5.4.0-1088-aws', version='#96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '/local_disk0/.ephemeral_nfs/envs/pythonEnv-a4b38f71-a5b9-4070-9845-3788f835b006'

--- check: installed_packages
INFO: installed: tensorboard==2.9.1
INFO: installed: tensorflow==2.9.1
INFO: installed: tensorflow-estimator==2.9.0
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.9.1'

--- check: tensorflow_python_version
2023-01-11 15:33:15.361331: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
INFO: tensorflow.__version__: '2.9.1'
INFO: tensorflow.__git_version__: 'v2.9.0-18-gd8ce9f9c301'

--- check: tensorboard_data_server_version
INFO: data server binary: '/databricks/python/lib/python3.9/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/databricks/python3/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
``````

Next steps
No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.
Issue description
I am working on Databricks, which uses AWS EC2 instances, to train a TF model using a custom training loop.
I am also using Horovod for distributed training.
When I run a training job, the `tf.summary.scalar`s I am writing do not get written until after training is completed, at which point they fill up the `tf.events` file as desired. In my fit method I run:
So the function is executing eagerly.
Any idea why the `tf.summary_writer` is not writing until after training is completed? Could it be due to the lack of CPU access?

Thanks, Andrej