tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

Parsing HParams log file #3091

Closed sumanthratna closed 4 years ago

sumanthratna commented 4 years ago

Environment information (required)

Diagnostics

Diagnostics output `````` --- check: autoidentify Traceback (most recent call last): File "", line 470, in main File "", line 78, in wrapper File "", line 148, in autoidentify File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/inspect.py", line 973, in getsource lines, lnum = getsourcelines(object) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/inspect.py", line 955, in getsourcelines lines, lnum = findsource(object) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/inspect.py", line 786, in findsource raise OSError('could not get source code') OSError: could not get source code --- check: general INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0) INFO: os.name: posix INFO: os.uname(): posix.uname_result(sysname='Darwin', nodename='SumanthRatnasAir', release='19.3.0', version='Darwin Kernel Version 19.3.0: Sun Dec 8 22:27:29 PST 2019; root:xnu-6153.80.8.0.1~15/RELEASE_X86_64', machine='x86_64') INFO: sys.getwindowsversion(): N/A --- check: package_management INFO: has conda-meta: False INFO: $VIRTUAL_ENV: '/Users/suman/.local/share/virtualenvs/mlencrypt-research-PwGoqeJm' --- check: installed_packages INFO: installed: tensorboard==2.0.2 INFO: installed: tensorflow==2.0.0 INFO: installed: tensorflow-estimator==2.0.1 --- check: tensorboard_python_version INFO: tensorboard.version.VERSION: '2.0.2' --- check: tensorflow_python_version INFO: tensorflow.__version__: '2.0.0' INFO: tensorflow.__git_version__: 'v2.0.0-rc2-26-g64c3d382ca' --- check: tensorboard_binary_path INFO: which tensorboard: b'/Users/suman/.local/share/virtualenvs/mlencrypt-research-PwGoqeJm/bin/tensorboard\n' --- check: addrinfos socket.has_ipv6 = True socket.AF_UNSPEC = socket.SOCK_STREAM = socket.AI_ADDRCONFIG = socket.AI_PASSIVE = Loopback flags: Loopback infos: [(, , 6, '', ('::1', 0, 0, 0)), (, , 6, '', ('127.0.0.1', 0))] Wildcard flags: Wildcard infos: [(, , 6, '', ('::', 0, 0, 0)), (, 
, 6, '', ('0.0.0.0', 0))] --- check: readable_fqdn INFO: socket.getfqdn(): 'SumanthRatnasAir' --- check: stat_tensorboardinfo INFO: directory: /var/folders/j8/_wc022w91rvctlq5_t8xqcvc0000gn/T/.tensorboard-info INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=72099577, st_dev=16777220, st_nlink=2, st_uid=501, st_gid=20, st_size=64, st_atime=1577728248, st_mtime=1577747090, st_ctime=1577747090) INFO: mode: 0o40777 --- check: source_trees_without_genfiles INFO: tensorboard_roots (1): ['/Users/suman/.local/share/virtualenvs/mlencrypt-research-PwGoqeJm/lib/python3.6/site-packages']; bad_roots (0): [] --- check: full_pip_freeze INFO: pip freeze --all: absl-py==0.9.0 ansible-cmdb==1.30 astor==0.8.1 cachetools==4.0.0 certifi==2019.11.28 cffi==1.13.2 chardet==3.0.4 cheroot==8.2.1 CherryPy==18.5.0 Click==7.0 cloudpickle==1.1.1 cryptography==2.8 cycler==0.10.0 decorator==4.4.1 gast==0.2.2 gitdb2==2.0.6 GitPython==3.0.5 google-auth==1.10.0 google-auth-oauthlib==0.4.1 google-pasta==0.1.8 grpcio==1.26.0 h5py==2.10.0 idna==2.8 importlib-resources==1.0.2 jaraco.classes==3.1.0 jaraco.collections==3.0.0 jaraco.functools==3.0.0 jaraco.text==3.2.0 jsonxs==0.6 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.0 kiwisolver==1.1.0 Mako==1.1.0 Markdown==3.1.1 MarkupSafe==1.1.1 matplotlib==3.1.2 more-itertools==8.0.2 numpy==1.18.0 oauthlib==3.1.0 opt-einsum==3.1.0 pandas==0.25.3 pip==19.3.1 portend==2.6 protobuf==3.11.2 psutil==5.6.7 pyasn1==0.4.8 pyasn1-modules==0.2.7 pycparser==2.19 pyparsing==2.4.6 python-dateutil==2.8.1 pytz==2019.3 PyYAML==5.2 requests==2.22.0 requests-oauthlib==1.3.0 rsa==4.0 scipy==1.4.1 seaborn==0.9.0 setuptools==42.0.1 six==1.13.0 smmap2==2.0.5 tempora==2.0.0 tensorboard==2.0.2 tensorflow==2.0.0 tensorflow-estimator==2.0.1 termcolor==1.1.0 urllib3==1.25.7 ushlex==0.99.1 Werkzeug==0.16.0 wheel==0.33.6 wrapt==1.11.2 zc.lockfile==2.0 ``````

Next steps

No action items identified.

Issue description

I want to export all of my hyperparameters data to a CSV file. Since #3060 won't be merged into TensorBoard for a while, I decided I'd use tensorflow.python.summary.summary_iterator.summary_iterator to manually export to a CSV. Here's a very simple script:

from tensorflow.python.summary.summary_iterator import summary_iterator
si = summary_iterator(
    "snowy/snowy1/events.out.tfevents.1577067052.snowy.1378.5.v2")
count = 0
for e in si:
    count += 1
    print(str(count) + ': ' + str(e.summary.value))

It outputs:

1: []
2: [tag: "_hparams_/experiment"
metadata {
  plugin_data {
    plugin_name: "hparams"
    content: "\022\332\007\";\n\013update_rule \001**\n\016\032\014anti_hebbian\n\t\032\007hebbian\n\r\032\013random_walk\"\035\n\005tpm_k \0032\022\t\000\000\000\000\000\000\020@\021\000\000\000\000\000\0008@\"\035\n\005tpm_n \0032\022\t\000\000\000\000\000\000\020@\021\000\000\000\000\000\0008@\"\035\n\005tpm_l \0032\022\t\000\000\000\000\000\000\020@\021\000\000\000\000\000\0008@\"1\n\nkey_length \003*!\n\t\021\000\000\000\000\000\000`@\n\t\021\000\000\000\000\000\000h@\n\t\021\000\000\000\000\000\000p@\"\333\005\n\tiv_length \003*\313\005\n\t\021\000\000\000\000\000\000\000\000\n\t\021\000\000\000\000\000\000\020@\n\t\021\000\000\000\000\000\000 @\n\t\021\000\000\000\000\000\000(@\n\t\021\000\000\000\000\000\0000@\n\t\021\000\000\000\000\000\0004@\n\t\021\000\000\000\000\000\0008@\n\t\021\000\000\000\000\000\000<@\n\t\021\000\000\000\000\000\000@@\n\t\021\000\000\000\000\000\000B@\n\t\021\000\000\000\000\000\000D@\n\t\021\000\000\000\000\000\000F@\n\t\021\000\000\000\000\000\000H@\n\t\021\000\000\000\000\000\000J@\n\t\021\000\000\000\000\000\000L@\n\t\021\000\000\000\000\000\000N@\n\t\021\000\000\000\000\000\000P@\n\t\021\000\000\000\000\000\000Q@\n\t\021\000\000\000\000\000\000R@\n\t\021\000\000\000\000\000\000S@\n\t\021\000\000\000\000\000\000T@\n\t\021\000\000\000\000\000\000U@\n\t\021\000\000\000\000\000\000V@\n\t\021\000\000\000\000\000\000W@\n\t\021\000\000\000\000\000\000X@\n\t\021\000\000\000\000\000\000Y@\n\t\021\000\000\000\000\000\000Z@\n\t\021\000\000\000\000\000\000[@\n\t\021\000\000\000\000\000\000\\@\n\t\021\000\000\000\000\000\000]@\n\t\021\000\000\000\000\000\000^@\n\t\021\000\000\000\000\000\000_@\n\t\021\000\000\000\000\000\000`@\n\t\021\000\000\000\000\000\200`@\n\t\021\000\000\000\000\000\000a@\n\t\021\000\000\000\000\000\200a@\n\t\021\000\000\000\000\000\000b@\n\t\021\000\000\000\000\000\200b@\n\t\021\000\000\000\000\000\000c@\n\t\021\000\000\000\000\000\200c@\n\t\021\000\000\000\000\000\000d@\n\t\021\000\000\000\000\000\200d@\n\t\021\000\0
00\000\000\000\000e@\n\t\021\000\000\000\000\000\200e@\n\t\021\000\000\000\000\000\000f@\n\t\021\000\000\000\000\000\200f@\n\t\021\000\000\000\000\000\000g@\n\t\021\000\000\000\000\000\200g@\n\t\021\000\000\000\000\000\000h@\n\t\021\000\000\000\000\000\200h@\n\t\021\000\000\000\000\000\000i@\n\t\021\000\000\000\000\000\200i@\n\t\021\000\000\000\000\000\000j@\n\t\021\000\000\000\000\000\200j@\n\t\021\000\000\000\000\000\000k@\n\t\021\000\000\000\000\000\200k@\n\t\021\000\000\000\000\000\000l@\n\t\021\000\000\000\000\000\200l@\n\t\021\000\000\000\000\000\000m@\n\t\021\000\000\000\000\000\200m@\n\t\021\000\000\000\000\000\000n@\n\t\021\000\000\000\000\000\200n@\n\t\021\000\000\000\000\000\000o@\n\t\021\000\000\000\000\000\200o@\n\t\021\000\000\000\000\000\000p@*\024\n\014\022\ntime_taken\032\004Time*\027\n\013\022\teve_score\032\010Eve sync"
  }
}
]

I don't really care about the first (empty) list that's printed, but the second list has what I want. Printing e.summary.value.tag and e.summary.value.metadata both raise AttributeError: 'google.protobuf.pyext._message.RepeatedCompositeCo' object has no attribute. e.summary.value[0] and e.summary.value[1] raise IndexError: list index out of range.

According to summary.proto it looks like metadata and tag should both be valid attributes.

NOTE: this looks like it should be a question for Stack Overflow, but at the beginning of the output I get:

WARNING:tensorflow:From .../python3.6/site-packages/tensorflow_core/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

Maybe a method was removed because of deprecation and this altered the functionality of summary_iterator?
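For context, the events file itself is just a TFRecord file containing serialized Event protos. As a minimal sketch (assuming the standard TFRecord framing of uint64 length, length CRC, payload, payload CRC, and skipping CRC verification entirely), the records can be read with only the standard library:

```python
import struct

def iter_tfrecords(path):
    """Minimal TFRecord reader (a sketch; CRC checks are skipped).

    Each record on disk is framed as: a little-endian uint64 length,
    a uint32 masked CRC of the length, `length` bytes of payload (a
    serialized Event proto for tfevents files), and a uint32 masked
    CRC of the payload.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return  # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)              # length CRC (ignored in this sketch)
            yield f.read(length)   # raw serialized proto bytes
            f.read(4)              # payload CRC (ignored in this sketch)
```

The yielded byte strings would still need to be parsed with the Event proto's FromString, so this only replaces the iteration half of summary_iterator.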

wchargin commented 4 years ago

Can’t know for sure without seeing the event file, but it looks like what’s happening is that e.summary.value[0] is raising on the first iteration, where e.summary.value is empty, but it would have actually worked on the second iteration if the code had gotten there. It’s expected that e.summary.value.tag should raise, because value is a repeated field, represented as a Python sequence (list-like object). Using e.summary.value[i].tag should work as long as the index i is within bounds.

Off the top of my head (untested), try something like the following?

from tensorflow.python.summary.summary_iterator import summary_iterator

si = summary_iterator(
    "snowy/snowy1/events.out.tfevents.1577067052.snowy.1378.5.v2")

count = 0
for event in si:
    for value in event.summary.value:
        count += 1
        print(str(count) + ': ' + str(value))

Please note that this is all doubly unsupported: tensorflow.python.* is a private namespace, and summary_iterator is deprecated. There aren’t any officially supported ways to read the event files by hand. Also, this is really more of a question about how to use the Python APIs for protocol buffers. I’m happy to provide some ad hoc support where it’s feasible, but we don’t promise to maintain forward compatibility such that any scripts that you write will necessarily work with data written in the future.

sumanthratna commented 4 years ago

Thanks for the help, your script worked. Here's what I'm running:

import tensorflow as tf
from tensorflow.python.summary.summary_iterator import summary_iterator

si = summary_iterator(
    "snowy/snowy1/events.out.tfevents.1577067052.snowy.1378.5.v2")

count = 0
for event in si:
    for value in event.summary.value:
        count += 1
        data = tf.io.decode_raw(value.metadata.plugin_data.content, tf.float64)
        tf.print(data)

and I'm getting tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to DecodeRaw has length 989 that is not a multiple of 8, the size of double [Op:DecodeRaw]
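The error is the fixed-width constraint at work: decode_raw reinterprets a byte buffer as packed values, so the buffer length must be an exact multiple of the element size (8 bytes for float64). A stdlib-only sketch of the same constraint:

```python
import struct

# Pack two float64 values: 16 bytes, so unpacking as doubles works.
buf = struct.pack("<2d", 1.5, -2.0)
print(struct.unpack("<2d", buf))  # (1.5, -2.0)

# The 989-byte content above fails the same divisibility check:
print(989 % 8)  # 5, so the bytes cannot be reinterpreted as float64s
```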

Changing it to data = tf.io.decode_raw(value.metadata.plugin_data.content, tf.uint8) results in [18 218 7 ... 121 110 99] which is great but my data includes decimals.

Changing it to data = tf.compat.as_str_any(value.metadata.plugin_data.content) results in UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 1: invalid continuation byte.
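That decode failure is consistent with the bytes shown earlier: the uint8 view starts [18 218 7 ...], i.e. 0x12 0xda 0x07, and 0xda opens a two-byte UTF-8 sequence whose next byte is not a valid continuation byte. A stdlib-only check of those first three bytes:

```python
# 0x12 is plain ASCII, but 0xda begins a two-byte UTF-8 sequence and
# 0x07 is not a continuation byte (10xxxxxx), so decoding fails.
try:
    b"\x12\xda\x07".decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte
```

In other words, the content is binary protobuf data, not text, so no string decoding will succeed.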

Here's the log file in case you need it to help me: https://gofile.io/?c=kYoD4c. Thanks again!

wchargin commented 4 years ago

The decode_raw op is used to deserialize a sequence of packed values, like this:

>>> tf.io.decode_raw(b"\x34\x12\x78\x56", out_type="int16").numpy()
array([ 4660, 22136], dtype=int16)
>>> [0x1234, 0x5678]
[4660, 22136]
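The same reinterpretation can be reproduced without TensorFlow using the standard library's struct module (a sketch; assumes a little-endian byte order, matching the example above):

```python
import struct

# "<2h" = two little-endian signed 16-bit integers, the stdlib
# analogue of decode_raw(..., out_type="int16") above.
print(struct.unpack("<2h", b"\x34\x12\x78\x56"))  # (4660, 22136)
```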

But the contents of an hparams metadata.plugin_data.content are an encoded HParamsPluginData protocol buffer, which should be deserialized with HParamsPluginData.FromString(bytestring). Something like:

import tensorflow as tf
from tensorflow.python.summary.summary_iterator import summary_iterator
from tensorboard.plugins.hparams import plugin_data_pb2

si = summary_iterator(
    "snowy/snowy1/events.out.tfevents.1577067052.snowy.1378.5.v2"
)

count = 0
for event in si:
    for value in event.summary.value:
        count += 1
        proto_bytes = value.metadata.plugin_data.content
        plugin_data = plugin_data_pb2.HParamsPluginData.FromString(proto_bytes)
        if plugin_data.HasField("experiment"):
            print(
                "Got experiment metadata with %d hparams and %d metrics"
                % (
                    len(plugin_data.experiment.hparam_infos),
                    len(plugin_data.experiment.metric_infos),
                ),
            )
        elif plugin_data.HasField("session_start_info"):
            print(
                "Got session start info with concrete hparam values: %r"
                % (dict(plugin_data.session_start_info.hparams),)
            )

(Again, a quick check shows that this appears to work with latest tf-nightly on my machine, but we don’t officially support these reading APIs.)

sumanthratna commented 4 years ago

Thanks again for the help, your script does work. It turns out that the hparams data isn't just stored in the "main" file; in each run log, there's data that says which hparams were used (I didn't know this).

Also, to be honest, I don't care about compatibility in the future because I'm at a point where I just want the data.

Thanks for all your help and patience with this and other issues.

wchargin commented 4 years ago

Yep, exactly. In a typical hparams logdir structure, the top-level directory has the experiment description, and each run has its own events file with the concrete hparams values used for that run.
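As an illustration of that layout, here is a stdlib-only sketch (the helper name find_event_files is made up for this example) that maps each directory under a logdir to the event files it contains, so the top-level experiment file and the per-run files can be processed separately:

```python
import os

def find_event_files(logdir):
    """Map each directory under logdir (relative path) to its tfevents
    files. In the usual hparams layout, the top-level directory holds
    the experiment description and each run subdirectory holds the
    concrete hparam values for that run.
    """
    runs = {}
    for root, _dirs, files in os.walk(logdir):
        events = sorted(f for f in files if "tfevents" in f)
        if events:
            runs[os.path.relpath(root, logdir)] = [
                os.path.join(root, f) for f in events
            ]
    return runs
```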

Also, to be honest, I don't care about compatibility in the future because I'm at a point where I just want the data.

Yep, totally reasonable. :-)

Thanks for all your help and patience with this and other issues.

My pleasure.

sumanthratna commented 4 years ago

This worked great until a few days ago (I think). To anybody looking for a way to parse HParams log files, they changed the API:

module 'tensorflow_core._api.v2.io.gfile' has no attribute 'get_filesystem'

JimAva commented 4 years ago

Hi there - I used the code mentioned here by @wchargin and am able to export all the HParams hyperparameters from the "session_start_info" but can't find any of the actual 'experiment' results. Basically, using the code provided by @wchargin and inserting an event file name from a 'run-???' folder into the 'si' statement, the 'if' statement below has not been true when I've run it against over 1000 event files. Any suggestion would be greatly appreciated!

    if plugin_data.HasField("experiment"):
        print(
            "Got experiment metadata with %d hparams and %d metrics"
            % (
                len(plugin_data.experiment.hparam_infos),
                len(plugin_data.experiment.metric_infos),
            ),
        )
wchargin commented 4 years ago

@sumanthratna: I’m not sure off the top of my head why that might be, sorry. The summary_iterator module still exists at head.

wchargin commented 4 years ago

@JimAva: The top-level experiment metadata will only be set if you used the hp.hparams_config summary function, and will appear in the logdir to which that summary was written. Typically, this is a logdir one level above all the runs, so if your directory structure includes runs like logs/mnist/lr=1,fc=2 then logs/mnist is probably the directory that contains the relevant events file.

JimAva commented 4 years ago

Thank you @wchargin - another question: is it possible to 'only' generate/log the HParams hyperparameters & metrics? I have thousands of hyperparameter combinations that I'd like to run and need to speed the process up.

JimAva commented 4 years ago

I got it figured out. I was logging too much stuff, so I changed these parameters to False: write_graph=False, write_grads=False, write_images=False.

j3soon commented 2 years ago

I also had a hard time parsing hparams events. I wrote a small Python package recently (tbparse) that parses the hparams events for you and stores them in a pandas DataFrame for later use:

from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
hp = reader.hparams
print(hp)

I also wrote some documentation that provides examples for different use cases.