tensorflow / tensorboard

TensorFlow's Visualization Toolkit

database exported by --db_import doesn't include all summaries shown by --logdir #1558

Open ajbouh opened 5 years ago

ajbouh commented 5 years ago

When I export a database using --db_import and then start a tensorboard instance with --db (pointed at the same file), I don't see all the same tensor summaries that I see when I run with --logdir directly.

Running SQL queries with the sqlite3 command line tool on macOS seems to indicate that the expected summaries aren't ever written to the database file.
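For reference, the flows being compared look roughly like this (a sketch with placeholder paths; the sqlite: URI form for --db is an assumption about the experimental flag syntax and may vary by version):

```shell
# Import event files into a SQLite database (experimental):
tensorboard --logdir /path/to/logs --db_import --db sqlite:/path/to/tb.sqlite

# Serve from the resulting database alone:
tensorboard --db sqlite:/path/to/tb.sqlite

# Baseline for comparison, reading the event files directly:
tensorboard --logdir /path/to/logs
```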

nfelt commented 5 years ago

Hi @ajbouh, thanks for trying out the experimental --db_import option. It's not yet a full replacement for the regular --logdir mode, so there may be some discrepancies like what you observed.

Just to ensure I understand, is the issue that you're originally 1) running tensorboard --db_import --logdir <path-to-logs> to import data, and then 2) running tensorboard --db <path-to-tmp-db> to view it again, and getting different data between 1 and 2? Or is it that 1 and 2 are the same, but they differ from 3) running tensorboard --logdir <path-to-logs> with no DB-related flags?

If the latter, this is expected - for example, image and audio summaries are not yet correctly handled by DB mode (and possibly others).

If the former or if it's behaving in some other unexpected way, could you elaborate on which summaries aren't appearing? E.g. which dashboards are affected?

ajbouh commented 5 years ago

I didn't look at the dashboard during the import process. That is, I only compared 2 and 3. My observation was that the scalars, distributions, and histograms views were all missing data. Essentially no types of data were faithfully imported to the database. I confirmed this by directly inspecting the database itself.
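Concretely, I checked with queries along these lines (a sketch; table names are assumed from the experimental schema and may differ by version):

```shell
# List the tables the importer created, then spot-check runs and tags:
sqlite3 /path/to/tb.sqlite ".tables"
sqlite3 /path/to/tb.sqlite "SELECT * FROM Runs;"
sqlite3 /path/to/tb.sqlite "SELECT * FROM Tags LIMIT 20;"
```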

nfelt commented 5 years ago

The scalar plugin should work, and I'd have thought distributions/histograms would as well. Do you happen to have any minimal reproduction you could share? E.g. an event file (if the data is okay to share) or at least a log of the exact command lines you ran?

bersbersbers commented 5 years ago

Let me add some related observations to this issue. I compared 1) to 3):

  1. I noticed that the order of runs was different. For example, I have runs "2018-04-20 StructureValidatation Old" and "2018-04-20 StructureValidatation New". In 3), "Old" would be shown before "New", probably because I ran "Old" first, hinting at date ordering. In 1), however, "New" would be shown before "Old", hinting at alphabetical ordering. (And with the order, the line colors of the runs change, too.)

  2. I did not notice significant differences in the imported data itself (scalar metrics only). However, 3) seems to display the data with a variable step size (that is, batch-level metrics are shown only every 100-300 steps or so), while 1) seems to display every single step.

  3. I did notice a huge performance difference in the web interface when running 1) while training another run, compared to 3). This may be related to updating the database and/or the much finer granularity of the data.

I have about 2.5 GB of event files for about 40 runs.

nfelt commented 5 years ago

@bersbersbers

  1. Yes, this is known. Right now it timestamps imported runs by the import time, so it's essentially arbitrary. Before this is released for general consumption we'll make it reflect the actual event file initiation time.

  2. Yes, also known - there's no sampling with --db_import right now (see the note after this list).

  3. Can you tell me more about the performance impact you saw? I assume you mean it was slower with 1), i.e. with --db_import, but how much slower and in which parts? This is a bit surprising to me, since I wouldn't generally expect training another run to make a difference when there are already gigabytes worth of data to load. I'd have thought starting up TensorBoard against the 2.5 GB would take roughly the same time whether or not a new run is being written at the same time.
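For context on point 2: in regular --logdir mode the scalars dashboard reads from a reservoir-sampled subset of points per tag (on the order of 1000 by default), which is why 3) shows coarser step granularity. If your TensorBoard version has the --samples_per_plugin flag, the sample size can be raised; a rough sketch (the value is just an example):

```shell
# Keep far more scalar points per tag in --logdir mode
# (value is illustrative; some versions treat scalars=0 as "keep all points"):
tensorboard --logdir /path/to/logs --samples_per_plugin "scalars=100000"
```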

bersbersbers commented 5 years ago

@nfelt good to hear that 1 and 2 are known.

  3. Yes, --db_import is slower. But it has nothing to do with training in parallel: that training has since finished, and it is still slower.

To add some information, this is how I started my current two instances (in the same folder): [screenshot of the two launch commands]. Both are idle with the browser closed, so they should both have finished indexing.

Opening localhost:6006 (with --db_import) and localhost:6011 (without --db_import) at the same time, it takes about a second for 6011 to display all curves. Switching between curves is fast, too (less than 1 s to hide/show individual lines, etc.). Peak CPU goes to about 3%, which corresponds to one of my 32 hyperthreaded cores, and it drops to zero pretty quickly.

With 6006 (with --db_import), the GUI itself is equally fast (that is, initial loading of localhost:6006, navigating, selecting runs, filtering metrics, etc.). What takes a while, however, is loading the curves. Sometimes I see the spinner running in most of the charts for minutes before seeing any curve, and then the curves appear one by one, in very slow succession. Interestingly, the spinners appear only in those charts that will eventually show some data for the selected runs; if none of the selected runs contains any data for a chart, that chart is shown empty immediately, without a spinner. Eventually all charts are filled with data, but it takes a lot of patience.

Peak CPU is as high as 10 cores, sometimes 16, and even after closing the browser window it does not drop to zero for minutes or tens of minutes.
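If it helps to isolate the slow part, the per-series fetch can be timed directly against each instance; a sketch assuming the standard scalars plugin data route, with run and tag as placeholders:

```shell
# Time one scalar series fetch from the --logdir instance vs. the DB instance:
time curl -s 'http://localhost:6011/data/plugin/scalars/scalars?run=my_run&tag=loss' > /dev/null
time curl -s 'http://localhost:6006/data/plugin/scalars/scalars?run=my_run&tag=loss' > /dev/null
```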

nfelt commented 5 years ago

@bersbersbers Thanks for the additional detail! I'm guessing this might be related to returning very high numbers of data points for each plot in --db_import mode, because there isn't any sampling implemented yet. Do you know how many steps of data these curves have?

bersbersbers commented 5 years ago

@nfelt I believe so, too. I write out batch-level statistics, and the last step is labeled 300k in most runs. I can't be certain of the exact count, because the labels for adjacent points are identical, but if --db_import does no sampling, then that should be the number of imported steps, too.
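For reference, a rough way to confirm the imported step count straight from the database; a sketch that assumes the experimental schema's Tensors table and its step column (names may differ across versions):

```shell
# Count stored points and the highest recorded step:
sqlite3 /path/to/tb.sqlite "SELECT COUNT(*), MAX(step) FROM Tensors;"
```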