mila-iqia / blocks

A Theano framework for building and training neural networks

Line delimited JSON for logs #927

Open bartvm opened 8 years ago

bartvm commented 8 years ago

So I've been frustrated with the current state of logging in Blocks for a while. Pickled Python dictionaries have problematic memory requirements and aren't ideal for analysis given all the issues that come with pickling. The SQLite backend is not very user-friendly and doesn't do well with non-tabular logs, and is not straightforward to analyze either if you don't know SQL.

I've been trying to find an alternative that addresses these issues and meets the following requirements:

Given these requirements, I want to propose storing logs in newline delimited JSON from now on, and to write a small toolbox to deal with this format. The advantages I imagine are as follows:

My proposal would be to write a mini-framework using this logging approach that is separate from Blocks, and then write Blocks extensions that use it. Basic functionality would be:

I'd love to know if people think this would meet all their logging needs. Logging is something that everyone ends up having to do, regardless of the framework they use, and I think it would be nice to have a clean and simple solution for it that is widely supported and used in the lab.

The main shortcoming of the format I can think of is that it doesn't allow storing any metadata about the model, e.g. the learning rate, momentum velocity, etc. This makes it harder to do things like "plot minimum validation error vs. learning rate". Perhaps we could store this information in a separate JSON file? Or maybe it shouldn't be part of this mini-framework at all?
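For concreteness, writing a log in this format might look like the following (a minimal sketch using only the Python standard library; the file names and keys are illustrative, not part of any proposed API):

import json

# Hypothetical setup: static metadata goes into a separate model.config
# file, per-iteration entries into model.log, one JSON object per line.
config = {'learning_rate': 0.01, 'momentum': 0.9}
with open('model.config', 'w') as f:
    json.dump(config, f)

with open('model.log', 'w') as log:
    for iteration in range(3):
        entry = {'iteration': iteration, 'loss': 1.0 / (iteration + 1)}
        log.write(json.dumps(entry) + '\n')  # newline-delimited JSON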

rizar commented 8 years ago

Sorry for the late reply!

Before giving any meaningful comments, I would like to ask the following question: do you plan to keep random access possible for this JSON-based log? I do not immediately see how this can be done, whereas for the in-memory and SQLite backends that we currently have, this is trivially supported.

bartvm commented 8 years ago

Although I put this as a Blocks issue, because I can't think of a better place, this isn't really a drop-in replacement for either of the logging backends, and the proposal is very much supposed to be framework-agnostic. Blocks' use of logging is quite specific (since extensions use it to communicate), whereas here I'm aiming more for a minimalist "standard" that harmonizes the hacky ways people currently save and analyze logs. Needing to unpickle a Python object and then load it into a Pandas dataframe just to do rudimentary analysis is overkill. Some people instead resort to grepping through the printed log statements, which is equally unsatisfactory. Having a standard logging format used throughout the lab would allow us to share simple data analysis and plotting scripts, live plotting servers, etc., regardless of the framework people use.

So, the question here is more about what format to use when storing data to disk and sending it over the network so that it can easily be consumed and analyzed. Random access isn't really my concern then. To make this work in the context of Blocks, you'd probably have to combine an in-memory or SQLite backend with an extension that adheres to this specification for saving the logs to disk/sending them over the network.

I understand that it isn't entirely satisfactory from Blocks' perspective, because we now have a backend which is separate from the serialization, but I personally can't think of a solution that easily supports random access, streaming to disk, and pushing over the network, while also producing files that are universally easy to analyze and interpret.

rizar commented 8 years ago

Okay, I guess we could support access to only the last few iterations in the respective Blocks wrapper.

More questions:

bartvm commented 8 years ago

Yes, iteration number would just be a key in the JSON log. That makes the format entirely agnostic to things like iterations or epochs. Tools built on top of this format would just use the default convention that one entry is one datapoint, with the option of using any key in the log as a time measure instead.
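For instance, loading such a log for analysis could be a one-liner (a hypothetical session; the use of pandas and the model.log name are assumptions for illustration):

import pandas as pd

# Each line of the log is one record; any column (e.g. "iteration")
# can serve as the time axis.
log = pd.read_json('model.log', lines=True)
log.plot(x='iteration', y='loss')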

Although I understand the desire to store the state, I would not want to store it in the same file:

  • It means that parsing suddenly becomes more complicated, e.g. if we store the status as a JSON object and it contains some of the same keys as the log entries, parsing the file with jq or something would return strange things. All tools and scripts would have to be aware of this custom "header" in the file.
  • The entire UNIX philosophy goes out the window, because commands like tail, cat, grep, etc. would produce non-conforming log files (the status part might be missing).
  • Logs can be assumed to be immutable; the status usually isn't.

I don't see a problem with having two files e.g. model.log and model.config. Then you can still do things like

for model in *.config
do
  learning_rate=$(jq .learning_rate < "$model")
  loss=$(jq -s 'min_by(.loss).loss' < "${model%.*}.log")
  echo "Learning rate $learning_rate had loss $loss"
done

And do all the processing you want that way.

rizar commented 8 years ago

All right, let's assume for now that you convinced me that the status should go into a separate file.

My next question is whether the creation of a new standalone repository for this functionality is justified. Yes, the UNIX philosophy prescribes that a tool should do one thing and do it well. But there are also pragmatic considerations. The full software stack of MILA currently includes Blocks, Theano, Fuel, and in practice often also Blocks-extras. Now you also want to add Platoon and a logging framework to it. Considering that it is the cutting-edge version that is always used for research, this means that one has to regularly do git pull for 6 (!) repositories to keep one's dependencies up-to-date. And then there is a change in Platoon which makes it incompatible with the cutting-edge Blocks, and the user has to revert to a known compatible version, and so on and so forth.

That being said, my opinion is that creating a new repository which provides tools on top of Theano should be well justified. The scale of the project might be such a reason. For instance, Fuel is indeed quite a big project, with a potentially huge codebase, since it has to contain code for a lot of datasets. But even considering that, I am really not sure that we won by moving it out of Blocks. It costs us so much to maintain two sets of documentation, two testing configurations, to review two PRs simultaneously, etc. And at the end of the day, PyLearn2 never switched to it...

So I think it would be very practical to develop these new mini-frameworks in the Blocks repository. We should keep an eye on the dependency graph, of course.


dwf commented 8 years ago

I don't think pylearn2 not switching to Fuel should be taken as evidence of the spinoff's failure. I've definitely seen people using Fuel without Blocks.


bartvm commented 8 years ago

I think that consideration shouldn't just be the size of the codebase; a bigger consideration is how large and changeable the public API is. Take picklable-itertools as an example, which has worked just fine as a separate package because by its very nature it has a completely stable API: function signatures and return values simply won't change, so there's little chance things will break. For this potential logging framework (and Platoon) the API would be very minimal, i.e. on the order of 2-3 methods (preferably as simple as Log.__init__ and Log.log).
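As a sketch of how minimal such an API could be (an illustration of the idea, not mimir's or Blocks' actual interface):

import json

class Log(object):
    # Hypothetical two-method logger: construct it, then call log().
    def __init__(self, filename):
        self.handle = open(filename, 'a')

    def log(self, entry):
        # One JSON object per line, flushed immediately
        self.handle.write(json.dumps(entry) + '\n')
        self.handle.flush()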

In the long term, if we end up switching from Theano to TensorFlow for example, having Fuel separate will be a big win. Similarly, this logging mini-framework would have no dependencies on Theano (or Blocks) whatsoever and could be more future-proof as well as more widely applicable as a separate package.

rizar commented 8 years ago

Picklable-itertools has an extremely clear purpose. I don't think that a general logging framework or a general parallelization framework will converge nearly as quickly.

A large part of Fuel will not really be useful for TensorFlow, since data processing is part of the computation graph there. Also, if Fuel as it is now were in the Blocks repository but did not depend on Theano, we would have all the pros and none of the cons that we have now.

In short, I am really tired of this DJ experience. Btw, look at TensorFlow: they have logging, plotting and data preprocessing in the same framework as the computation graph and stuff. I assume they just found it more practical.


bartvm commented 8 years ago

Within a company it's easier to create a monolithic system, or an intricate web of dependencies, because you have full-time software engineers responsible for resolving dependency issues, and your software is never used as part of a wider software stack that it needs to play nice with.

For open source software I think the ideal should be different: separate codebases for separate projects are important to enforce modularity, avoid coupling, and simplify use. For example, I personally don't like having to install two dozen dependencies for Blocks if I just want one particular, tiny piece of functionality.

At Google everything is part of a single, supported software stack that each employee is expected to run, so our approaches aren't very comparable. As an alternative comparison, have a look at all the different repositories under the Torch umbrella.

I think the issues you highlight are really issues with Python's package management, the way Blocks and Fuel are managed at the lab, our lack of a unified testing infrastructure, and GitHub's inability to handle pull requests that span multiple repositories. I'd be much happier trying to find practical solutions to those problems than throwing everything into a single, monolithic codebase, which is likely to rot eventually: interdependencies are easily sneaked in, the list of external dependencies grows with each new piece of functionality, it becomes harder to swap out or replace particular parts of the stack as time goes on, and the number of issues/PRs per day becomes overwhelming because they are not separated by functionality.

If the management of multiple repositories really becomes too much of a burden, I'd go down the route of writing a few helper scripts instead, which allow us to make pull requests to Blocks and Fuel simultaneously and ensure that their test suites are then run against each other. Likewise, we should maybe find a way of not needing to duplicate .travis.yml and everything that comes with it for each repo.

dmitriy-serdyuk commented 8 years ago

Concerning practical things: I took a big log from our speech experiment and converted it into the JSON Lines format, here is a gist. Surprisingly, JSON takes a bit less space than cPickle, and gzipped JSON is much smaller:

-rw-r--r--  1 dima  staff   109M Dec  9 15:05 main_log.json
-rw-r--r--  1 dima  staff    17M Dec  9 15:03 main_log.json.gz
-rwxr-xr-x  1 dima  staff   122M Dec  9 14:08 main_log.zip

dmitriy-serdyuk commented 8 years ago

It might be misleading: the log is not actually zipped despite the .zip extension, it's just a cPickle file.

bartvm commented 8 years ago

Thanks, the size of the compressed JSON files is very encouraging!

rizar commented 8 years ago

Indeed! Dima, can you please check whether compression was used for that particular pickle?

Concerning one repo vs. many: I think we should have a meeting next week and discuss it in person. It is a crucial question. Also, I remember Bart expressing dissatisfaction that we only have administrative CCW meetings and no design meetings: this might just as well be a reason to have the first one. If you agree, propose a time and date.


dmitriy-serdyuk commented 8 years ago

You are right, it was saved with protocol 0. Here are the sizes with protocols 1 and 2, and gzipped:

-rw-r--r--  1 dima  staff    77M Dec  9 16:55 main_log_1.zip
-rw-r--r--  1 dima  staff    73M Dec  9 16:57 main_log_2.zip
-rwxr-xr-x  1 dima  staff    26M Dec  9 14:08 main_log.zip.gz

bartvm commented 8 years ago

I guess that seems more realistic. I'm slightly curious, though, about the role of NumPy arrays here; if 95% of the log data is array data, the actual numbers don't tell you much about the differences between the data formats. That said, a 35% decrease in size is still pretty good.

Also, I guess this is another argument for dropping pickling: it's clear that even experienced users can forget to specify the right protocol and end up with unnecessarily large files.
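To illustrate the protocol pitfall (a toy measurement; the exact numbers vary, but protocol 0 is consistently much larger than the binary protocols):

import pickle

entry = {'iteration': 0, 'train_loss': 0.5, 'valid_loss': 0.7}
print(len(pickle.dumps(entry, 0)))                        # ASCII protocol, verbose
print(len(pickle.dumps(entry, pickle.HIGHEST_PROTOCOL)))  # binary, much smaller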

I'll send an e-mail about a design meeting in a bit.

dmitriy-serdyuk commented 8 years ago

There are no large NumPy arrays in this log. I think it is mostly field names and other meta-information that consume the space.


bartvm commented 8 years ago

First code here: https://github.com/bartvm/mimir

Once it's more stable I can merge it into Blocks, but a separate repo was easier to start with (since it contains C and Cython code).

bartvm commented 8 years ago

I implemented all the important functionality, and am quite happy with it. In the end it does quite a bit of nice magic (like using threads to respond to clients who want data for live plotting or analysis, and using two buffers to make sure that writing the JSON data to disk is both efficient and well-compressed while still being relatively fail-safe). Integrating it into Blocks might not be entirely straightforward because of the nightmares of pickling file objects (and in this case, text wrappers over buffered streams over Cython extension types that hold file handles...).
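The buffering idea, roughly, is to batch entries in memory before compressing and writing them (a generic single-buffer sketch of the technique, not mimir's actual implementation):

import gzip
import json

class BufferedJSONLWriter(object):
    def __init__(self, filename, buffer_size=128):
        self.handle = gzip.open(filename, 'ab')  # append mode
        self.buffer = []
        self.buffer_size = buffer_size

    def write(self, entry):
        self.buffer.append(json.dumps(entry))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Compressing a batch at once compresses better than line-by-line
        self.handle.write(('\n'.join(self.buffer) + '\n').encode('utf-8'))
        self.buffer = []

    def close(self):
        self.flush()
        self.handle.close()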

dmitriy-serdyuk commented 8 years ago

You are using Bokeh 0.11 inside a notebook for live plotting, right? It seems like a very nice solution to mila-udem/blocks-extras#35.

dwf commented 8 years ago

I think the IPython notebook is a pretty heavy dependency for that, and if I understand correctly it'd be very much an "assembly required" affair. I think a simple Flask app is probably the solution to mila-udem/blocks-extras#35, but if we can use gzlog for the storage backend that'd be great.


bartvm commented 8 years ago

Yes, I think I use the latest Bokeh. IPython Notebook is not a dependency, though; there's also the option of using the Bokeh server.

That said, sending/receiving data is done entirely through ZMQ, so you can use whatever plotting framework you want.

dwf commented 8 years ago

Err, the "Bokeh server" (as in bokeh-server executable) doesn't exist anymore so probably you're using 0.10.


bartvm commented 8 years ago

There's the bokeh serve command. No dashboard, but it still serves live plots if you go to http://localhost:5006/?bokeh-session-id=<session-id>

http://bokeh.pydata.org/en/latest/docs/user_guide/server.html#specifying-output-server

rizar commented 8 years ago

Looks nice!

As for Blocks integration, can we not create a new Logger object upon unpickling?

I am also curious whether there is support for multiple processes writing to the same file. I recently had a use case where this was necessary.

dmitriy-serdyuk commented 8 years ago

@rizar, I think we can do that: the logger appends to the file, so it shouldn't be a problem. A small trick is needed to fetch the last row from the log (it's needed by some extensions), though. I'm trying to implement it now and I'll see if it works fine.

dwf commented 8 years ago

Atomicity of the appends is a concern, isn't it? From what Bart said, it sounded like mimir is already in a place where it's brokering multiple client consumers; I don't know if it can do the same for producers.


bartvm commented 8 years ago

Writing to the same file from multiple processes isn't trivial. You'd need to lock the files to make the appends atomic, like @dwf said, and I think that would be way too slow for many workers. Since ZMQ is already a dependency we could consider implementing something like this, but it would mean designating a single master process to collect the log entries from the workers.

I think this is overkill though. Since the log is just line-delimited JSON, why not have each process write to a separate log and then do zcat *.jsonl.gz | gzip -c > full_log.jsonl.gz? Your log entries are going to be out of order, but you could just add a timestamp and pass them through jq -s -c 'sort_by(.timestamp)[]' if you need to.

dwf commented 8 years ago

:+1:

rizar commented 8 years ago

All right, that sounds reasonable.

Sorry for being too lazy to read the code, but can you please explain how consumers are notified that a new record has been appended to the log?


bartvm commented 8 years ago

It's simply a ZMQ publish-subscribe pair of sockets (PUB-SUB). The log simply pushes the log entries as JSON strings over the socket, and all subscribed clients receive them.

There's also a "persistent" client-server interface which is slightly more complex. In this case the server starts a separate thread. Each log entry is pushed over the PUB socket to the clients, but also to the other thread. This separate thread has a ROUTER socket. Now, if a client wants to join this happens:

  1. The client asks the server thread for data (DEALER-ROUTER)
  2. The server thread replies by sending this client each log entry it has stored in memory (ROUTER-DEALER)
  3. The client starts listening for new entries on the SUB socket.

Meanwhile, the main thread sends each log entry to the subscribed clients (PUB) as well as to the second separate thread where it will be stored for potential future clients to receive.
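A live-plotting consumer under this scheme could then be as simple as the following (a sketch using pyzmq; the endpoint and entry keys are assumptions, not mimir's documented defaults):

import json
import zmq

context = zmq.Context()
sub = context.socket(zmq.SUB)
sub.connect('tcp://localhost:5557')       # hypothetical publisher endpoint
sub.setsockopt_string(zmq.SUBSCRIBE, '')  # subscribe to all log entries

while True:
    entry = json.loads(sub.recv_string())  # one JSON log entry per message
    print(entry.get('iteration'), entry.get('loss'))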

bartvm commented 8 years ago

So although catting a bunch of log files together is fine for analysis, it's too bad that you lose the ability to live-plot, print to a single output, etc., so I ended up writing a simple log for multi-GPU training anyway: https://github.com/bartvm/mimir/commit/39b940129a8fc03a53c9788e41955e449998f2e1

Still need to write docstrings, but the usage is really simple:

# Main process
from mimir import ServerLogger

# Accepts the same arguments as a normal Logger object;
# blocks until all workers are done
logger = ServerLogger(filename='combined_log.jsonl.gz', stream=True)

# A worker
from mimir import RemoteLogger

# A remote logger only needs connection details, localhost:5555 by default
logger = RemoteLogger()
# Log away
logger.log({'iteration': 0})

The server logger receives the log entries, adds a unique ID that identifies each remote logger, and then handles them as it usually would (streams them, writes them to a log file, etc.).

Some current shortcomings: