mila-iqia / blocks

A Theano framework for building and training neural networks

Implement database backend for log #470

Open bartvm opened 9 years ago

bartvm commented 9 years ago

This idea has been floating around for a while. It was recently brought up again as a way to speed up pickling in https://github.com/bartvm/blocks/issues/442#issuecomment-78056128, but after inspecting the pickle files the overhead of the current approach seems negligible (with 80,000 entries it loads in 0.12 seconds and takes around 3MB of space in a 10+GB pickle file).

The main advantage of using SQLite, for me at least, would be the ability to analyze and plot during training. I can't unpickle any of the checkpoints, because they need the GPU (which is being used for training), and the checkpoints take up to a minute to unpickle anyway.

SQLite databases should scale okay for our needs, I think. It would be nice if a user could use the same database for multiple experiments (perhaps in different tables), so that you can collect the results from a series of experiments in a single file and plot them together easily.
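For concreteness, a minimal sketch of what such a backend could look like, assuming the log keeps its current (iterations_done, key, value) structure; the table and column names here are made up:

```python
import sqlite3

# Illustrative schema: one row per (iterations_done, key) pair,
# mirroring the structure of the current log.
conn = sqlite3.connect("experiment.db")
conn.execute("""CREATE TABLE IF NOT EXISTS log
                (iterations_done INTEGER, key TEXT, value,
                 PRIMARY KEY (iterations_done, key))""")

def log_value(iterations_done, key, value):
    with conn:  # commits on success, so readers see a consistent file
        conn.execute("INSERT OR REPLACE INTO log VALUES (?, ?, ?)",
                     (iterations_done, key, value))

# A separate process can open the same file and plot mid-training:
rows = conn.execute("SELECT iterations_done, value FROM log "
                    "WHERE key = 'train_cost' ORDER BY iterations_done")
```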

dwf commented 9 years ago

Would an SQL database take away from the flexibility of the log (which seems to be an arbitrary Python object that people can just add attributes to at will)? If so, another option would be something like MongoDB.

bartvm commented 9 years ago

Technically, we could pickle any Python object to a string and save it as a blob. I'm inclined to say we should support both the current backend and the SQLite one though, perhaps making SQLite the default for its advantages when it comes to easy analysis, while keeping the current basic implementation for when users want to log large, arbitrary Python objects, which would be very slow and cumbersome to store in SQLite.
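Something along these lines, presumably (a hedged sketch; `to_db` is a hypothetical helper, not anything in Blocks):

```python
import pickle
import sqlite3

# Fall back to a pickled BLOB for values SQLite cannot store natively.
def to_db(value):
    if isinstance(value, (int, float, str, bytes, type(None))):
        return value  # types SQLite handles directly
    return sqlite3.Binary(pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
```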

I looked at using a bunch of NoSQL databases, but I'm mainly worried about needing a database server up and running (and in that case I don't think we could use it as a default). The major advantage of a server/persistent DB though--and I think this alone makes it worth considering--is that you could technically set up a single server and log all of your experiments to it. You would automatically have a history of the results of every experiment you ever ran in a centralized place, and you could set up the server to receive results sent from e.g. clusters, like with jobman. If we eventually get to implementing hyperparameter searching, the issue of results collection would already be solved.

If we don't intend to use the database as a central place to store all experiments forever, then Redis might be a good NoSQL option too. It's in-memory, so super-fast, and its key-value structure might be a bit more apt than MongoDB's document-oriented structure (although I've never used MongoDB). For long-term storage you might have to dump the data to JSON at the end of the run, though.

If we don't want servers but really want NoSQL, there's BerkeleyDB, but its Python interface seems quite badly documented and is nowhere near as nice as Redis's or MongoDB's. The same goes for SQLite: PyMongo and Redis are much nicer, transparently representing your entries as Python dictionaries instead of making you deal with cursors and such.
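The interface difference in question, roughly (an illustrative sketch; collection and key names are made up, and the pymongo snippet assumes a server is already running locally):

```python
# sqlite3: cursors and positional tuples.
import sqlite3
conn = sqlite3.connect("experiment.db")
cur = conn.execute("SELECT value FROM log "
                   "WHERE iterations_done = ? AND key = ?",
                   (100, "train_cost"))
(value,) = cur.fetchone()

# pymongo: entries go in and come out as plain dictionaries.
from pymongo import MongoClient
log = MongoClient()["experiments"]["log"]
log.insert_one({"iterations_done": 100, "train_cost": 0.42})
entry = log.find_one({"iterations_done": 100})
```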

dwf commented 9 years ago

https://github.com/piskvorky/sqlitedict might be worth a look.

bartvm commented 9 years ago

I actually looked at that, but I'm not sure it suits our needs well. It basically assumes a key-value structure, while we have a key-key-value structure (iterations_done, key, value). From glancing over the code, it seems like it would pickle the whole dictionary for each time step, which wouldn't be very useful. I might be wrong.
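To illustrate the mismatch, based on sqlitedict's documented dict-like interface (names made up):

```python
from sqlitedict import SqliteDict

log = SqliteDict("log.sqlite", autocommit=True)

# sqlitedict stores one pickled value per key, so the natural mapping
# is one whole dict per time step, stored as an opaque blob:
log["100"] = {"train_cost": 0.42, "valid_cost": 0.45}

# The inner keys are therefore invisible to SQL; pulling out one
# channel means unpickling every row anyway.
costs = [entry["train_cost"] for entry in log.values()
         if "train_cost" in entry]
```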

rizar commented 9 years ago

> The main advantage of using SQLite, for me at least, would be the ability to analyze and plot during training. I can't unpickle any of the checkpoints, because they need the GPU (which is being used for training), and the checkpoints take up to a minute to unpickle anyway.

Sorry for the late answer, but use save_separately='log', Luke!
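That is, the Checkpoint extension's save_separately argument, which takes a list of main loop attribute names to dump into their own files, roughly like this:

```python
from blocks.extensions.saveload import Checkpoint

# Save the log into its own file next to the main pickle, so it can be
# unpickled quickly (and without a GPU) while training is still running.
checkpoint = Checkpoint("model.pkl", save_separately=["log"])
```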

rizar commented 9 years ago

In general I think that server-based DBs are preferable for the reasons already outlined above. But is it possible to launch a local user-level Mongo server when I run experiments on a completely isolated cluster?

bartvm commented 9 years ago

Technically I guess it's possible, but it's probably a lot more trouble than it's worth. This, I think, is the main (perhaps only) advantage of SQLite: having a database file that doesn't require any setup. But then it's almost equivalent to a pickled Python object, except that you wouldn't have to wait until the Checkpoint extension is run.

This is actually why save_separately isn't ideal: I only save every few thousand time steps, and I'm too impatient to wait that long before being able to replot things :)

I think these two backends (the pickled Python object and MongoDB) would be a good start; if there turns out to be a real need, we could implement SQLite (or BerkeleyDB) as a third one. People on an isolated cluster, or people who are just getting started, can just use the pickled Python object, which is pretty straightforward. People who are more serious about their experiments can set up MongoDB once, and then they'll have all the power that comes with a proper database solution, e.g. sharing it between users or an entire lab, sending and receiving data over the internet, etc.

dwf commented 9 years ago

@rizar Definitely possible to launch a mongod server without any root privileges; I've done it. The only issue is making sure it's on a server that is network-accessible from the cluster nodes. Usually running on the head node is the best option, though that might make some admins mad.
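Roughly along these lines (hostnames and paths are made up):

```python
# Started once on the head node, no root required, e.g. with:
#   mongod --dbpath ~/mongo_data --port 27017 --fork --logpath ~/mongod.log
from pymongo import MongoClient

# Worker nodes then log over the network; the hostname is illustrative.
log = MongoClient("head-node:27017")["experiments"]["log"]
log.insert_one({"iterations_done": 0, "train_cost": 1.0})
```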

jbornschein commented 9 years ago

Just to mention it here: for quite some years I have used extendable arrays in HDF5 files to store training logs, appending a row for all 'channels' at the end of each epoch. The only constraint was that the datatype and the shape of a logged entity need to stay the same after it has been logged for the first time.

It allows for (a) on-the-fly compression and (b) convenient access (e.g. h5ls on the command line), and you can easily replot whenever an epoch is finished.
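In h5py terms, presumably something like this (a sketch; the dataset names are made up):

```python
import h5py

f = h5py.File("log.h5", "a")

# One extendable, compressed column per logged channel. As noted above,
# the dtype and row shape are fixed once the channel is created.
if "train_cost" not in f:
    f.create_dataset("train_cost", shape=(0,), maxshape=(None,),
                     dtype="f4", chunks=True, compression="gzip")

def append(name, value):
    column = f[name]
    column.resize((column.shape[0] + 1,))
    column[-1] = value
    f.flush()  # so h5ls or a plotting script sees the new row right away

append("train_cost", 0.42)
```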

bartvm commented 9 years ago

Would there be any advantages of HDF5 over SQLite? Compression is nice, but I don't think that's crucial. SQLite is weakly typed, so you won't have that problem, and it's file-based with command line interfaces available as well.

So the way I see it, that still leaves us with three options, although we could potentially implement all three of them in the long term. Did I miss anything? My vote is still for MongoDB as implemented in #476; I think this is the most powerful backend in the end.

- Python object: what we currently have.
- Flat file "database": e.g. SQLite or HDF5.
- Database server: e.g. MongoDB.

rizar commented 9 years ago

I do not think that a universal solution exists, or that one is needed at all. It should be pretty simple to write yet another backend after Bart's cleanup. I would probably use the MongoDB backend if it ends up being user-friendly enough.

bartvm commented 9 years ago

Agreed. It might be worthwhile to develop a basic version of all three simultaneously though, just to make sure that we can keep the interface between all three completely coherent.
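For instance, a shared base class could pin that interface down early (purely hypothetical names; nothing like this exists in Blocks yet):

```python
from abc import ABCMeta, abstractmethod


class LogBackend(object):
    """Hypothetical common interface for the candidate backends."""
    __metaclass__ = ABCMeta  # Blocks supported Python 2 at the time

    @abstractmethod
    def log(self, iterations_done, key, value):
        """Record a single (iterations_done, key, value) entry."""

    @abstractmethod
    def fetch(self, key):
        """Yield (iterations_done, value) pairs for `key`, for plotting."""


# The pickled-object, MongoDB and (later) SQLite backends would each
# subclass this, so extensions stay agnostic of the storage choice.
```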