Factor out datasets into a new project

bartvm commented 9 years ago

This is something suggested by @pbrakel during the tutorial this afternoon, which I think is very much worth considering.

The idea simply is that our data iteration framework is completely independent from the rest of the library so far, and that it's functionality extends far beyond Blocks, and could potentially be used e.g. in Pylearn2 but also other completely unrelated projects which need to train models on data. We could factor datasets out of the repository into a new project.

Although our aims are slightly different, I think it's very much worth considering joining efforts with skdata. That library is focused strongly on downloading and providing an interface to the data (and has support for quite a few!) while we have a stronger focus on iteration and transformation. If we could combine the two, we'd have an excellent framework to build on.

rizar commented 9 years ago

Yep, datasets might be useful outside blocks. But I would first wait until thing settle a bit before creating splitting blocks into pieces. It is much easier to start participation in development when you have to clone only one repo, or only two because we have to start blocks-contrib. I would not add more structural complexity for now.

bartvm commented 9 years ago

Thinking about it a bit more, this would be a great project to undertake with Pylearn2, and I'd be curious to hear what other Pylearn2 stakeholders think (@lamblin, @memimo, @dwf, @vdumoulin). We've been talking about refactoring the datasets framework in Pylearn2, implementing transformers similar to Blocks' data streams, refactoring iterators, etc. It seems silly to duplicate all this work, because we have very similar goals:

Easy ways of interfacing with new datasets
(Chained) iterators for "preprocessing" like caching, whitening, n-gram extraction, etc.
Both frameworks have the pretty unique requirement that datasets and iterators need to be serializable for checkpointing
Both Blocks (#105) and Pylearn2 (dataset-get) have had plans to automate dataset downloading.

In fact, the only fundamental difference I see is that Pylearn2 has some concept of "spaces" describing the semantics of the data throughout the pipeline. If we can factor that out or make it optional, we could make this a real joint undertaking. That will be more productive than Blocks doing its own framework, and Pylearn2 wrapping parts of it, and it will definitely be more efficient than duplicating work.

So my suggestion for a rough roadmap:

Come up with a name for this new framework (that doesn't associate it with Blocks or Pylearn2 directly)
Start a new repository where we take the code of Blocks as a starting point (I might be biased, but I feel that it's a cleaner and more flexible base to start from)
Decide on terminology: Blocks talks about datasets, data streams, iteration schemes, request iterators, and data iterators. Pylearn2 on the other hand talks about datasets, transformers, subset iterators, subset iterators, and dataset iterators.
Port the most important datasets from Pylearn2 to the new framework; we have only two in Blocks right now.
Port batch-wise preprocessors (transformers) from Pylearn2 (whitening, binarizer,
Support picklable_itertools's development, which we should use heavily.

memimo commented 9 years ago

+++ I'm super supportive of having multiple core libraries, each responsible for some non overlapping functionalities. A simplistic scenario similar to what you suggested could be:

1-A new data pipeline library 2-Pylearn2 for models that work on dataset with dense design matrix 3-Blocks for models that work on sequential data. (I'm not that familiar yet, so sorry for simplification if that's not the case.) 4-A new library for training algorithms (But of course this would be too much work now, just an idealistic scenario suggestion.)

bartvm commented 9 years ago

Great!

Blocks has grown quite a bit beyond handling recurrent networks, and we have basic support for e.g. feedforward networks and convolutional networks as well now. In general, the functionality of Blocks and Pylearn2 are becoming more and more overlapping, although with very different approaches; Blocks being more of a lightweight Theano toolkit, while Pylearn2 provides high-level abstractions, a YAML interface, etc.

Factoring out algorithms would be a possible long-term objective, but would require a lot of work to make work very well, so yeah, dream-scenario for the foreseeable future I'm afraid. But I think that having a data pipeline library is very feasible.

vdumoulin commented 9 years ago

I'm in favour of having a library devoted to datasets. For the name, I view that library as a way to provide fuel (data) to machine learning models, so something like Fuel (although apparently there already is a PHP library with that name) would work well, since it's simple yet evocative.

My view of how Blocks, Pylearn2 and Theano fit together is as follows:

Theano is a symbolic numerical computation library. Its building blocks consists of symbolic variables (e.g. scalar, vector, matrix and tensor).
Blocks is a toolkit for building Theano expressions aimed at building machine learning models (a.k.a. computation graphs) easier while still retaining low-level ability to manipulate Theano expressions directly. Its building blocks consists of simple operations on symbolic variables (e.g. affine transformation, convolution, nonlinearity, recurrent applications).
Pylearn2 is a machine learning prototyping library which allows to handle all the boilerplate code that one needs to write in order to start training a model, as well as mix-and-match model parts to form new ideas. Its building blocks are simple models (e.g. fully-connected MLP, convnets, probability distributions).

I don't think Blocks should bother with algorithms, monitoring and models; that should be handled by a Pylearn2-like library.

dwf commented 9 years ago

I disagree on practical grounds more than principle: blocks is already doing a better job of those things in many respects with less code that's better documented. On Feb 12, 2015 11:42 AM, "vdumoulin" notifications@github.com wrote:

I'm in favour of having a library devoted to datasets. For the name, I view that library as a way to provide fuel (data) to machine learning models, so something like Fuel (although apparently there already is a PHP library with that name) would work well, since it's simple yet evocative.

My view of how Blocks, Pylearn2 and Theano fit together is as follows:

Theano is a symbolic numerical computation library. Its building blocks consists of symbolic variables (e.g. scalar, vector, matrix and tensor).

Blocks is a toolkit for building Theano expressions aimed at building machine learning models (a.k.a. computation graphs) easier while still retaining low-level ability to manipulate Theano expressions directly. Its building blocks consists of simple operations on symbolic variables (e.g. affine transformation, convolution, nonlinearity, recurrent applications).

Pylearn2 is a machine learning prototyping library which allows to handle all the boilerplate code that one needs to write in order to start training a model, as well as mix-and-match model parts to form new ideas. Its building blocks are simple models (e.g. fully-connected MLP, convnets, probability distributions).

I don't think Blocks should bother with algorithms, monitoring and models; that should be handled by a Pylearn2-like library.

— Reply to this email directly or view it on GitHub https://github.com/bartvm/blocks/issues/255#issuecomment-74104529.

bartvm commented 9 years ago

I like Fuel! We never bothered to check for Blocks actually, and it turned out later that the name was already taken :( We'll have to put it on PyPI eventually as theano-blocks I guess.

The split you propose (although I think we're talking horizons here, on the short-term I think we have enough work on our hands) is not as clear as it should be for separate packages I think. Mostly because a fully-connected MLP is in effect just a single brick (or collection of bricks).

What I think would be a possible, clearer distinction in the future is:

Theano as you described
Blocks as you described, but also with interfaces to easily build MLPs, convnets, probability distributions, which are just bricks that take a Theano input and provide a Theano output after all.
"Fuel" as a data pipeline
"Pylearn2" with the main loop, extensions and the algorithms (based on the current Blocks implementatinos). This will basically bring together the models from Blocks with the data from "Fuel " to actually perform training, log/plot the results, run on clusters, etc. If you need a Model abstraction for some reason, you could put it here as a minimal wrapper around Blocks's models, but I can't think of many important use cases.

bartvm commented 9 years ago

Closing since we now have Fuel!

mila-iqia / blocks

Factor out datasets into a new project #255