keiserlab / keras-neural-graph-fingerprint

Keras implementation of Neural Graph Fingerprints as proposed by Duvenaud et al., 2015
MIT License

Implement sparse iterator #7

Closed tivaro closed 7 years ago

tivaro commented 7 years ago

Benchmarking on bigger datasets shows that the 0-padding does have a big impact on performance.

Histograms of #bonds and #atoms per molecule confirm that max-padding can create a lot of overhead:

[screenshot, 2016-10-28: histograms of atoms per molecule and bonds per molecule]

The overhead occurs at three levels:

  1. data-preprocessing
  2. data-storage and loading
  3. in the network
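A quick back-of-the-envelope sketch of how much max-padding wastes, using made-up atom counts standing in for the real histogram (the numbers here are purely illustrative):

```python
import numpy as np

# Hypothetical per-molecule atom counts; the real distribution
# comes from the dataset histograms above.
atom_counts = np.array([9, 14, 22, 31, 12, 18, 55, 10, 16, 24])

max_atoms = atom_counts.max()
# Fraction of atom slots that are zero-padding under max-padding:
waste = 1.0 - atom_counts.sum() / (len(atom_counts) * max_atoms)
print(round(waste, 3))
```

With one outlier molecule dominating the max, well over half the tensor can be padding, which matches what the histograms suggest.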

Because preprocessing only has to be done once, I won't focus on its performance.

Sparse matrices will definitely speed up the data-storage and loading process.
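One way to get the storage benefit without a true sparse format is to store each molecule's feature matrix unpadded and only pad on load. A minimal sketch (array names and shapes are illustrative):

```python
import numpy as np
import os
import tempfile

# Each molecule is an (n_atoms, n_features) array; no padding on disk.
mols = [np.ones((5, 4)), np.ones((12, 4)), np.ones((8, 4))]

path = os.path.join(tempfile.mkdtemp(), "mols.npz")
np.savez_compressed(path, *mols)  # one array per molecule

loaded = np.load(path)
sizes = [loaded[k].shape[0] for k in sorted(loaded.files)]
print(sizes)
```

Compression plus the absence of padding keeps the file close to the actual data size.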

The 3rd level (training the network) is the most important, as it accounts for most of the runtime. Because of the many matrix operations that take place on the GPU, the tensors have to be in dense format there.

However, sparse storage can still improve performance. The goal is to pad the tensors per batch: for each batch and each dimension, the largest molecule along that axis determines the size of that dimension for that batch.

Given the histograms, this will already result in a significant speedup. An even bigger gain can be achieved by sorting the tensors along their length dimension and grouping molecules of similar size into batches (at the possible cost of some stochasticity).
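The per-batch padding plus sorted bucketing described above could look roughly like this (function names are my own, not from the repo):

```python
import numpy as np

def pad_batch(mol_feats):
    """Pad a list of (n_atoms, n_features) arrays to the batch max only."""
    max_atoms = max(m.shape[0] for m in mol_feats)
    n_feat = mol_feats[0].shape[1]
    out = np.zeros((len(mol_feats), max_atoms, n_feat))
    for i, m in enumerate(mol_feats):
        out[i, :m.shape[0]] = m
    return out

def sorted_batches(mol_feats, batch_size):
    """Group molecules of similar size so per-batch padding is minimal.
    Sorting trades some shuffling/stochasticity for smaller tensors."""
    order = sorted(range(len(mol_feats)), key=lambda i: mol_feats[i].shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield pad_batch([mol_feats[i] for i in idx])

mols = [np.ones((n, 3)) for n in (7, 2, 9, 3)]
shapes = [b.shape for b in sorted_batches(mols, batch_size=2)]
print(shapes)
```

The small molecules end up in a batch padded to 3 atoms instead of 9, which is exactly where the speedup comes from.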

Luckily, keras.models.Model provides _generator versions of the fit, evaluate and predict functions (fit_generator etc.) that take a generator and allow the batch size to vary between batches.
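The generator those functions expect is just an infinite iterator of (X, y) batches, so the per-batch padding can live inside it. A sketch (the Keras call at the end is shown as a comment; names are illustrative):

```python
import numpy as np

def batch_generator(mol_feats, labels, batch_size):
    """Infinite generator yielding per-batch-padded (X, y) pairs,
    as expected by Keras' fit_generator / predict_generator."""
    n = len(mol_feats)
    while True:
        for start in range(0, n, batch_size):
            batch = mol_feats[start:start + batch_size]
            max_atoms = max(m.shape[0] for m in batch)
            X = np.zeros((len(batch), max_atoms, batch[0].shape[1]))
            for i, m in enumerate(batch):
                X[i, :m.shape[0]] = m
            yield X, np.asarray(labels[start:start + batch_size])

mols = [np.ones((n, 3)) for n in (4, 6, 5)]
gen = batch_generator(mols, [0, 1, 0], batch_size=2)
X1, y1 = next(gen)
X2, y2 = next(gen)
print(X1.shape, X2.shape)

# With a compiled model, training would then be (Keras 1.x API):
# model.fit_generator(gen, samples_per_epoch=len(mols), nb_epoch=10)
```

Note that consecutive batches have different atom dimensions, which is precisely what the _generator functions tolerate and a fixed-shape fit call does not.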

I may have to adjust the layers slightly as well, to support a variable number of atoms.

As for the implementation, scipy sparse matrices unfortunately only support two dimensions.

This means that I will have to implement my own class. Given the prioritisation above, I will focus on fast retrieval and the ability to sort on the length dimension, even at the cost of slower preprocessing.
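A minimal sketch of what such a class might look like: it stores the variable-length atom arrays unpadded, supports sorting by length, and densifies on demand. This is only an illustration of the interface, not the class implemented in this repo:

```python
import numpy as np

class RaggedTensor:
    """Illustrative container for variable-length molecule tensors:
    unpadded storage, fast retrieval, sortable on the length dimension."""

    def __init__(self, arrays):
        self.arrays = list(arrays)  # each array is (n_atoms, n_features)

    def lengths(self):
        return [a.shape[0] for a in self.arrays]

    def sort_by_length(self):
        """Sort molecules by #atoms, so nearby entries batch efficiently."""
        self.arrays.sort(key=lambda a: a.shape[0])
        return self

    def to_dense(self):
        """Zero-pad all stored arrays to the current maximum length."""
        max_len = max(self.lengths())
        out = np.zeros((len(self.arrays), max_len, self.arrays[0].shape[1]))
        for i, a in enumerate(self.arrays):
            out[i, :a.shape[0]] = a
        return out

rt = RaggedTensor([np.ones((5, 2)), np.ones((2, 2)), np.ones((8, 2))])
print(rt.sort_by_length().lengths(), rt.to_dense().shape)
```

Densification is the only expensive step, and it only ever pads to the max of the slice being densified, in line with the prioritisation above.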