lmjohns3 / theanets

Neural network toolkit for Python
http://theanets.rtfd.org
MIT License
328 stars, 73 forks

Support of text input which is a sparse vector for one text object? #14

Status: Closed (closed by byzhang 10 years ago)

byzhang commented 10 years ago

Or can you show me where to extend the code to support it, if not yet?

lmjohns3 commented 10 years ago

As long as you can express your data as a vector, you should be able to use this library to train a model using your data. However, I will point out that there is no support for sparse matrices in this code at the moment, so you'll need to encode your data using dense vectors, even if only a few of the elements of each vector are nonzero.
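To make the dense encoding concrete, here is a minimal sketch using only NumPy. It assumes each text object is stored as parallel lists of feature indices and values (for example, word ids and counts); the helper name `densify` and the toy vocabulary size are hypothetical and not part of theanets.

```python
import numpy as np

def densify(indices, values, dim, dtype=np.float32):
    """Expand one sparse vector (parallel index/value lists) into a dense array.

    `densify` is a hypothetical helper for illustration, not a theanets function.
    """
    dense = np.zeros(dim, dtype=dtype)
    # np.add.at accumulates repeated indices instead of overwriting them.
    np.add.at(dense,
              np.asarray(indices, dtype=np.intp),
              np.asarray(values, dtype=dtype))
    return dense

# A toy document over a 10-word vocabulary: word 2 occurs three times, word 7 once.
vec = densify([2, 7], [3.0, 1.0], dim=10)
```

The result has `dim` entries, almost all zero, which is the memory cost of not having sparse-matrix support.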

Typically when I have a training problem that uses sparse data, I encode each minibatch on-the-fly using the function-passing interface provided by the Dataset class. For example:

import numpy as np
import theano
import theanets

# assume x is a vector with one entry per training data point.
# each element of x gives the integer index of the single "on"
# bit for that data item, so x represents a one-hot code of our
# dataset, where there are "dim" possible bits per item. 
x, dim = load_sparse_data()
e = theanets.Experiment(theanets.Classifier)

def batch():
    bs = e.args.batch_size
    mini = np.zeros((bs, dim), theano.config.floatX)
    # choose a random minibatch of indices from x, without replacement
    idx = np.random.choice(len(x), bs, replace=False)
    mini[np.arange(bs), x[idx]] = 1.
    return mini

e.run(batch, batch)
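For text data, each item usually has several nonzero features rather than a single "on" bit. The following is a sketch of the same on-the-fly idea for that case, written with NumPy only so it runs without theanets. It assumes the data is a list of integer index arrays, one per text object; `make_batch_fn`, `docs`, and the sizes used here are hypothetical names and values chosen for illustration.

```python
import numpy as np

def make_batch_fn(docs, dim, batch_size, rng=None, dtype=np.float32):
    """Return a zero-argument callable that builds one dense minibatch per call.

    docs is a list of 1-D integer arrays; docs[i] holds the indices of the
    features that are "on" for item i. (Hypothetical helper, not theanets API.)
    """
    if rng is None:
        rng = np.random.RandomState()

    def batch():
        mini = np.zeros((batch_size, dim), dtype=dtype)
        # Sample distinct items for this minibatch.
        chosen = rng.choice(len(docs), size=batch_size, replace=False)
        for row, i in enumerate(chosen):
            # Set every active feature of item i to 1 in its row.
            mini[row, docs[i]] = 1.0
        return mini

    return batch

# Four toy documents over a 5-word vocabulary.
docs = [np.array([0, 3]), np.array([1]), np.array([2, 3, 4]), np.array([4])]
batch = make_batch_fn(docs, dim=5, batch_size=2, rng=np.random.RandomState(0))
mini = batch()
```

Only one minibatch is ever held in dense form, so memory use scales with `batch_size * dim` rather than with the size of the whole dataset.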

This is just a sketch, but I hope that helps!

byzhang commented 10 years ago

Thanks Leif! It looks pretty good for me to try.