shashi-netra opened this issue 6 years ago
@shashi-netra you're right, having another option for loading files would be a reasonable feature. I don't think you're the first to suggest it, either. It shouldn't be difficult to implement, but I can't promise I'll have time to do it in the near future. You're welcome to take a stab at it and open a PR!
@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?
@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array, with each row being a word represented as its word2vec vector. A batch of several documents would make a 3D tensor. Does that help?
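To make the shapes concrete, here's a minimal numpy sketch; the sizes are made up for illustration and aren't the library's actual defaults:

```python
import numpy as np

# Illustrative sizes only (not the library's actual constants)
embedding_dim = 100    # length of each word2vec vector
max_num_words = 10     # documents are padded/truncated to this many words
batch_size = 32

# One document: a 2D array with one row per word
doc = np.zeros((max_num_words, embedding_dim), dtype=np.float32)

# A batch of documents: a 3D tensor
batch = np.zeros((batch_size, max_num_words, embedding_dim), dtype=np.float32)

# Labels: one one-hot (or multi-hot) vector per document
num_classes = 2
labels = np.zeros((batch_size, num_classes), dtype=np.float32)
```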
Are you using padding?
Ex. for classifying cats and dogs:
num_classes = 2
max_num_words = 10 (the maximum number of words in any x, in this example)

Inputs:
1) x = "the dog is red", y = [0, 1], where num_words = 4
2) x = "the cat and dog are blue", y = [1, 1], where num_words = 6
Since we have m = 2 examples, would the input dimensions be (m, max_num_words, embedding_dim)?
@dorg-ekrolewicz yes, that looks correct. We pad with 0s up to max_num_words, and we use a 0 vector when we don't have a representation for a word (out-of-vocabulary).
Pretty much all the code is in this function.
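For anyone who just wants the gist, here's a rough sketch of the padding and zero-vector behaviour described above. The `vectorize` helper is hypothetical, not the actual function linked here:

```python
import numpy as np

def vectorize(docs, word2vec, max_num_words=10, embedding_dim=100):
    """Turn tokenized documents into a zero-padded
    (m, max_num_words, embedding_dim) tensor.

    `word2vec` is assumed to be dict-like, mapping a word to a 1D vector
    of length `embedding_dim`. Words without a representation keep their
    all-zero row, and short documents keep zero rows as padding.
    """
    x = np.zeros((len(docs), max_num_words, embedding_dim), dtype=np.float32)
    for i, words in enumerate(docs):
        for j, word in enumerate(words[:max_num_words]):
            if word in word2vec:  # out-of-vocabulary rows stay zero
                x[i, j] = word2vec[word]
    return x

# The cats/dogs example from above: the first document gets 6 rows of
# padding and the second gets 4.
docs = ["the dog is red".split(),
        "the cat and dog are blue".split()]
y = np.array([[0, 1],   # dog
              [1, 1]])  # cat and dog (multi-label)
# x = vectorize(docs, word2vec); x.shape == (2, 10, 100)
```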
In the current version you have `.lab` and `.txt` files, one of each per training row. Wouldn't it be easier to save these in a single file, or in one file for the labels and another for the texts? Wouldn't that be more idiomatic (a la scikit-learn)? Having several million `.lab` and `.txt` files is especially problematic, since the filesystem chokes up.
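For what it's worth, one possible shape for a consolidated format would be a JSON-lines file with one record per training row. This is just a sketch under assumed conventions; the `text` and `labels` keys are hypothetical, not something the library reads today:

```python
import json

def iter_examples(path):
    """Yield (text, labels) pairs from a single JSON-lines corpus file.

    Each line is assumed to look like:
    {"text": "the dog is red", "labels": ["dog"]}
    """
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield record["text"], record["labels"]
```

A single file like this (or one file for texts and one for labels, kept line-aligned) would avoid creating millions of inodes and still stream nicely during training.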