inspirehep / magpie

Deep neural network framework for multi-label text classification
MIT License

suggestion: Use a single file for labels and text #151

Open shashi-netra opened 6 years ago

shashi-netra commented 6 years ago

In the current version you have .lab and .txt files, one pair per training sample. Wouldn't it be easier to store everything in a single file, or in one file for all labels and another for all texts? Wouldn't this also be more idiomatic (à la scikit-learn)?

Having several million .lab and .txt files is especially problematic: the filesystem chokes on that many small files.
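
For illustration, a loader for such a layout might look like the sketch below (the file names and comma-separated label format are hypothetical, not something magpie currently supports):

```python
# Hypothetical layout: texts.txt and labels.txt aligned line-by-line,
# one document per line, labels comma-separated. Not magpie's current format.

def load_dataset(texts_path="texts.txt", labels_path="labels.txt"):
    """Yield (text, [labels]) pairs from two line-aligned files."""
    with open(texts_path, encoding="utf-8") as tf, \
         open(labels_path, encoding="utf-8") as lf:
        for text, label_line in zip(tf, lf):
            labels = [lab.strip() for lab in label_line.split(",") if lab.strip()]
            yield text.strip(), labels
```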

jstypka commented 6 years ago

@shashi-netra you're right, having another option for loading data would be a reasonable feature. I think you're actually not the first to suggest it. It shouldn't be difficult to implement, but I can't promise I'll have time to do it in the near future. You're welcome to take a stab at it and open a PR!

dorg-ekrolewicz commented 6 years ago

@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?

jstypka commented 6 years ago

@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array, each row being a word represented as a word2vec vector. A batch of several documents would form a 3D tensor. Does that help?
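
As a rough sketch of those shapes (the `embedding` dict below is a toy stand-in for a trained word2vec model, not magpie's actual API):

```python
import numpy as np

# Toy word -> vector lookup standing in for a trained word2vec model.
rng = np.random.default_rng(0)
words = "the dog is red cat and are blue".split()
embedding = {w: rng.normal(size=50) for w in words}  # embedding_dim = 50

def doc_to_matrix(doc_words):
    """One document -> 2D array with one word2vec row per word."""
    return np.stack([embedding[w] for w in doc_words])

doc = doc_to_matrix("the dog is red".split())
print(doc.shape)  # (4, 50): 4 words, each a 50-dim vector
# Stacking a batch of (length-equalised) documents yields a 3D tensor:
# (batch_size, num_words, embedding_dim).
```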

dorg-ekrolewicz commented 6 years ago

Are you using padding?

An example for classifying cats and dogs: num_classes = 2, max_num_words = 10 (the maximum number of words in any x, in this example).

Inputs:
1) x = "the dog is red", y = [0, 1] (num_words = 4)
2) x = "the cat and dog are blue", y = [1, 1] (num_words = 6)

Since we have m = 2 examples, the input dimensions would be (m, max_num_words, embedding_dim)?

jstypka commented 6 years ago

@dorg-ekrolewicz yes, that looks correct. We pad with 0s up to max_num_words, and we use a zero vector whenever we don't have a representation for a word (out-of-vocabulary terms).
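
A minimal sketch of that scheme, reusing the cats/dogs example from above (again with a toy `embedding` lookup rather than magpie's real code):

```python
import numpy as np

embedding_dim, max_num_words = 50, 10
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=embedding_dim)
             for w in "the dog is red cat and are blue".split()}

def vectorise(doc_words):
    """Pad with zero rows up to max_num_words; unknown words get a zero vector."""
    matrix = np.zeros((max_num_words, embedding_dim))
    for i, word in enumerate(doc_words[:max_num_words]):
        matrix[i] = embedding.get(word, np.zeros(embedding_dim))
    return matrix

batch = np.stack([vectorise(x.split())
                  for x in ["the dog is red", "the cat and dog are blue"]])
y = np.array([[0, 1], [1, 1]])  # multi-label targets, one column per class
print(batch.shape)  # (2, 10, 50): (m, max_num_words, embedding_dim)
```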

Pretty much all the code is in this function.