giuseppebonaccorso / Reuters-21578-Classification

Text classification with Reuters-21578 datasets using Gensim Word2Vec and Keras LSTM
http://www.bonaccorso.eu
MIT License

MemoryError #1

Closed: jingweimo closed this issue 7 years ago

jingweimo commented 7 years ago

I am trying to run your code, but I got a memory error in the Word2Vec conversion:

X = np.zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype('float32')

MemoryError

I also noticed what are probably some typos, for example:

vector = zeros(len(target_categories)).astype(float32)

which I changed to:

vector = np.zeros(len(target_categories)).astype('float32')

Can you share your implementation environment, such as the Keras version, operating system, and RAM?

giuseppebonaccorso commented 7 years ago

It seems that your problem has nothing to do with word2vec (which is created by gensim), but with the allocation of an array by NumPy. Can you post the complete stack trace?

The dtype can be specified using a string or through the variable np.float32. The namespace "np" is automatically injected by "%pylab inline". Therefore you can leave everything as it is or change it.
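
For reference, a minimal check showing that the two spellings are equivalent:

import numpy as np

# 'float32' as a string and np.float32 resolve to the same dtype
a = np.zeros(10).astype('float32')
b = np.zeros(10).astype(np.float32)
assert a.dtype == b.dtype == np.float32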

My environment has 32 GB of RAM, but many people have also made it work with 8 GB.
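
For a rough idea of where the MemoryError comes from, here is a back-of-the-envelope estimate; the shape values below are illustrative assumptions, not the exact ones from the notebook:

import numpy as np

# Illustrative shapes (assumptions, not the notebook's exact values)
number_of_documents = 21578
document_max_num_words = 100
num_features = 500

# A dense float32 tensor needs 4 bytes per element
bytes_needed = (number_of_documents * document_max_num_words
                * num_features * np.dtype('float32').itemsize)
print(bytes_needed / 1024 ** 3)  # ~4.0 GiB, which already exhausts a 4 GB machine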

jingweimo commented 7 years ago

@giuseppebonaccorso: You are right. My PC only has 4 GB of RAM; no such error happens on an 8 GB machine.

I note the modeling part only deals with the selected categories, which here means 'pl-usa', a single category. Have you tested on more categories, and what accuracy did you get? Keras provides an example using an MLP on the 46-class Reuters topics (https://github.com/fchollet/keras/blob/master/examples/reuters_mlp.py), which seems slightly different from Reuters-21578, with a test accuracy of about 80%. But when I use an LSTM to classify that Reuters dataset (https://github.com/fchollet/keras/blob/master/keras/datasets/reuters.py), I only get about 40% test accuracy. I have tried different numbers of hidden units and different optimizers but cannot reach performance comparable to the MLP. I will see how the LSTM works with a reduced number of classes. Have you tried the 46-class task?
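
For context, a minimal sketch of the kind of LSTM setup I am describing; the vocabulary cutoff, sequence length, layer sizes, and epoch count are illustrative choices, not the exact values I used:

import numpy as np
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

max_words, maxlen = 10000, 200

# The Keras Reuters dataset yields integer word-index sequences (46 classes)
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=max_words)
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_words, 128, input_length=maxlen))
model.add(LSTM(64))
model.add(Dense(46, activation='softmax'))

# Integer labels, so sparse categorical cross-entropy
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=5,
          validation_data=(X_test, y_test))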

giuseppebonaccorso commented 7 years ago

Yes, 'pl-usa' is a category assigned to 12542 newswire items; in other words, you can perform a binary classification to decide whether a post belongs to this category and then classify the others. (If you check here you can find the distribution of the top categories.)

You can try a Multinomial Naive Bayes to check the probabilities underlying the category 'pl-usa', but I'm afraid it will always be a problem if you try to classify such an unbalanced dataset. Another possibility is to reduce the number of samples labeled 'pl-usa' so as to balance the distribution of the training set. I've tried adding more categories (up to 100 out of 672, which is the total number in the original dataset) and the accuracy is always high (>80%) when the training set is correctly balanced.
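
As a sketch of that baseline idea: parsing Reuters-21578 into raw texts and labels is assumed to be done elsewhere, and the downsampling helper and bag-of-words vectorizer are illustrative, not the notebook's code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def nb_baseline(texts, labels, majority='pl-usa', seed=1000):
    """Downsample the majority class, then fit a Multinomial Naive Bayes
    on bag-of-words counts and return the test accuracy."""
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    maj = np.where(labels == majority)[0]
    rest = np.where(labels != majority)[0]
    # Keep as many majority samples as there are samples of all other classes
    keep = np.concatenate([rng.choice(maj, size=len(rest), replace=False), rest])
    texts = [texts[i] for i in keep]
    labels = labels[keep]

    X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                              test_size=0.2, random_state=seed)
    vec = CountVectorizer(stop_words='english')
    clf = MultinomialNB().fit(vec.fit_transform(X_tr), y_tr)
    return clf.score(vec.transform(X_te), y_te)

# Usage (corpus_texts and corpus_labels come from your own Reuters-21578 parsing):
# print(nb_baseline(corpus_texts, corpus_labels))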

I didn't check the Keras dataset, but I think it's already normalized (only 46 categories). If you look at the original distribution, you can see that it's hard to work with the raw dataset without any preprocessing or limitation; still, you have a good point to work on!