inspirehep / magpie

Deep neural network framework for multi-label text classification
MIT License

some question #99

Open comma2017 opened 7 years ago

comma2017 commented 7 years ago

1. Does Magpie support Chinese? 2. How many category tags are supported? My data has about 1000 categories. 3. Does training work with short text samples, e.g. "haier" (word level)?

NowayIndustries commented 7 years ago

Hi, as a fellow magpie user I'll try to provide some insight & perhaps point you in the right direction for more research.

  1. I'm not sure, but there are a few things you can check:
    • Does Python support Chinese? (I know Python 3 supports Unicode; does that cover the characters you require?)
    • Does Keras (one of the frameworks Magpie uses) support Unicode/Chinese characters?
    • Does TensorFlow/Theano (you pick one of these for Keras to use) support Unicode/Chinese characters?

As far as the actual training & predicting goes, if the character sets are supported it should work. Magpie uses word2vec to learn the language of the training texts before starting training and is thus language independent (assuming the character set is supported).

Some of the intricacies of the language might hinder prediction performance, but I am not a linguist and really have no idea.

The best way to find out is to just try it: either Magpie starts throwing exceptions or it works just fine. Either outcome should give you insight into whether Magpie is usable for your use case.

  2. I assume you mean the number of labels you can train a corpus for? I have trained a corpus that contains 4200 labels and that seems to work just fine. There is probably a theoretical limit somewhere in the software, likely based on the amount of memory available, but I haven't run into it yet.

  3. I'm not sure what you mean by short text, but if you look in the data/hep-categories folder in the repo you can find example files of the texts and their associated labels in the format Magpie expects.

Once you have the corpus ready you should be able to just follow the README instructions and try magpie.
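For a quick start, something along these lines should work (a rough sketch following the README; the paths, labels and hyperparameters here are placeholders, so double-check the README for the exact calls):

from magpie import MagpieModel

# Corpus layout (as in data/hep-categories): one pair of files per document,
#   data/my-corpus/1001.txt  -> the raw text
#   data/my-corpus/1001.lab  -> its labels, one per line
labels = ['label-a', 'label-b', 'label-c']  # your full label vocabulary

magpie = MagpieModel()
magpie.init_word_vectors('data/my-corpus', vec_dim=100)  # trains word2vec + the scaler on the corpus
magpie.train('data/my-corpus', labels, test_ratio=0.2, epochs=30)
print(magpie.predict_from_text('some text to classify'))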

jstypka commented 7 years ago

@NowayIndustries got it right 👍 In summary: 1) Yes. 2) Yes, provided you have a good training set. 3) Yes. I'm not sure what you mean exactly, but the input can be arbitrarily small, even one word, provided it has enough predictive power. The software was designed for texts up to 200 words, though; I'm not sure about the performance on one-word texts.

On another note @NowayIndustries, did you manage to successfully train Magpie on a 4.2k label corpus? Is this the dataset that you described in #86 ?

comma2017 commented 7 years ago

This is how I load the model, but the test results look wrong every time. Is my model loading wrong?

# -*- coding: UTF-8 -*-

import theano
import keras
from magpie import MagpieModel

magpie = MagpieModel(
    keras_model='/workspace/magpie/model/model.h5',
    word2vec_model='/workspace/magpie/embeddings/word2vec',
    scaler='/workspace/magpie/scaler/scaler',
    labels=['cat21445577', 'cat15787568', 'cat21455995', 'cat21455994', 'cat21455993']
)
magpie.load_word2vec_model('/workspace/magpie/embeddings/word2vec')
magpie.load_scaler('/workspace/magpie/scaler/scaler')
magpie.load_model('/workspace/magpie/model/model.h5')
print(magpie.predict_from_text('xiaomi'))

NowayIndustries commented 7 years ago

@comma2017 Can you show the code you use to train and then save the model? And also the pprint'd output from the prediction? You shouldn't have to call the load functions as you do now, since the MagpieModel constructor checks whether your parameters are strings and, if they are, loads those files for you.

Also remember that Magpie can only be as good as its inputs: if you have a problem with the output, look at your inputs first (the text you are predicting for and the training & validation sets).
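In other words, passing the paths straight to the constructor should be enough; a minimal sketch reusing the paths from your snippet:

from magpie import MagpieModel

# The constructor loads the files itself when it is given strings instead of
# already-loaded objects, so no extra load_* calls are needed afterwards.
magpie = MagpieModel(
    keras_model='/workspace/magpie/model/model.h5',
    word2vec_model='/workspace/magpie/embeddings/word2vec',
    scaler='/workspace/magpie/scaler/scaler',
    labels=['cat21445577', 'cat15787568', 'cat21455995', 'cat21455994', 'cat21455993'],
)
print(magpie.predict_from_text('xiaomi'))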

Just for good measure, here is the code I use in my MagpieHandler, still in the old style (labels is an array containing labels, given as parameter to the load function):

from magpie import MagpieModel
from magpie import utils
from keras.models import load_model

# w2vpath, scalerpath, keraspath and labels come from the handler's configuration
w2v = utils.load_from_disk(w2vpath)          # pickled word2vec embeddings
scaler = utils.load_from_disk(scalerpath)    # pickled feature scaler
keras_model = load_model(keraspath)          # trained Keras model (.h5)
self.instance = MagpieModel(keras_model=keras_model, word2vec_model=w2v, scaler=scaler, labels=labels)

=====================================

@jstypka Yes, I have trained Magpie on a 4.2k label corpus. It is basically the same corpus as in #86, but with more recent news articles added.

I actually have several copies of this corpus, which I train with different minimum label counts to see whether that actually has any impact (spoiler: not much, as far as I can tell so far).

As you may or may not remember, my project allows me to set a lower limit on the number of times a label must occur in my raw pile of texts before I will train the model on it. With the limit set to 1 (so every label must occur at least once, i.e. every label), the model is trained on 4224 labels. With the limit set to 10 (so every trained label must occur at least 10 times), the number of labels drops to 459. With the limit set to 60, it is further reduced to 158 labels.

This limit also lowers the number of texts the corpus is then trained on, since only texts carrying labels above the limit are copied into the training and validation sets.
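Roughly, the idea behind the limit looks like this (a simplified, hypothetical sketch, not the actual code from my project):

from collections import Counter

def filter_by_label_count(docs, min_count):
    """docs: dict mapping document id -> list of labels (hypothetical structure)."""
    # Count how often each label occurs across the whole raw-data pile.
    counts = Counter(label for labels in docs.values() for label in labels)
    kept = {label for label, n in counts.items() if n >= min_count}
    # Only documents that still carry at least one kept label make it into
    # the training and validation sets.
    return {doc_id: [l for l in labels if l in kept]
            for doc_id, labels in docs.items()
            if any(l in kept for l in labels)}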

From my very limited testing (running the same text through the different corpora), the only real difference seems to be that the corpus with 4200 labels knows more labels: it returns 4 labels, against only 2 for the other two (above a 10% confidence threshold).

There is some variation in the percentages, but I put that down to the way my training procedure works (it reshuffles/divides the raw data every time). And even then the differences are in the single-percent range.

It is also the training set I describe in #98, where training somehow sped up by 30x just by switching to TensorFlow.

comma2017 commented 7 years ago

I have more than 40,000 samples in total, each around 10 KB, and 128 GB of memory, but I still run out of memory. Why?

jstypka commented 7 years ago

@comma2017 for this use case there is a batch_train method that doesn't load everything into memory at once. It has the same functionality and API, and you should not get any memory problems with it.
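A minimal sketch (assuming batch_train takes the same arguments as train; the data path and label list are placeholders, check the README for the exact signature):

from magpie import MagpieModel

labels = ['cat21445577', 'cat15787568']  # your full label list

magpie = MagpieModel()
magpie.init_word_vectors('/workspace/magpie/data', vec_dim=100)
# batch_train streams the training data from disk in batches instead of
# loading all 40k documents into memory at once.
magpie.batch_train('/workspace/magpie/data', labels, epochs=30)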

sayednafiz commented 7 years ago

Hi,

I get an error (UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1113: character maps to <undefined>) when I run the lines below. I looked into document.py and thought of specifying utf-8 as the encoding at line 28, but your advice on how to solve this would be much appreciated. Thanks.

Lines that I executed:

from magpie import MagpieModel

magpie = MagpieModel()
magpie.init_word_vectors("C:/P/Project/NN_classify_text/magpie_master/data/hep-categories", vec_dim=100)

The traceback is:

UnicodeDecodeError                        Traceback (most recent call last)
in ()
      2
      3 magpie = MagpieModel()
----> 4 magpie.init_word_vectors("C:/P/Project/NN_classify_text/magpie_master/data/hep-categories", vec_dim=100)

~\Anaconda3\envs\snakes\lib\site-packages\magpie\main.py in init_word_vectors(self, train_dir, vec_dim)
    212         :return: None
    213         """
--> 214         self.train_word2vec(train_dir, vec_dim=vec_dim)
    215         self.fit_scaler(train_dir)
    216

~\Anaconda3\envs\snakes\lib\site-packages\magpie\main.py in train_word2vec(self, train_dir, vec_dim)
    226             print('WARNING! Overwriting already trained word2vec model.')
    227
--> 228         self.word2vec_model = train_word2vec(train_dir, vec_dim=vec_dim)
    229
    230         return self.word2vec_model

~\Anaconda3\envs\snakes\lib\site-packages\magpie\base\word2vec.py in train_word2vec(doc_directory, vec_dim)
    128         size=num_features,
    129         min_count=min_word_count,
--> 130         window=context,
    131     )
    132

~\Anaconda3\envs\snakes\lib\site-packages\gensim\models\word2vec.py in __init__(self, sentences, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words)
    467         if isinstance(sentences, GeneratorType):
    468             raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
--> 469         self.build_vocab(sentences, trim_rule=trim_rule)
    470         self.train(sentences)
    471

~\Anaconda3\envs\snakes\lib\site-packages\gensim\models\word2vec.py in build_vocab(self, sentences, keep_raw_vocab, trim_rule, progress_per, update)
    531
    532         """
--> 533         self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
    534         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
    535         self.finalize_vocab(update=update)  # build tables & arrays

~\Anaconda3\envs\snakes\lib\site-packages\gensim\models\word2vec.py in scan_vocab(self, sentences, progress_per, trim_rule)
    543         vocab = defaultdict(int)
    544         checked_string_types = 0
--> 545         for sentence_no, sentence in enumerate(sentences):
    546             if not checked_string_types:
    547                 if isinstance(sentence, string_types):

~\Anaconda3\envs\snakes\lib\site-packages\magpie\base\word2vec.py in __iter__(self)
    112         files = {filename[:-4] for filename in os.listdir(self.dirname)}
    113         for doc_id, fname in enumerate(files):
--> 114             d = Document(doc_id, os.path.join(self.dirname, fname + '.txt'))
    115             for sentence in d.read_sentences():
    116                 yield sentence

~\Anaconda3\envs\snakes\lib\site-packages\magpie\base\document.py in __init__(self, doc_id, filepath, text)
     27
     28         with io.open(filepath, 'r') as f:
---> 29             self.text = f.read()
     30
     31         self.wordset = self.compute_wordset()

~\Anaconda3\envs\snakes\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input, self.errors, decoding_table)[0]
     24
     25 class StreamWriter(Codec, codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1113: character maps to <undefined>

jstypka commented 7 years ago

It's a Python encoding problem. To narrow it down, try to run:

import io, os
data_dir = 'C:/P/Project/NN_classify_text/magpie_master/data/hep-categories'
for name in os.listdir(data_dir):  # the path is a directory, so read each file in it
    with io.open(os.path.join(data_dir, name), 'r') as f:
        f.read()

This suggests that your files are not encoded in UTF-8 but in something else (probably a Windows encoding, e.g. windows-1250).
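If so, one way out is to re-encode the files to UTF-8 before training, for example (a sketch assuming the files are in cp1252; adjust the source encoding to whatever they actually use):

import io
import os

data_dir = 'C:/P/Project/NN_classify_text/magpie_master/data/hep-categories'
for name in os.listdir(data_dir):
    path = os.path.join(data_dir, name)
    # Read with the original Windows encoding and rewrite the file as UTF-8.
    with io.open(path, 'r', encoding='cp1252') as f:
        text = f.read()
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)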