comma2017 opened this issue 7 years ago
Hi, as a fellow magpie user I'll try to provide some insight & perhaps point you in the right direction for more research.
As far as the actual training & predicting goes, it should work as long as the character set is supported. Magpie uses word2vec to learn the language of the training texts before the actual training starts, so it is essentially language independent.
Some intricacies of the language might still hinder prediction performance, but I am not a linguist, so I really can't say.
The best way to find out is to just try it: either Magpie starts throwing exceptions or it works just fine. Either outcome should tell you whether Magpie is usable for your use case.
I assume you mean the number of labels you can train a corpus for? I have trained a corpus with 4200 labels and that seems to work just fine. There is probably a practical limit somewhere in the software, most likely determined by the amount of memory available, but I haven't run into it yet.
I'm not sure what you mean by short text, but if you look in the data/hep-categories folder in the repo you can find example files of the texts and their associated labels as magpie expects them.
Once you have the corpus ready you should be able to just follow the README instructions and try magpie.
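For reference, the corpus layout and the README training flow look roughly like this. The paths, label list and keyword arguments below are placeholders, so double-check against the README for your Magpie version:

# Corpus layout: one .txt/.lab pair per sample (as in data/hep-categories)
#   data/my-corpus/1000001.txt   -> the raw text of the sample
#   data/my-corpus/1000001.lab   -> its labels, one per line

from magpie import MagpieModel

labels = ['cat21445577', 'cat15787568', 'cat21455995']   # the full list of labels to train on

magpie = MagpieModel()
magpie.train_word2vec('data/my-corpus', vec_dim=100)   # learn embeddings from the raw texts
magpie.fit_scaler('data/my-corpus')                    # fit the scaler on the word vectors
magpie.train('data/my-corpus', labels)                 # optional kwargs (test_ratio, epoch count) vary by version
print(magpie.predict_from_text('some new text to tag'))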
@NowayIndustries got it right 👍 In summary: 1) Yes 2) Yes, provided you have a good training set 3) Yes. I'm not sure what you mean exactly, but the input can be arbitrarily small, even one word, provided it has enough predictive power. The software was designed for texts of up to 200 words though, so I'm not sure about the performance on one-word texts.
On another note @NowayIndustries, did you manage to successfully train Magpie on a 4.2k label corpus? Is this the dataset that you described in #86 ?
I load the model like this, but the test results don't look right??? Is the way I load the model wrong?
import theano
import keras
from magpie import MagpieModel

magpie = MagpieModel(
    keras_model='/workspace/magpie/model/model.h5',
    word2vec_model='/workspace/magpie/embeddings/word2vec',
    scaler='/workspace/magpie/scaler/scaler',
    labels=['cat21445577', 'cat15787568', 'cat21455995', 'cat21455994', 'cat21455993']
)
magpie.load_word2vec_model('/workspace/magpie/embeddings/word2vec')
magpie.load_scaler('/workspace/magpie/scaler/scaler')
magpie.load_model('/workspace/magpie/model/model.h5')
print(magpie.predict_from_text('xiaomi'))
@comma2017 Can you show the code you use to train and then save the model? And also the pprint'd output from the prediction? You shouldn't have to call the load functions as you do now, since the MagpieModel constructor will check whether your parameters are strings and, if they are, load those files for you.
Also remember that Magpie can only be as good as its inputs; if you have a problem with the output, look at your inputs first (the text you are predicting on and the training & validation sets).
Just for good measure, here is the code I use in my MagpieHandler, still in the old style (labels is an array containing labels, given as a parameter to the load function):
from magpie import MagpieModel
from magpie import utils
from keras.models import load_model
w2v = utils.load_from_disk(w2vpath)          # w2vpath: path to the saved word2vec embeddings
scaler = utils.load_from_disk(scalerpath)    # scalerpath: path to the saved scaler
keras_model = load_model(keraspath)          # keraspath: path to the saved Keras model (.h5)
self.instance = MagpieModel(keras_model=keras_model, word2vec_model=w2v, scaler=scaler, labels=labels)
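And for completeness, the save side that produces those files. This is a rough sketch: the save_* calls mirror the load_* calls shown above, but double-check the exact method names in your Magpie version:

# assuming `magpie` is an already trained MagpieModel instance
magpie.save_word2vec_model('/workspace/magpie/embeddings/word2vec')
magpie.save_scaler('/workspace/magpie/scaler/scaler')
magpie.save_model('/workspace/magpie/model/model.h5')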
@jstypka Yes, I have trained Magpie on a 4.2k label corpus; it is basically the same corpus as in #86, but with more recent news articles added as well.
I actually have several copies of this corpus, each trained with a different minimum label count, to see whether that actually has any impact (spoiler: not much, as far as I can tell for now).
As you may or may not remember, my project lets me set a lower limit on the number of times a label must occur in my raw-data pile of texts before I will train the model on it. With the limit set to 1 (so every label must occur at least once, i.e. every label) the model is trained on 4224 labels. With the limit set to 10 (so every label that is trained on must occur at least 10 times) the number of labels drops to 459. With the limit set to 60 it drops further to 158 labels.
This limit also lowers the number of texts the corpus is trained on, since only texts with labels above the limit are copied into the training and validation sets.
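For illustration, the filtering step amounts to something like this. Function and variable names are made up; this is not the actual project code:

from collections import Counter

def filter_by_min_count(samples, min_count):
    # samples: list of (text, labels) pairs from the raw-data pile
    counts = Counter(label for _, labels in samples for label in labels)
    kept = {label for label, n in counts.items() if n >= min_count}
    # only keep texts that still carry at least one surviving label
    filtered = [(text, [l for l in labels if l in kept])
                for text, labels in samples
                if any(l in kept for l in labels)]
    return filtered, sorted(kept)

# min_count=1 keeps all 4224 labels, min_count=10 leaves 459, min_count=60 leaves 158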
From my very limited testing (running the same text through the different corpora), the only real difference seems to be that the 4200-label corpus knows more labels: it returns 4 labels, against the other two which only return 2 labels (above a 10% threshold).
There is some variation in the percentages, but I put that down to the way my training procedure works (it reshuffles/divides the raw data every time). And even then the differences are in the single-percent range.
It is also the training set I describe in #98, where training somehow sped up by 30x just by switching to TensorFlow.
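(For anyone reading along: the switch itself is just the standard Keras backend setting, nothing Magpie-specific. You can either edit ~/.keras/keras.json or set the environment variable before Keras is imported, roughly like this:)

import os
os.environ['KERAS_BACKEND'] = 'tensorflow'   # must be set before keras is imported
import keras                                  # should print "Using TensorFlow backend."
from magpie import MagpieModel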
I have a total of more than 40000 samples, each sample around 10k. I have 128G of memory, but I still run out of memory. Why???
@comma2017 for this use case there is a batch_train method that doesn't load everything into memory at once. It has the same functionality and API, and you should not get any memory problems with it.
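A rough sketch of how that looks; the corpus path, vec_dim and label list are placeholders, and the exact keyword arguments may differ between Magpie versions, so check the batch_train docstring in your copy:

from magpie import MagpieModel

labels = ['cat21445577', 'cat15787568', 'cat21455995']   # your full label list

magpie = MagpieModel()
magpie.train_word2vec('data/my-corpus', vec_dim=100)   # same preprocessing as with train()
magpie.fit_scaler('data/my-corpus')
magpie.batch_train('data/my-corpus', labels)           # streams batches from disk instead of loading the whole corpus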
Hi,
I have this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1113: character maps to <undefined>
It's a Python encoding problem. To narrow it down, try to run:
import io, os
data_dir = 'C:/P/Project/NN_classify_text/magpie_master/data/hep-categories'
for name in os.listdir(data_dir):
    with io.open(os.path.join(data_dir, name), 'r') as f:
        f.read()   # raises on the first file that isn't valid in your default encoding
This suggests that your files are not encoded in UTF-8, but something else (probably some Windows encoding, e.g. windows-1250).
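If that turns out to be the case, one way to fix it is to re-encode the files to UTF-8 before training. A rough sketch, assuming the source encoding really is windows-1250:

import io, os

data_dir = 'C:/P/Project/NN_classify_text/magpie_master/data/hep-categories'
for name in os.listdir(data_dir):
    path = os.path.join(data_dir, name)
    with io.open(path, 'r', encoding='windows-1250') as f:
        text = f.read()
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)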
1. Does it support Chinese? 2. How many category tags does it support? I have 1000 category tags. 3. Does it support short texts as training samples? Example: haier (a single word).