piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.62k stars 4.38k forks source link

BleiCorpus after serialize cannot be loaded #1171

Open vincentmajor opened 7 years ago

vincentmajor commented 7 years ago

Python version 2.7.12 gensim version 0.13.2

I'm serializing my corpus into Blei LDA-C format using corpora.BleiCorpus.serialize(filename, corpus) which is then later used in a dynamic topic model, and not in python. (I know I can use the DTMModel wrapper, unrelated.)

If I need to come back and load the corpus back into Python I tried corpora.BleiCorpus.load(filename), I get an unpickling error:

 ---------------------------------------------------------------------------
 UnpicklingError                           Traceback (most recent call last)
<ipython-input-28-0dd15393a941> in <module>()
----> 1 test = corpora.BleiCorpus.load('corpus-mult.dat')

/Applications/anaconda/envs/py27/lib/python2.7/site-packages/gensim/utils.pyc in load(cls, fname, mmap)
    246         compress, subname = SaveLoad._adapt_by_suffix(fname)
    247 
--> 248         obj = unpickle(fname)
    249         obj._load_specials(fname, mmap, compress, subname)
    250         return obj

/Applications/anaconda/envs/py27/lib/python2.7/site-packages/gensim/utils.pyc in unpickle(fname)
    909     with smart_open(fname) as f:
    910         # Because of loading from S3 load can't be used (missing readline in smart_open)
--> 911         return _pickle.loads(f.read())
    912 
    913 

UnpicklingError: invalid load key, '7'.

The only other argument to load() is mmap but I don't believe the arrays were stored separately and using mmap='r' doesn't help anyway.

tmylk commented 7 years ago

@vincentmajor Thanks for reporting it. Are you trying to load the model with 0.13.3 or older version? There is a know issue that is currently being worked on https://github.com/RaRe-Technologies/gensim/issues/1082

vincentmajor commented 7 years ago

@tmylk Yes, gensim version 0.13.2 on Python 2.7.12. But, that issue referenced LDA models whereas I am serializing and loading a corpus without a model. I don't know if the same absence of a random_state variable influences loading of a corpus which should be random state invariant.

tmylk commented 7 years ago

Agree, LdaModel is not related to BleiCorpus. There haven't been any changes. Are you serializing and loading on the same platform, say Linux?

vincentmajor commented 7 years ago

Yep, OS X. It doesn't even work loading straight after serializing.

tmylk commented 7 years ago

do you have a small text and code example to reproduce?

vincentmajor commented 7 years ago

Not really, I'll generate one now.

vincentmajor commented 7 years ago

Okay here is what I am doing. Apologies about the nltk import.

import numpy as np
import pandas as pd

df = pd.DataFrame(["The quick brown fox jumps over the lazy dog.", "This is a test."], columns=["text"])
#df.head()

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') 

token_list = [tokenizer.tokenize(document) for document in df["text"]]
#print token_list

from gensim import corpora
dictionary = corpora.Dictionary(token_list)

## using the dictionary to convert from document to bag of words (doc2bow) encoding
corpus = [dictionary.doc2bow(document) for document in token_list]
## store the corpus into lda-c format, same as used by DTM,
corpora.BleiCorpus.serialize('~/Desktop/test_corpus-mult.dat', corpus)

## fails here
test_corpus = corpora.BleiCorpus.load('~/Desktop/test_corpus-mult.dat')
pranaydeeps commented 7 years ago

Hi, I am looking to fix this issue. Could I get some directions as to where these particular functions are implemented in the code base? Thank you.

pranaydeeps commented 7 years ago

So I looked around and it seems that the load() function loads a pickle whereas the serialize() function calls save_corpus() which does not store a pickle at all but stores in plain text with a .vocab and a .index file if I am not wrong. The fix would be to either change the load() function to read the plain text format or the serialize function calls save() instead of save_corpus() as save() stores in pickle format. Tell me if I am going wrong anywhere @tmylk

pranaydeeps commented 7 years ago

Everything works fine if instead of using the load() function you use the init method from BleiCorpus ie. a = corpora.BleiCorpus('/home/artemis/Desktop/test_corpus-mult.dat') This uses the same format to load as the save_corpus() function. So to sum it up, the save_corpus() is analogous to the default constructor of the class and load() function stores in pickle and is analogous to the save() method.

piskvorky commented 7 years ago

Does this mean the issue is resolved? Can we close?

@pranaydeep-af what would have helped you avoid this mistake? I mean, how would you change the documentation or API so that the behaviour is clear, avoiding the issue? Or at least finding the solution more quickly.

pranaydeeps commented 7 years ago

@piskvorky Maybe we can change the save() and load() functions to save_as_pickle() and load_pickle() so the behaviour is clear. Also, if both ways to store are not needed. Maybe we can implement one common way to save and load? Awaiting confirmation of my findings from @tmylk

piskvorky commented 7 years ago

-1 on save_pickle and load_pickle. The pickle format is only incidental. And it's not even true it's always pickle -- we already store (sub)attributes in various ways, such as direct .npy files to allow memory-mapping.

There is only one way to load/save objects in gensim (called .load() and .save()). Serializing corpora (lossy text formats) is not the same as serializing objects (binary lossless formats).

pranaydeeps commented 7 years ago

Ah, So how about if we catch the unpickling error in the load() function and write an error message which suggests the user to try the default constructor for loading instead? @piskvorky

tmylk commented 7 years ago

A warning on the load, the same way as on the save would be good. @piskvorky What is the use case of the save/load of an iterator for a corpus? can it be better isolated?

pranaydeeps commented 7 years ago

Should I add a warning and send a PR or are we looking for a better solution?

piskvorky commented 7 years ago

@tmylk I don't think there's any use case for pickling iterators... in fact, I don't think it can even be done in Python.

Pickling a "streamed object" itself is pretty much always an error, I'd say. That's not what a user wants (they want to serialize the content inside, not the streaming object itself).

tmylk commented 7 years ago

@pranaydeep-af Can you experiment and see what is actually saved in the save() method? To confirm that it is not a useful method. If confirmed then please add an exception to load and save methods.

pranaydeeps commented 7 years ago

Sure thing, @tmylk

pranaydeeps commented 7 years ago

The exception in load() would be for an Unpickling Error. What sort of exception is desired in the save() method since it does not generate an error? @tmylk

pranaydeeps commented 7 years ago

@tmylk Basic PR with try-catch block for load() function has been submitted.

tmylk commented 7 years ago

@pranaydeep-af What is actualy saved and loaded with the save/load methods in the corpus? Is it useful?

If they are not useful, then my request above was to throw(not catch) an exception in the corpus load and save methods to alert the user that they are not useful methods.

pranaydeeps commented 7 years ago

@tmylk Apologies for the delay. Had to deal with some college course work. I went through the save() and load() methods and did a few experiments. The save() method stores a pickle dump of the particular object. So in the case of a corpora.dictionary object that is rendered useless due to the presence of a superior method ie. save_corpus(), however the function may be handy when used for saving pickles of some other objects. So I have added a check to see if the function was called from a corpora object and if it was, I have raised an exception. Otherwise the function works as usual. For the load function I have added a warning to not use it for loading corpora objects. Do tell me if some naive mistake has been made.

tmylk commented 7 years ago

@pranaydeep-af Could you please be more explicit and provide code samples when save/load are useful and when they are not? Having difficulty understanding you

pranaydeeps commented 7 years ago

@tmylk What I meant was, save() load() methods are not useful for corpus maybe, but they are part of the larger SaveLoad class which many derived classes use. As an example, the DocVecsArray() class which saves Doc2Vec arrays after the model is generated also is a base class of utils.SaveLoad.

model = Doc2Vec(sentences) model.save('/tmp/my_model.doc2vec') model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')

Thus, if save/load are deprecated in general, corpus might work fine with save_corpus() but other things might have trouble.

tmylk commented 7 years ago

Are you saying that corpus should not inherit from SaveLoad?

pranaydeeps commented 7 years ago

@tmylk Yes, that would work because corpus already have independent functions for saving loading. Should I go ahead with that then?

pranaydeeps commented 7 years ago

@tmylk How do I proceed with this?

tmylk commented 7 years ago

@pranaydeep-af Could you please investigate what is actualy saved and loaded with the save/load methods in the corpus?

pranaydeeps commented 7 years ago

@tmylk I did that in a comment 9 days ago. Here's the comment again:

@tmylk Apologies for the delay. Had to deal with some college course work. I went through the save() and load() methods and did a few experiments. The save() method stores a pickle dump of the particular object. So in the case of a corpora.dictionary object that is rendered useless due to the presence of a superior method ie. save_corpus(), however the function may be handy when used for saving pickles of some other objects.