vincentmajor opened 7 years ago
@vincentmajor Thanks for reporting it. Are you trying to load the model with 0.13.3 or an older version? There is a known issue that is currently being worked on: https://github.com/RaRe-Technologies/gensim/issues/1082
@tmylk Yes, gensim version 0.13.2 on Python 2.7.12.
But that issue concerns LDA models, whereas I am serializing and loading a corpus without any model. I don't know whether the same missing random_state attribute affects loading a corpus, which should be invariant to random state.
Agreed, LdaModel is not related to BleiCorpus. There haven't been any changes there. Are you serializing and loading on the same platform, say Linux?
Yep, OS X. It doesn't even work loading straight after serializing.
Do you have a small text and code example to reproduce?
Not really, I'll generate one now.
Okay, here is what I am doing. Apologies about the nltk import.
import numpy as np
import pandas as pd
df = pd.DataFrame(["The quick brown fox jumps over the lazy dog.", "This is a test."], columns=["text"])
#df.head()
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
token_list = [tokenizer.tokenize(document) for document in df["text"]]
#print token_list
from gensim import corpora
dictionary = corpora.Dictionary(token_list)
## using the dictionary to convert from document to bag of words (doc2bow) encoding
corpus = [dictionary.doc2bow(document) for document in token_list]
## store the corpus in lda-c format, same as used by DTM
## (note: '~' is not expanded automatically, so expand it first)
import os
path = os.path.expanduser('~/Desktop/test_corpus-mult.dat')
corpora.BleiCorpus.serialize(path, corpus)
## fails here with an unpickling error
test_corpus = corpora.BleiCorpus.load(path)
Hi, I am looking to fix this issue. Could I get some directions as to where these particular functions are implemented in the code base? Thank you.
So I looked around, and it seems that the load() function loads a pickle, whereas the serialize() function calls save_corpus(), which does not store a pickle at all but writes plain text plus a .vocab and an .index file, if I am not wrong. The fix would be either to change load() to read the plain-text format, or to have serialize() call save() instead of save_corpus(), since save() stores in pickle format. Tell me if I am going wrong anywhere @tmylk
Everything works fine if, instead of using the load() function, you use the BleiCorpus constructor, i.e.
a = corpora.BleiCorpus('/home/artemis/Desktop/test_corpus-mult.dat')
This reads the same format that save_corpus() writes. So to sum it up: save_corpus() pairs with the class constructor (plain-text format), while save() and load() pair with each other (pickle format).
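The mismatch can be seen without gensim at all: the serialize() path writes a plain-text LDA-C file, while load() tries to unpickle whatever bytes it finds. A minimal sketch (the file contents and path here are made up for illustration, not gensim's actual output):

```python
import os
import pickle
import tempfile

# save_corpus()/serialize() write plain text, roughly LDA-C style:
# each line is "<num_terms> <term_id>:<count> ...".
path = os.path.join(tempfile.mkdtemp(), "test_corpus-mult.dat")
with open(path, "w") as f:
    f.write("2 0:1 1:1\n1 2:1\n")

# load() expects a pickle, so feeding it this plain-text file fails.
try:
    with open(path, "rb") as f:
        pickle.load(f)
    result = "loaded"
except Exception:  # raises an unpickling/decode error in practice
    result = "failed to unpickle"
print(result)
```

This is exactly the failure mode in the original report: the file on disk was never a pickle, so only the constructor (which parses the text format) can read it back.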
Does this mean the issue is resolved? Can we close?
@pranaydeep-af what would have helped you avoid this mistake? I mean, how would you change the documentation or API so that the behaviour is clear, avoiding the issue? Or at least finding the solution more quickly.
@piskvorky Maybe we can rename the save() and load() functions to save_as_pickle() and load_pickle() so the behaviour is clear. Also, if both ways of storing are not needed, maybe we can implement one common way to save and load? Awaiting confirmation of my findings from @tmylk
-1 on save_pickle and load_pickle. The pickle format is only incidental. And it's not even true it's always pickle -- we already store (sub)attributes in various ways, such as direct .npy files to allow memory-mapping.
There is only one way to load/save objects in gensim (called .load() and .save()). Serializing corpora (lossy text formats) is not the same as serializing objects (binary lossless formats).
Ah, so how about we catch the unpickling error in the load() function and emit an error message suggesting that the user try the constructor for loading instead? @piskvorky
A warning on load, the same way as on save, would be good. @piskvorky What is the use case of saving/loading an iterator for a corpus? Can it be better isolated?
Should I add a warning and send a PR or are we looking for a better solution?
@tmylk I don't think there's any use case for pickling iterators... in fact, I don't think it can even be done in Python.
Pickling a "streamed object" itself is pretty much always an error, I'd say. That's not what a user wants (they want to serialize the content inside, not the streaming object itself).
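As a quick sanity check of the "can't even be done" claim above, plain Python refuses to pickle a generator, the simplest kind of streamed object. (Note that gensim corpus classes are iterable objects rather than raw generators, so they pickle without error, which is exactly why save() can silently store an object without the documents inside it.)

```python
import pickle

# a streamed object, analogous to a corpus iterator
stream = (x * x for x in range(5))

try:
    pickle.dumps(stream)
    picklable = True
except TypeError:  # generators cannot be pickled in CPython
    picklable = False

print(picklable)
```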
@pranaydeep-af Can you experiment and see what is actually saved in the save() method? To confirm that it is not a useful method. If confirmed then please add an exception to load and save methods.
Sure thing, @tmylk
The exception in load() would be for an UnpicklingError. What sort of exception is desired in the save() method, since it does not generate an error? @tmylk
@tmylk A basic PR with a try/except block for the load() function has been submitted.
@pranaydeep-af What is actually saved and loaded with the save/load methods in the corpus? Is it useful?
If they are not useful, then my request above was to throw (not catch) an exception in the corpus load and save methods, to alert the user that these methods are not useful there.
@tmylk Apologies for the delay; I had to deal with some college course work. I went through the save() and load() methods and ran a few experiments. The save() method stores a pickle dump of the object itself. In the case of a corpora object that is rendered useless by the presence of a superior method, i.e. save_corpus(); however, the function may still be handy for saving pickles of other objects. So I have added a check for whether the function was called on a corpora object and, if it was, I raise an exception; otherwise the function works as usual. For the load() function I have added a warning not to use it for loading corpora objects. Do tell me if I have made some naive mistake.
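A stripped-down sketch of that guard, using stand-in classes instead of gensim's real utils.SaveLoad and corpus hierarchy (the class names, check, and error message are illustrative, not gensim's actual code):

```python
import pickle


class SaveLoad(object):
    """Stand-in for gensim.utils.SaveLoad."""

    def save(self, fname):
        # Guard: pickling a corpus via save() stores only the (empty)
        # streaming object, not the documents, so refuse and point the
        # user at serialize() instead.
        if isinstance(self, CorpusABC):
            raise TypeError(
                "cannot save a corpus with save(); use serialize() "
                "and load it back with the corpus constructor")
        with open(fname, "wb") as f:
            pickle.dump(self, f)


class CorpusABC(SaveLoad):
    """Stand-in for gensim.interfaces.CorpusABC."""


class BleiCorpus(CorpusABC):
    """Stand-in for gensim.corpora.BleiCorpus."""


try:
    BleiCorpus().save("corpus.pkl")
    guarded = False
except TypeError:
    guarded = True
print(guarded)
```

Non-corpus subclasses of SaveLoad still pickle normally, which matches the observation below that models like Doc2Vec rely on save()/load().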
@pranaydeep-af Could you please be more explicit and provide code samples showing when save/load are useful and when they are not? I'm having difficulty understanding you.
@tmylk What I meant was that the save()/load() methods may not be useful for a corpus, but they are part of the larger SaveLoad class, which many derived classes use. As an example, the DocVecsArray class, which saves Doc2Vec arrays after the model is generated, also derives from utils.SaveLoad:
model = Doc2Vec(sentences)
model.save('/tmp/my_model.doc2vec')
model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')
Thus, if save/load are deprecated in general, corpora might work fine with save_corpus(), but other classes might have trouble.
Are you saying that corpus should not inherit from SaveLoad?
@tmylk Yes, that would work, because corpora already have independent functions for saving and loading. Should I go ahead with that then?
@tmylk How do I proceed with this?
@pranaydeep-af Could you please investigate what is actually saved and loaded with the save/load methods in the corpus?
@tmylk I did that in a comment 9 days ago. Here's the comment again:
@tmylk Apologies for the delay; I had to deal with some college course work. I went through the save() and load() methods and ran a few experiments. The save() method stores a pickle dump of the object itself. In the case of a corpora object that is rendered useless by the presence of a superior method, i.e. save_corpus(); however, the function may still be handy for saving pickles of other objects.
Python version 2.7.12, gensim version 0.13.2.
I'm serializing my corpus into Blei LDA-C format using corpora.BleiCorpus.serialize(filename, corpus), which is then later used in a dynamic topic model, not in Python. (I know I can use the DTMModel wrapper; unrelated.) If I need to come back and load the corpus back into Python and try corpora.BleiCorpus.load(filename), I get an unpickling error. The only other argument to load() is mmap, but I don't believe the arrays were stored separately, and using mmap='r' doesn't help anyway.