piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

There is no load with file handle #1611

Open digitalex opened 6 years ago

digitalex commented 6 years ago

Through utils.SaveLoad, some classes (like Corpus) offer saving to either a filename or a file-like object (fname_or_handle). However, the corresponding load method accepts only filenames, which makes it cumbersome to use with a non-standard file system.

If this feature (adding the ability to load from file-like object) is desirable, I'd be happy to send a PR.

menshikh-iv commented 6 years ago

Thanks for the report @digitalex. Can you show me an example where fname_or_handle would be useful for the load method? The main problem is that save can produce several files (not just one), which is why load accepts only a filename.

piskvorky commented 6 years ago

Being able to load from a file handle would be awesome.

Like @menshikh-iv says, this will limit you to objects that save just one file. Note that you could achieve the same effect with plain cPickle.dump / cPickle.load. You won't be able to use memory-mapping (mmap), and you will be limited by pickle's max size (3GB IIRC), etc. But it might still cover a lot of use cases.
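For reference, a minimal sketch of the plain-pickle round trip through a file-like object. It uses Python 3's `pickle` (the successor to cPickle) and an in-memory `io.BytesIO` as a stand-in for a real handle; `model_state` is a hypothetical placeholder for a single-file gensim object, not an actual gensim class.

```python
import io
import pickle

# Hypothetical stand-in for an object that fits in a single pickle.
model_state = {"num_topics": 2, "id2word": {0: "human", 1: "computer"}}

# Any object with a write() method works as the output handle.
handle = io.BytesIO()
pickle.dump(model_state, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Rewinding is only needed here because the same in-memory buffer is reused;
# a real workflow would open a fresh handle for reading.
handle.seek(0)
restored = pickle.load(handle)
```

This bypasses whatever extra logic a class-specific load performs, so it is only a fallback.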

One advantage I see in using gensim's load(handle) instead of pickle.load(handle) is that some classes use extra logic inside load, such as checking backward compatibility etc. You won't get this with pickle.

piskvorky commented 6 years ago

Another point: gensim uses smart_open internally, so that users can transparently save/load from S3, HDFS, compressed files, etc. (not just the local filesystem).

So the handle-loading capability should really expect a minimal "file-like" object, and not rely on too many fancy local-filesystem-only operations (even seek is tricky). A simple .read() is always fine though.
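A rough sketch of what "only .read()" could look like in practice. `ReadOnlyHandle` and `load_from_handle` are hypothetical names for illustration, not part of gensim or smart_open. Note that `pickle.load(handle)` expects a `readline()` method as well, so the sketch reads all bytes first and calls `pickle.loads` instead.

```python
import pickle


class ReadOnlyHandle:
    """Hypothetical stream exposing nothing but read() (no seek, no readline)."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size=-1):
        # Return the next `size` bytes, or everything that is left.
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


def load_from_handle(handle):
    """Unpickle from any object with read(); assumes the payload fits in memory."""
    return pickle.loads(handle.read())


payload = pickle.dumps({"num_topics": 2})
restored = load_from_handle(ReadOnlyHandle(payload))
```

The trade-off is that the whole payload is buffered in memory before unpickling.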

digitalex commented 6 years ago

@menshikh-iv an example that I'm encountering right now is to save to and load from our proprietary HDFS-like distributed file system - since I want to build the corpus and later an LSI model with a MapReduce job, inputs and outputs have to go to our FS, not a local file on a worker machine.

I recognize the issue with several files, good point that I didn't realize earlier. It may be possible to overcome this problem by using pickle's persistent_id feature, but I can't say that sounds like a fantastic idea.
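To make the persistent_id idea concrete, here is a small sketch using only the standard library. The class names and the `sidecar` dict are hypothetical; in a real setup the sidecar would be a second file or handle, and the externalized objects would be large arrays rather than a tiny `array.array`.

```python
import io
import pickle
from array import array


class SidecarPickler(pickle.Pickler):
    """Keep array payloads out of the main pickle stream."""

    def __init__(self, file, sidecar):
        super().__init__(file)
        self.sidecar = sidecar  # hypothetical external store: key -> array

    def persistent_id(self, obj):
        if isinstance(obj, array):
            key = "array-%d" % len(self.sidecar)
            self.sidecar[key] = obj
            return key  # only this key is written to the stream
        return None  # everything else is pickled as usual


class SidecarUnpickler(pickle.Unpickler):
    """Resolve persistent ids by looking them up in the sidecar store."""

    def __init__(self, file, sidecar):
        super().__init__(file)
        self.sidecar = sidecar

    def persistent_load(self, pid):
        try:
            return self.sidecar[pid]
        except KeyError:
            raise pickle.UnpicklingError("unknown persistent id: %r" % (pid,))


model = {"num_topics": 2, "weights": array("d", [0.1, 0.9])}
sidecar = {}
buf = io.BytesIO()
SidecarPickler(buf, sidecar).dump(model)

buf.seek(0)
restored = SidecarUnpickler(buf, sidecar).load()
```

The main stream stays a single file, but the sidecar still has to travel with it, so this moves the several-files problem rather than removing it.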

My backup strategy is to do what @piskvorky suggests, pickling/unpickling myself. I can probably use save() since I believe it would only save a single file (with pickle) if it gets a file handle (the object has a 'write' attr).

piskvorky commented 6 years ago

That's an interesting idea with the persistent_id, using custom picklers/unpicklers in place of custom save/load.

Either way, adding support for file handles would be awesome @digitalex. Can you add that in a PR?