piskvorky / sim-shootout

Code for "Performance shootout between nearest-neighbour libraries": http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro
MIT License

AttributeError: 'module' object has no attribute 'getstream' #1

Closed JDonner closed 9 years ago

JDonner commented 9 years ago

Radim, perhaps you refactored gensim out from under your shootout code. I get the following when I try to run part of it:

(gensim-play).../gensim-play/shootout> ./run-prepare.sh data
2015-04-02 01:10:17,007 : INFO : running ./prepare_shootout.py /mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2 data
Traceback (most recent call last):
  File "./prepare_shootout.py", line 132, in <module>
    corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
  File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/corpora/textcorpus.py", line 61, in __init__
    self.dictionary.add_documents(self.get_texts())
  File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/corpora/dictionary.py", line 119, in add_documents
    for docno, document in enumerate(documents):
  File "./prepare_shootout.py", line 92, in get_texts
    lines = gensim.corpora.textcorpus.getstream(self.input) # open file/reset stream to its start
AttributeError: 'module' object has no attribute 'getstream'

For full context, run_prepare.sh is just your run_all.sh with a hardwired enwiki...bz2 file and the actual shootout removed; it only does the preparation:

#### run_prepare.sh ####
#!/bin/bash

EXPECTED_ARGS=1
E_BADARGS=65

if [ $# -ne $EXPECTED_ARGS ]
then
    # when run without params, print help and exit
    echo "first argument must be a directory where data & indexes will be stored; make sure you have at least 100gb free space there"
    exit $E_BADARGS
fi

datadir=$1
shift 1

# first, download the raw wiki dump and convert it to LSI vectors, if not already present
articles=$datadir/lsi_vectors.mm.gz
if [ ! -e $articles ]; then
    # wiki_file=$datadir/enwiki-latest-pages-articles.xml.bz2
    wiki_file=/mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2
    if [ ! -e $wiki_file ]; then
        echo "downloading wiki dump to $wiki_file"
        wget -O $wiki_file 'http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'
    fi
    ./prepare_shootout.py $wiki_file $datadir 2>&1 | tee ./log_prepare.txt
    gzip -v $datadir/lsi_vectors.mm
fi

Thanks.

piskvorky commented 9 years ago

Yes, it's been refactored since.

Can you try replacing the gensim.corpora.textcorpus.getstream(self.input) with gensim.utils.file_or_filename(self.input)?

And if that works, can you send a pull request against the shootout repo with this fix? It could help other people who run into the same issue in the future. Cheers.

JDonner commented 9 years ago

Actually both that and what I tried, self.getstream(), have it that 'lines' is a GeneratorContextManager, not an iterable:

(gensim-play).../gensim-play/shootout> ./run-prepare.sh data
2015-04-02 07:43:21,593 : INFO : running ./prepare_shootout.py /mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2 data
Traceback (most recent call last):
  File "./prepare_shootout.py", line 134, in <module>
    corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
  File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/corpora/textcorpus.py", line 61, in __init__
    self.dictionary.add_documents(self.get_texts())
  File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/corpora/dictionary.py", line 119, in add_documents
    for docno, document in enumerate(documents):
  File "./prepare_shootout.py", line 95, in get_texts
    for lineno, line in enumerate(lines):
TypeError: 'GeneratorContextManager' object is not iterable

class ShootoutCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        # lines = gensim.corpora.textcorpus.getstream(self.input) # open file/reset stream to its start
        # lines = gensim.utils.file_or_filename(self.input)
        lines = self.getstream()
        for lineno, line in enumerate(lines):           #### line 95
            yield line.split('\t')[1].split()  # return tokens (ignore the title before the tab)
        self.length = lineno + 1

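
The TypeError above is the usual symptom of iterating the return value of a function decorated with contextlib.contextmanager: the call returns a context manager, and the stream only becomes available once it is entered with `with`. A minimal sketch of that behaviour follows, without gensim. The `open_stream` helper and the sample data are hypothetical stand-ins, not gensim's actual code, and the snippet assumes Python 3, where the class is named `_GeneratorContextManager`.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_stream(source):
    """Hypothetical helper: rewind a file-like object and yield it."""
    source.seek(0)
    yield source


# Hypothetical sample in the "title<TAB>text" layout the shootout code expects.
data = io.StringIO("title one\tfirst doc tokens\ntitle two\tsecond doc\n")

# Iterating the return value directly fails, because it is a context
# manager wrapping the stream, not the stream itself.
try:
    for line in open_stream(data):
        pass
    direct_error = None
except TypeError as err:
    direct_error = err

# Entering it with `with` hands back the underlying iterable stream.
with open_stream(data) as lines:
    tokens = [line.split('\t')[1].split() for line in lines]
```

Here `direct_error` holds the "object is not iterable" TypeError, while `tokens` holds one token list per line with the title before the tab dropped.
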
JDonner commented 9 years ago

I get further with this:

class ShootoutCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        lineno = 0
        with self.getstream() as lines:
            for lineno, line in enumerate(lines):
                yield line.split('\t')[1].split()  # return tokens (ignore the title before the tab)
            self.length = lineno + 1

(gensim-play).../gensim-play/shootout> ./run-prepare.sh data
2015-04-02 08:23:16,556 : INFO : running ./prepare_shootout.py /mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2 data
2015-04-02 08:23:16,557 : INFO : built Dictionary(0 unique tokens: []) from 0 documents (total 0 corpus positions)
2015-04-02 08:23:16,558 : INFO : discarding 0 tokens: []...
2015-04-02 08:23:16,558 : INFO : keeping 0 tokens which were in no less than 20 and no more than 0 (=10.0%) documents
2015-04-02 08:23:16,558 : INFO : resulting dictionary: Dictionary(0 unique tokens: [])
2015-04-02 08:23:16,558 : INFO : saving Dictionary object under data/dictionary, separately None
2015-04-02 08:23:16,558 : INFO : saving dictionary mapping to data/dictionary.txt
2015-04-02 08:23:16,558 : INFO : collecting document frequencies
2015-04-02 08:23:16,559 : INFO : calculating IDF weights for 0 documents and 0 features (0 matrix non-zeros)
2015-04-02 08:23:16,559 : INFO : saving TfidfModel object under data/tfidf.model, separately None
2015-04-02 08:23:16,559 : INFO : using serial LSI version on this node
2015-04-02 08:23:16,559 : INFO : updating model with new documents
2015-04-02 08:23:16,559 : INFO : saving Projection object under data/lsi.model.projection, separately None
2015-04-02 08:23:16,560 : INFO : saving LsiModel object under data/lsi.model, separately None
2015-04-02 08:23:16,560 : INFO : not storing attribute projection
2015-04-02 08:23:16,560 : INFO : not storing attribute dispatcher
Traceback (most recent call last):
  File "./prepare_shootout.py", line 159, in <module>
    gensim.corpora.MmCorpus.serialize(vectors_file, (gensim.matutils.unitvec(vec) for vec in lsi[tfidf[corpus]]))
  File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/lsimodel.py", line 423, in __getitem__
    assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet

.. but then get the above. I'll keep looking at it if you don't get to it.

JDonner commented 9 years ago

Closing this one, I'll open another if I don't make progress on this most recent one.