Closed (JDonner closed 9 years ago)
Yes, it's been refactored since.
Can you try replacing gensim.corpora.textcorpus.getstream(self.input) with gensim.utils.file_or_filename(self.input)?
And if that works, can you send a pull request against the shootout repo with this fix? It could help other people who run into the same issue in the future. Cheers.
Actually, with both that and what I tried, self.getstream(), 'lines' ends up as a GeneratorContextManager, not an iterable:
(gensim-play).../gensim-play/shootout> ./run-prepare.sh data
2015-04-02 07:43:21,593 : INFO : running ./prepare_shootout.py /mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2 data
Traceback (most recent call last):
File "./prepare_shootout.py", line 134, in
class ShootoutCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        # lines = gensim.utils.file_or_filename(self.input)
        lines = self.getstream()
        for lineno, line in enumerate(lines):  #### line 95
            yield line.split('\t')[1].split()  # return tokens (ignore the title before the tab)
        self.length = lineno + 1
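To illustrate the symptom in isolation, here is a minimal sketch that does not use gensim at all. `open_stream` is a hypothetical stand-in for any helper decorated with `contextlib.contextmanager`; it is an assumption for demonstration, not the library's actual implementation. Calling such a helper returns a context manager object, so passing the result straight to enumerate() fails, while entering it with `with` yields the underlying stream.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_stream(text):
    # Hypothetical helper: yields an open file-like object and closes it afterwards.
    stream = io.StringIO(text)
    try:
        yield stream
    finally:
        stream.close()


DATA = "title one\tfoo bar\ntitle two\tbaz\n"


def iterate_directly():
    # The context manager object itself is not iterable, so enumerate() raises.
    try:
        for _ in enumerate(open_stream(DATA)):
            pass
    except TypeError as err:
        return type(err).__name__
    return "no error"


def iterate_inside_with():
    # Entering the context manager gives back the stream, which is iterable.
    with open_stream(DATA) as lines:
        return [line.split("\t")[1].split() for line in lines]
```

The first function reports the exception type instead of crashing, which makes the difference between the two call styles easy to compare.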
I get further with this:
class ShootoutCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        lineno = 0
        with self.getstream() as lines:
            for lineno, line in enumerate(lines):
                yield line.split('\t')[1].split()  # return tokens (ignore the title before the tab)
        self.length = lineno + 1
(gensim-play).../gensim-play/shootout> ./run-prepare.sh data
2015-04-02 08:23:16,556 : INFO : running ./prepare_shootout.py /mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2 data
2015-04-02 08:23:16,557 : INFO : built Dictionary(0 unique tokens: []) from 0 documents (total 0 corpus positions)
2015-04-02 08:23:16,558 : INFO : discarding 0 tokens: []...
2015-04-02 08:23:16,558 : INFO : keeping 0 tokens which were in no less than 20 and no more than 0 (=10.0%) documents
2015-04-02 08:23:16,558 : INFO : resulting dictionary: Dictionary(0 unique tokens: [])
2015-04-02 08:23:16,558 : INFO : saving Dictionary object under data/dictionary, separately None
2015-04-02 08:23:16,558 : INFO : saving dictionary mapping to data/dictionary.txt
2015-04-02 08:23:16,558 : INFO : collecting document frequencies
2015-04-02 08:23:16,559 : INFO : calculating IDF weights for 0 documents and 0 features (0 matrix non-zeros)
2015-04-02 08:23:16,559 : INFO : saving TfidfModel object under data/tfidf.model, separately None
2015-04-02 08:23:16,559 : INFO : using serial LSI version on this node
2015-04-02 08:23:16,559 : INFO : updating model with new documents
2015-04-02 08:23:16,559 : INFO : saving Projection object under data/lsi.model.projection, separately None
2015-04-02 08:23:16,560 : INFO : saving LsiModel object under data/lsi.model, separately None
2015-04-02 08:23:16,560 : INFO : not storing attribute projection
2015-04-02 08:23:16,560 : INFO : not storing attribute dispatcher
Traceback (most recent call last):
File "./prepare_shootout.py", line 159, in __getitem__
assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet
... but then I get the error above. I'll keep looking at it if you don't get to it.
Closing this one, I'll open another if I don't make progress on this most recent one.
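For anyone reproducing this, the with-based get_texts pattern can be exercised without gensim. The sketch below is only an illustration: `LineCorpus` and its `getstream` are hypothetical stand-ins, not gensim classes or methods. It counts documents explicitly rather than deriving the length from `lineno + 1`, since with `lineno = 0` as the starting value an empty input would be reported as length 1.

```python
import io
from contextlib import contextmanager


class LineCorpus:
    """Hypothetical stand-in for the corpus class; not part of gensim."""

    def __init__(self, text):
        self.text = text
        self.length = None  # only known after get_texts() has been exhausted

    @contextmanager
    def getstream(self):
        # Assumed behaviour: hand out a fresh stream positioned at the start.
        stream = io.StringIO(self.text)
        try:
            yield stream
        finally:
            stream.close()

    def get_texts(self):
        count = 0
        with self.getstream() as lines:
            for line in lines:
                # Return tokens (ignore the title before the tab).
                yield line.split("\t")[1].split()
                count += 1
        self.length = count
```

Running `list(LineCorpus("").get_texts())` produces no documents and leaves `length` at 0, so an empty input like the "0 documents" run in the log is at least reported accurately.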
Radim, perhaps you refactored gensim out from under your shootout code. I get the following when I try to run part of it:
(gensim-play).../gensim-play/shootout> ./run-prepare.sh data
2015-04-02 01:10:17,007 : INFO : running ./prepare_shootout.py /mnt/raid/torrents/enwiki-20140707-pages-articles.xml.bz2 data
Traceback (most recent call last):
File "./prepare_shootout.py", line 132, in
corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/corpora/textcorpus.py", line 61, in __init__
self.dictionary.add_documents(self.get_texts())
File "/media/jd/7adf8f2f-25e0-4c88-8f0d-7d1a91ba07ec/home/jd/raid/test/projects/python-envs/gensim-play/local/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/corpora/dictionary.py", line 119, in add_documents
for docno, document in enumerate(documents):
File "./prepare_shootout.py", line 92, in get_texts
lines = gensim.corpora.textcorpus.getstream(self.input) # open file/reset stream to its start
AttributeError: 'module' object has no attribute 'getstream'
For full context, run-prepare.sh is just your run_all.sh with a hardwired enwiki...bz2 file and the actual shootout removed; it only does the preparation:
#!/bin/bash

EXPECTED_ARGS=1
E_BADARGS=65

if [ $# -ne $EXPECTED_ARGS ]
then
    # when run without params, print help and exit
    echo "first argument must be a directory where data & indexes will be stored; make sure you have at least 100gb free space there"
    exit $E_BADARGS
fi

datadir=$1
shift 1

# first, download the raw wiki dump and convert it to LSI vectors, if not already present
articles=$datadir/lsi_vectors.mm.gz
if [ ! -e $articles ]; then
    wiki_file=$datadir/enwiki-latest-pages-articles.xml.bz2
fi
Thanks.