piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Word2Vec sentences stream iterator does not reach stop condition #967

Closed loretoparisi closed 8 years ago

loretoparisi commented 8 years ago

My corpus is loaded with an iterator like this:

# Imports assumed; the original post did not show them.
import json
import os

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess


class LyricsCorpus(object):

    def __init__(self, corpus, tokenize=False, deaccent=False):
        self.corpus = corpus      # directory containing the JSON corpus files
        self.tokenize = tokenize  # if True, preprocess with tokens()
        self.deaccent = deaccent  # if True, strip accents while tokenizing

    def __iter__(self):
        # Yields one list of tokens per lyrics entry, file by file.
        for fname in os.listdir(self.corpus):
            with open(os.path.join(self.corpus, fname)) as data_file:
                print("loading corpora file %s..." % fname)
                data = json.load(data_file)
                for item in data:
                    if "lyrics" in item:
                        if "lyrics_body" in item["lyrics"]:
                            if self.tokenize:
                                yield self.tokens(item["lyrics"]["lyrics_body"])
                            else:
                                yield item["lyrics"]["lyrics_body"].split()

    def tokens(self, text):
        '''
        Lowercase, tokenize and (optionally) de-accent the text. The output is the final tokens (unicode strings), which are not processed any further.
        '''
        return [token for token in simple_preprocess(text, deacc=self.deaccent, min_len=2, max_len=15) if token not in STOPWORDS]

When I pass it to Word2Vec, the iteration does not stop and seems to loop:

loading corpora file charts_lyrics.json...
loading corpora file charts_lyrics_2.json...
loading corpora file charts_lyrics.json...
loading corpora file charts_lyrics_2.json...
loading corpora file charts_lyrics.json...
loading corpora file charts_lyrics_2.json...
loading corpora file charts_lyrics.json...
loading corpora file charts_lyrics_2.json...
...

min_count = 1
size = 100
window = 4
model = Word2Vec(corpus_iterator, min_count=min_count, size=size, window=window)

If I change __iter__ to load just one file, like

    def __iter__(self):
        with open(self.fname) as data_file:
            print("loading corpora file %s..." % self.fname)
            data = json.load(data_file)
            for item in data:
                if "lyrics" in item:
                    if "lyrics_body" in item["lyrics"]:
                        if self.tokenize:
                            yield self.tokens(item["lyrics"]["lyrics_body"])
                        else:
                            yield item["lyrics"]["lyrics_body"].split()

it works. If I iterate over the corpus directly, like

from LyricsCorpus import *
it = LyricsCorpus('./corpus')
print([item for item in it])

it iterates correctly.

Do I need a raise StopIteration exit condition in the first case? If so, why does this not happen with the single for loop in the second case?

gojomo commented 8 years ago

Word2Vec iterates once over your corpus to do vocabulary-discovery (aka build_vocab()), then multiple times (controlled by iter parameter, default 5) for training. So this output doesn't indicate anything wrong to me, unless it goes on forever. (Does it?)
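
As a rough illustration of the pass count described above, here is a minimal sketch that does not use gensim at all. CountingCorpus and fake_word2vec are hypothetical names, and epochs stands in for the iter parameter; the loop only mimics "one vocabulary pass, then several training passes" and is not gensim's implementation.

```python
class CountingCorpus(object):
    """Restartable iterable that counts how many times it is iterated."""

    def __init__(self, sentences):
        self.sentences = sentences
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        for sentence in self.sentences:
            yield sentence
        # No explicit StopIteration is needed: the generator simply
        # finishes when the for loop above runs out of sentences.


def fake_word2vec(corpus, epochs=5):
    """Hypothetical stand-in: one vocabulary scan, then `epochs` training passes."""
    vocab = set()
    for sentence in corpus:        # vocabulary-discovery pass
        vocab.update(sentence)
    for _ in range(epochs):        # training passes
        for sentence in corpus:
            pass                   # real training would happen here
    return vocab


corpus = CountingCorpus([["hello", "world"], ["hello", "again"]])
vocab = fake_word2vec(corpus, epochs=5)
print(corpus.passes)  # 1 vocabulary pass + 5 training passes
```

Under these assumptions the corpus is read six times in total, so a "loading corpora file" message per file would be expected to repeat that many times and then stop.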

Also, such questions that aren't necessarily bugs or feature-requests are best discussed on the project forum, https://groups.google.com/forum/#!forum/gensim , rather than as Github issues.

loretoparisi commented 8 years ago

@gojomo Thank you, I have figured it out and solved it. I thought it was a bug because some time ago there was an issue with generators in the Word2Vec init.
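
For context on why generators matter here, this is a minimal plain-Python sketch (no gensim involved) of the difference between a one-shot generator and a restartable iterable when a consumer makes several passes. The names one_shot and Restartable are made up for illustration.

```python
def one_shot(sentences):
    """Generator function: the returned generator can be consumed only once."""
    for sentence in sentences:
        yield sentence


class Restartable(object):
    """Iterable whose __iter__ returns a fresh generator on every pass."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        for sentence in self.sentences:
            yield sentence


data = [["la", "la"], ["na", "na"]]

gen = one_shot(data)
first_pass = list(gen)
second_pass = list(gen)            # the generator is already exhausted

restartable = Restartable(data)
first_again = list(restartable)
second_again = list(restartable)   # a new generator is created each time

print(len(first_pass), len(second_pass))
print(len(first_again), len(second_again))
```

A class with __iter__, like LyricsCorpus above, falls in the second category, which is why it can be read once for the vocabulary and again for each training pass.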