piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

LdaModel trains beyond size of corpus when using an iterable #2553

Closed gochristoph closed 4 years ago

gochristoph commented 5 years ago

Problem description

When streaming documents/bags of words to LdaModel via a custom iterable, LdaModel will train beyond the size of the corpus, with output like 19-07-05 22:53:43 PROGRESS: pass 0, at document #178000/50000 -- where the number to the left of the / is higher than the number to the right of it.

Steps/code/corpus to reproduce

from gensim.models import LdaModel
import logging
logging.basicConfig(format='%(asctime)s  %(message)s', \
    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

class TestIterable:
    def __init__(self):
        self.bag_of_words = [(0,2), (3,1), (6,1), (100,2)]
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0
        logging.info('TestIterable() __iter__ was called')
        return self

    def __next__(self):
        if self.cursor < 50000:
            self.cursor += 1
            return self.bag_of_words
        else:
            logging.info('TestIterable() returned StopIteration')
            raise StopIteration

corpus = TestIterable()
# uncommenting this part will make a list out of the corpus
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)

Using the TestIterable() will result in LdaModel training indefinitely. Converting the TestIterable() corpus to a list leads to the expected result of proper training.

I have not written many iterables so far, and of course there could be a problem there. But as far as I could infer from the LdaModel documentation, all that is required is an iterable -- and to the best of my knowledge, corpus = TestIterable() is a proper iterable, and an iterator as well.

Thanks a lot!

Versions

Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.4 (default, Apr 10 2018, 07:54:00) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
NumPy 1.14.2
SciPy 1.0.1
gensim 3.7.3
FAST_VERSION 0

piskvorky commented 5 years ago

Hm, that's pretty weird. Thanks for the detailed report.

I'll try to have a look at what's going on, but probably not very soon, sorry :(

gochristoph commented 5 years ago

Hi Radim, thanks a lot -- looking forward to your take on it!

Hiyorimi commented 5 years ago

Is this still available?

piskvorky commented 5 years ago

@Hiyorimi yes! Sorry to @gochristoph , I never got around to this.

Hiyorimi commented 5 years ago

@mpenkov I believe that just making a list out of the corpus won't do the trick, since it will slow down the training?

mpenkov commented 5 years ago

Perhaps you want to iterate and return the first element instead of converting to a list?

piskvorky commented 5 years ago

@Hiyorimi the issue here is that LdaModel is reported to train indefinitely, on the given TestIterable object.

I can't imagine why it would do that – LdaModel only uses the iterable via the standard Python for x in iterable: … construct, IIRC. So I suspect the issue is with the iterable, not with LdaModel. But I can't see anything wrong at a glance, so the next step will be to replicate the issue and investigate.

It has nothing to do with converting to a list, that's just a way of debugging the iterable.

Hiyorimi commented 5 years ago

I added a simple test case in a separate branch.

The problem is that here we get the wrapped chunk as:

wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]

When called on TestIterable (or ProblematicIterable in the tests), it will iterate forever.

It happens because TestIterable resets self.cursor on every __iter__ call. While that is perfectly valid for, e.g., determining the sequence length, the reset triggers infinite chunk generation once islice() is called on the iterator, since each islice() call invokes __iter__ again.
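The reset can be reproduced in isolation with a minimal self-resetting iterable (the class and variable names here are illustrative, not from gensim):

```python
import itertools

class SelfResettingIterable:
    """Illustrative iterable that is its own iterator and resets on __iter__."""
    def __init__(self, n):
        self.n = n
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0  # reset on every __iter__ call -- the root of the bug
        return self

    def __next__(self):
        if self.cursor < self.n:
            self.cursor += 1
            return self.cursor
        raise StopIteration

seq = SelfResettingIterable(5)
# islice() calls iter() on its argument, which resets the cursor each time,
# so successive "chunks" never advance through the sequence:
chunk1 = list(itertools.islice(seq, 2))
chunk2 = list(itertools.islice(seq, 2))
print(chunk1, chunk2)  # [1, 2] [1, 2] -- the second chunk starts over
```

A chunking loop built on repeated islice() calls over such an object therefore produces the same first chunk forever.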

I am not sure whether this is a valid use case for LdaModel; probably @gochristoph can clarify whether that resolves the matter.

I can think of only one simple check, which is suggested in PR #2633.

Hiyorimi commented 5 years ago

@piskvorky any news?)

piskvorky commented 5 years ago

Why does islice() create infinite recursion? I still don't get it, sorry.

Is it an issue with how the iterable is constructed, or with how Gensim consumes it?

Hiyorimi commented 5 years ago

It is an issue with how the iterable is constructed: its index gets reset every time iter() is called.

piskvorky commented 5 years ago

I think that's desirable: each iteration = each for x in iterable: should start from the beginning. Why is that an issue?

Hiyorimi commented 5 years ago

Because on line 1163 we have:

it = iter(iterable)

I can't say I'm an expert with iterables, but that call is supposed to prevent the iterator index from resetting, isn't it?

piskvorky commented 4 years ago

OK, I had a look. The issue is indeed that islice calls __iter__(self), which in turn resets self.cursor with each islice call.

So we're iterating over the same initial chunk of 2000 documents (the result of islice(corpus, 2000)) over and over, instead of advancing through the corpus.

@gochristoph the solution is to make TestIterable a proper iterable, returning a fresh iterator (in the example below, a generator) from each call to __iter__:

import logging
from gensim.models import LdaModel

logging.basicConfig(format='%(asctime)s  %(message)s', datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

class TestIterable:
    def __iter__(self):
        logging.info('TestIterable() __iter__ was called')
        cursor = 0
        while cursor < 50000:
            cursor += 1
            yield [(0, 2), (3, 1), (6, 1), (100, 2)]

corpus = TestIterable()
# uncommenting this part will make a list out of the corpus
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)

This code works and is also simpler.

Does that make sense?

gochristoph commented 4 years ago

@Hiyorimi, @piskvorky Thanks for looking into it.

The issue that I see is that this leads to inconsistent behavior, and that the workaround is implemented on the outside (so not in LdaModel, but in the user code). Word2Vec works perfectly with the iterable, but LdaModel does not. And I guess the behavior should be consistent.

@piskvorky We did not choose the way of wrapping a generator as an iterable because it is more transparent to directly construct the iterable for our purposes (fetching data from a database with a server-side cursor). Can you point out why the TestIterable in the initial example is not a proper iterable? As far as I can see, __iter__() and __next__() are doing the jobs they are supposed to do.

Thanks a lot!

piskvorky commented 4 years ago

Perhaps. This behaviour was surprising to me too.

I guess one solution would be to replace the built-in itertools.islice() in utils.grouper() (used in LdaModel to read chunks of documents from your corpus) by something that treats its input as an iterator, without calling iter() on it again.
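A sketch of what such a replacement could look like (a hypothetical helper, not gensim's actual code): it calls iter() exactly once and then only advances the resulting iterator with next(), so even a shared-state iterable like TestIterable would be consumed linearly:

```python
def grouper_next_only(iterable, chunksize):
    """Yield lists of up to `chunksize` items, calling iter() only once."""
    it = iter(iterable)  # the single iter() call
    while True:
        chunk = []
        for _ in range(chunksize):
            try:
                chunk.append(next(it))  # advance; never call iter() again
            except StopIteration:
                if chunk:
                    yield chunk  # emit the final, partial chunk
                return
        yield chunk

print(list(grouper_next_only(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because the same iterator object is advanced with next() throughout, a self-resetting __iter__ on the original iterable is only triggered once, at the start.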

Since your TestIterable is both an iterable and its own iterator, sharing the same state, calling islice behaves weirdly. But even if we avoid islice, it's possible you'll hit issues elsewhere. Returning a new iterator from your __iter__ looks both saner and safer.

Where you fetch data from (database or otherwise) is not relevant, you can do that with a generator too.

CC @mpenkov @gojomo thoughts?

gojomo commented 4 years ago

IMO, __iter__() should, by Python's conventions, return a new iterator, and any call to __iter__() should have no effect on any prior returned iterators from __iter__() (no matter their state/progress/exhaustion).

That an iterable works fine with one particular client (in this case Word2Vec) isn't proof that it's a proper iterable for all clients counting on the usual behavior.

For example, I would expect any iterable to be able to pass the following tests:

def test_serial_iterators(iterable):
    i1 = iter(iterable)
    len1 = sum(1 for _ in i1)
    i2 = iter(iterable)
    len2 = sum(1 for _ in i2)
    assert len1 == len2  # repeated iterations should work & report same number of items

def test_concurrent_iterators(iterable):
    assert sum(1 for _ in iter(iterable)) > 1  # ensure at least 2 items present
    i1 = iter(iterable)
    _ = next(i1)
    i2 = iter(iterable)
    len1 = 1 + sum(1 for _ in i1)
    len2 = sum(1 for _ in i2)
    assert len1 == len2  # concurrent iterations should work & report same number of items

The supplied TestIterable won't pass the second test (while all properly-behaving Python iterables will), thus it is the user code that needs fixing, not gensim's use of islice(), which just expects the usual behavior.

(The first test, test_serial_iterators(), is what demonstrates that an iterator itself, capable of only a single iteration, is not a full iterable.)
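For contrast, an iterable that returns a fresh generator from each __iter__ call, along the lines suggested earlier in this thread, satisfies both checks. A minimal sketch (the class name is illustrative):

```python
class GeneratorBackedIterable:
    """Returns a fresh, independent generator from every __iter__ call."""
    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        # A generator function: each call produces a new iterator
        # with its own position, so iterations never interfere.
        yield from self.items

iterable = GeneratorBackedIterable([10, 20, 30])

# Serial iteration: both full passes see the same number of items.
assert sum(1 for _ in iter(iterable)) == sum(1 for _ in iter(iterable)) == 3

# Concurrent iteration: a second iterator doesn't disturb the first.
i1 = iter(iterable)
next(i1)
i2 = iter(iterable)
assert 1 + sum(1 for _ in i1) == sum(1 for _ in i2) == 3
```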

piskvorky commented 4 years ago

Looks like we're in agreement here.

I still think getting rid of islice() in utils.grouper() is desirable. I see no reason for such weird behaviour, it's confusing – we should simply advance the it variable inside the loop, without calling iter on it each time (treat it as an iterator, not iterable).

gojomo commented 4 years ago

It's a little odd to be calling iter() on something that all subsequent paths (via islice()) also effectively call iter() on again, but I'm not sure there's any weird/surprising externally-observable behavior as long as a proper iterable-sequence is provided. (The parameter's name is, after all, iterable.) So I'm not sure there's anything needing fixing, though perhaps the code could be clearer/prettier.

To that end, this implementation has always seemed a bit clunky to me. It has always "felt like" there should be a more Pythonic, compact idiom for doing this sort of grouping – but I've not come across one. (The grouper() in the itertools docs recipes is almost, but not quite, right, due to the filling Nones.) It now occurs to me that groupby() with a synthetic grouping key could do the trick. E.g., roughly (not yet considering the numpy-optimization option):

from itertools import count, groupby
def grouper_no_filling(iterable, n): 
    groups = (i for i in count() for _ in range(n))  # infinite generator
    for k, g in groupby(iterable, key=lambda item: next(groups)):
        yield list(g)

Maybe replace existing gensim grouper()/chunkize_serial() with a call to this (or some numpy-value-optimized-version when that's elected)?

piskvorky commented 4 years ago

AFAIK what we have now is the "standard", idiomatic way of grouping an iterable.

Replacing that with another convoluted itertools construct seems an overkill to me. Iterating over a sequence is trivial enough. We just need a simple loop: explicit and clear. Performance is likely no bottleneck here.

To be clear, this discussion is no longer about the original issue (where we agree the user-supplied "shared-iterators-iterable" was the problem). It's about a cosmetic change where we could group an iterable while calling iter() on it only once, which seems more correct to me than the current version… even if there's no external difference for well-behaved iterables.

piskvorky commented 4 years ago

Looking at how others solve this, I googled up boltons' version: https://boltons.readthedocs.io/en/latest/_modules/boltons/iterutils.html#chunked_iter

But it's basically what we have in Gensim now (i.e. iter + islice), sans our memory optimization.

gojomo commented 4 years ago

I don't particularly see any 1-to-2-line application of itertools as being "convoluted" – even where its operation isn't maximally readable, if it reliably does whatever simple thing it says it does, the compactness & reliance on standard-library constructs can justify its use.

That boltons implementation strikes me as overkill – but if that library is considered a rugged, reliable source of functions, should we just bring it in and re-use its implementation?

I see the gensim numpy optimization as something that would have been just as efficient, and clearer, to do as a map-wrapping step totally outside the utility function – in only those places where the caller knows that np.array-compactness & cross-thread communication is important. (It's not like the function can surprise unaware callers with a more-efficient return-value, since callers have to request the optimization with a non-default as_numpy=True argument.)

Also, I'm not sure about the theory supporting the "memory opt: wrap the chunk and then pop(), to avoid leaving behind a dangling reference" wrap/pop. (I don't see how that would effectively break any "dangling reference".)

In fact, for generalizability/composability, the utility function could yield generators in the style of itertools, and only callers who need lists would convert those generators. For example, replacing the existing chunkize_serial() with:

import numpy as np
from itertools import count, groupby

def grouper_unfilled(iterable, n):
    groups = (i for i in count() for _ in range(n))  # infinite generator
    for k, g in groupby(iterable, key=lambda item: next(groups)):
        yield g

# for backward-compatibility until all callers expecting old lists/arrays are coaxed to use above directly
def chunkize_serial(iterable, chunksize, as_numpy=False, dtype=np.float32):
    mapfn = list
    if as_numpy:
        mapfn = lambda item: np.array(item, dtype=dtype)
    return map(mapfn, grouper_unfilled(iterable, chunksize))

grouper = chunkize_serial  # more backward-compatibility

piskvorky commented 4 years ago

I see what you mean, but -1 on that, unless profiling shows some tremendous performance benefit. Way too clever, unreadable.

My preference is a plain:

def chunkize_serial(iterable, chunksize):
    current_chunk = [[]]
    for value in iterable:
        current_chunk[0].append(value)
        if len(current_chunk[0]) >= chunksize:
            yield current_chunk.pop()
            current_chunk = [[]]
    if current_chunk[0]:
        yield current_chunk.pop()

Or maybe even without the "memory trick" of avoiding a named variable refcount via pop(). That optimization is probably not necessary any more, who cares about a bit of extra memory.
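Without the pop() trick, that same proposal reduces to a plain accumulate-and-yield loop (a sketch of the simplification being suggested, not merged code):

```python
def chunkize_serial_simple(iterable, chunksize):
    """Yield lists of up to `chunksize` consecutive items from `iterable`."""
    chunk = []
    for value in iterable:
        chunk.append(value)
        if len(chunk) >= chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller, chunk

print(list(chunkize_serial_simple(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

This iterates the input exactly once with a single for loop, so it also sidesteps the repeated-iter() behavior of islice() discussed above.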