piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Support for async iterators as Corpus #2374

Closed mhham closed 5 years ago

mhham commented 5 years ago

It would be interesting to support asynchronous iterators as a corpus in all the transformations. This is especially useful when the corpus is stored on a remote device or database.

More precisely, I mean enabling the usage of such a corpus:

class AsyncCorpus:
    def __aiter__(self):
        return self

    async def __anext__(self):
        data = await self.fetch_data()
        if data:
            return data
        else:
            raise StopAsyncIteration

    async def fetch_data(self):
        ...

as:

tfidf_model = gensim.models.TfidfModel(corpus)

or

tfidf_corpus = tfidf_model[corpus]

where corpus is an AsyncCorpus.

A possibility would be to transform the asynchronous iterator into a synchronous one, as suggested here.
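One way such a conversion could look is sketched below. This is a minimal illustration and not necessarily the approach from the linked suggestion: `iter_sync` and `ToyAsyncCorpus` are hypothetical names, and the sketch assumes that no event loop is already running in the calling thread.

```python
import asyncio


async def _next_item(async_iterator):
    # Wrap __anext__ in a coroutine so that any awaitable it returns can be run.
    return await async_iterator.__anext__()


def iter_sync(async_iterable):
    """Yield items from an async iterable inside ordinary synchronous code.

    Assumption: no event loop is already running in the calling thread.
    """
    loop = asyncio.new_event_loop()
    async_iterator = async_iterable.__aiter__()
    try:
        while True:
            try:
                # Block the caller until the next item has been fetched.
                yield loop.run_until_complete(_next_item(async_iterator))
            except StopAsyncIteration:
                return
    finally:
        loop.close()


class ToyAsyncCorpus:
    """Hypothetical stand-in for the AsyncCorpus above, backed by a list."""

    def __init__(self, docs):
        self._docs = list(docs)
        self._pos = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        await asyncio.sleep(0)  # stands in for a real network call
        if self._pos >= len(self._docs):
            raise StopAsyncIteration
        doc = self._docs[self._pos]
        self._pos += 1
        return doc


# Usage: the async corpus is consumed with a plain `for` loop.
for document in iter_sync(ToyAsyncCorpus([[(0, 1)], [(1, 2)]])):
    print(document)
```

The result is an ordinary generator, so it can be consumed only once. Each item is also fetched one at a time, so this adapts the interface without gaining any concurrency.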

piskvorky commented 5 years ago

I don't know much about async iterators. What would be the advantage of this, over normal iterators (which already support remote devices/databases)?

mhham commented 5 years ago

Given how gensim models currently work (synchronously), asynchronous iterators offer no advantage over synchronous ones (IMO). That is why the simple starting solution I suggested is to transform asynchronous iterators into synchronous ones.

However, they are really useful in network-bound or I/O-bound environments, where they allow for concurrency and so make things faster (e.g. if the corpus is stored in a database that is queried asynchronously).

So it might be interesting to consider implementing asynchronous gensim models. That way, if a single document of the corpus takes a long time to be downloaded/processed, it does not lock the whole process.

piskvorky commented 5 years ago

> That way, if a single document of the corpus takes a long time to be downloaded/processed, it does not lock the whole process.

I think I understand what you mean. You'd like to be pre-loading document(s) while the model training runs, right?

We implement such async pre-loading via a background process (although threads would be OK too for I/O-bound processes, because of GIL), in utils.chunkize.

Preloading documents into RAM indefinitely is not a good idea (the corpus can be larger than RAM), and pre-loading a single document at a time when the model requests it would be too slow. You'd want a "chunking" preloader… which is what we're doing now, and what we encourage users to do with their own corpus readers.
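The idea of a "chunking" preloader can be sketched with the standard library alone. This is not gensim's utils.chunkize code: it uses a background thread rather than a process, and `preload_chunks`, `chunksize` and `max_pending` are hypothetical names chosen for illustration. The bounded queue keeps only a few chunks in RAM at any time.

```python
import itertools
import queue
import threading

_DONE = object()  # sentinel marking the end of the corpus


def preload_chunks(corpus, chunksize, max_pending=2):
    """Yield lists of up to `chunksize` documents, read ahead by a worker thread."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so memory use is capped

    def worker():
        try:
            documents = iter(corpus)
            while True:
                chunk = list(itertools.islice(documents, chunksize))
                if not chunk:
                    break
                pending.put(chunk)  # blocks while the queue is full
        finally:
            # Always signal the end, so the consumer never waits forever.
            # Note: errors raised by the corpus are not re-raised in this sketch.
            pending.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        chunk = pending.get()
        if chunk is _DONE:
            return
        yield chunk


# Usage: the worker reads ahead while the consumer handles the current chunk.
for chunk in preload_chunks(range(7), chunksize=3):
    print(chunk)
```

With `max_pending=2`, at most two finished chunks wait in the queue while the worker builds a third, regardless of the corpus size.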

I'll close this now, but feel free to reopen if you have a clear motivating example; it sounds interesting. Thanks!

mhham commented 5 years ago

Thank you for your answer!

> I think I understand what you mean. You'd like to be pre-loading document(s) while the model training runs, right?

That is indeed one of the advantages of working asynchronously!

> We implement such async pre-loading via a background process (although threads would be OK too for I/O-bound processes, because of GIL), in utils.chunkize.

Yes, that is an excellent feature, but it complements asynchronous iterators rather than replacing them. In fact, your implementation works only with synchronous iterators (a corpus as an iterable object). The problem is that asynchronous iterators are NOT iterable in the synchronous sense (using a for loop).
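A small demonstration of that limitation, using a hypothetical `AsyncNumbers` class in place of a real corpus: a plain for loop fails with a TypeError because the object defines `__aiter__` but not `__iter__`, while async for works.

```python
import asyncio


class AsyncNumbers:
    """Hypothetical async iterator yielding 0, 1, 2."""

    def __init__(self):
        self._next = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self._next >= 3:
            raise StopAsyncIteration
        value = self._next
        self._next += 1
        return value


def try_sync_iteration(obj):
    """Return the error message raised by a plain `for` loop, or None."""
    try:
        for _ in obj:
            pass
    except TypeError as exc:
        return str(exc)
    return None


async def collect(obj):
    """Consume the same kind of object with `async for`."""
    return [item async for item in obj]


print(try_sync_iteration(AsyncNumbers()))  # a TypeError message
print(asyncio.run(collect(AsyncNumbers())))
```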

A simple example: I store my corpus in MongoDB and query it using an asynchronous client (like Motor). It is impossible to get a synchronous iterator this way; what we get is an asynchronous one, which we iterate over using async for instead of for.

There are several ways of solving this:

1) Dropping the asynchronous client and using a synchronous one. This is not great, since the Mongo client is used not only for querying the corpus but also for other tasks that require asynchronous behaviour.
2) Transforming the async iterator into a synchronous one.
3) Implementing async iterables in gensim, which is really not straightforward (cf. the distinction between the asynchronous and synchronous Mongo clients, Motor and PyMongo).
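Option 2 might be sketched as a wrapper that can be iterated repeatedly, since in the usage shown earlier the corpus is read once to build the TfidfModel and again for tfidf_model[corpus]. Everything here is an assumption made for illustration: `SyncCorpusView`, the `make_async_iterable` factory and `toy_documents` are hypothetical, and no Motor-specific behaviour is modelled. In particular, an async client may be tied to the event loop it was created for, which this sketch does not handle.

```python
import asyncio
import threading


async def _fetch_next(async_iterator):
    # run_coroutine_threadsafe needs a real coroutine, so wrap __anext__.
    return await async_iterator.__anext__()


class SyncCorpusView:
    """Present an async corpus as a plain, restartable Python iterable.

    `make_async_iterable` is a hypothetical zero-argument factory returning a
    fresh async iterable, so that every pass starts from the beginning.
    """

    def __init__(self, make_async_iterable):
        self._make = make_async_iterable
        # A private event loop in a daemon thread drives the coroutines.
        self._loop = asyncio.new_event_loop()
        threading.Thread(target=self._loop.run_forever, daemon=True).start()

    def __iter__(self):
        async_iterator = self._make().__aiter__()
        while True:
            future = asyncio.run_coroutine_threadsafe(
                _fetch_next(async_iterator), self._loop
            )
            try:
                yield future.result()  # blocks the caller until the item arrives
            except StopAsyncIteration:
                return


async def toy_documents():
    """Hypothetical async source standing in for a database cursor."""
    for doc in (["human", "interface"], ["graph", "trees"]):
        await asyncio.sleep(0)  # stands in for a network round trip
        yield doc


corpus = SyncCorpusView(toy_documents)
print(list(corpus))  # first pass
print(list(corpus))  # second pass restarts from the beginning
```

Because the event loop lives in its own thread, the calling code stays fully synchronous, which is what the current gensim models expect.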

The asynchronous syntax in Python is great for network-bound and I/O-bound tasks, and it would be great to be able to use it with gensim too!