piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Potential unification/optimization/simplification/enhancement refactor of *2Vec & related algorithms (FastText, Sent2Vec, FastSent, etc) #1623

Closed: gojomo closed this issue 4 years ago

gojomo commented 7 years ago

Word2Vec, Doc2Vec, FastText, FastSent (#612), Sent2Vec (#1376), 'Doc2VecWithCorruption' (#1159) and others are variants on the same core technique. They should share more code, and perhaps even be implemented as alternate parameter-choices on the same refactored core functions.

A big refactoring (including from-scratch API design) could potentially offer some or all of the following:

  1. sharing more code between different modes (SG/CBOW/DBOW/DM/FastText-classification/other), by discovering the ways they're parameterized variants of a shared process

  2. making other creative variations possible, even if just experimentally (different kinds of context-windows, dropout strategies, alternate learning-optimizations like AdaGrad/etc, re-weightings of individual examples/vectors, separate input/output vocabularies, 'bloom embeddings', more kinds of 'inference', etc)

  3. making it easier to use non-natural-language datasets, perhaps by providing the ability to supply examples in an interim (raw-int-index) format (rather than string tokens), plus example transformer/caching classes that turn either texts or other corpora into the right format (see the sketch after this list)

  4. eliminating the hard-to-maintain dual-path pure-Python & Cython implementations - perhaps by going to something like Numba-only, or removing the (performance-non-competitive) pure-Python paths whenever Cython code is clean enough

  5. avoiding common user errors & sources of confusion - by renaming parameters/methods, updating defaults, separating logically distinct steps into independent code/classes – then providing updated demo notebooks showing the new modes of operation

  6. throughput optimizations, including getting away from the 'master single-corpus-reader thread', or using processes rather than threads if that's the only way to avoid GIL contention bottlenecks

  7. separating vocabulary-management into explicitly different classes/objects, for more control/customization, perhaps including closer integration with new n-gram (phrasing) options
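
To make point 3 above concrete, here is a minimal sketch (all names hypothetical, not an existing gensim API) of a transformer class that converts string-token texts into an interim raw-int-index format; a non-text corpus that is already integer-coded could skip this step and be fed to the same training core directly:

```python
# Hypothetical transformer: token lists -> lists of int indexes.
class IndexedCorpus:
    def __init__(self, texts, vocab):
        self.texts = texts    # re-iterable iterable of token lists
        self.vocab = vocab    # dict mapping token -> int index

    def __iter__(self):
        for tokens in self.texts:
            # tokens missing from the vocabulary are simply dropped in this sketch
            yield [self.vocab[t] for t in tokens if t in self.vocab]
```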

piskvorky commented 7 years ago

Re. 4: I agree. I feel we can drop the pure-Python compatibility in general. Running many of the algorithms (word2vec included) in pure Python is rarely what the users need; it's more often a bug, a problem with their setup.

A better approach is to make the installation more modular (optional installs). If somebody doesn't need module X (word2vec), don't force them to install a C compiler or Fortran or whatnot.
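
For illustration, such a modular install could be expressed with setuptools "extras". This is only a rough sketch, not gensim's actual setup.py; the package names and version are placeholders:

```python
from setuptools import setup

setup(
    name="gensim",
    version="0.0.0",  # placeholder
    install_requires=["numpy", "scipy", "smart_open"],  # always-needed core deps
    extras_require={
        # `pip install gensim[word2vec]` would pull in the build dependencies
        # for the compiled *2vec paths only when the user explicitly asks.
        "word2vec": ["Cython"],
    },
)
```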

Re. 6: threads are fine (we're already in C-land), but we should extend the single-corpus-reader API to multiple-corpus-readers to increase throughput. There has already been some discussion around this, although the issue link escapes me now.

Somewhat related to 7: we're releasing Bounter now (practically speaking, already released, but it needs some final cleanup & promo). The Phrases module in gensim will need a corresponding update too, by @isamaru.

gojomo commented 7 years ago

Re: 4 – The pure Python has been nice as a welcoming illustration of the algorithms for beginners, or as an initial testbed... but maintaining both is a lot of overhead, and the performance is so non-competitive that anyone still using it really needs to fix that first. The Cython, though, is way uglier (for those who don't love C/C++) and ornery to debug... hence the hope that maybe Numba would allow more Pythonic code, with some restrictions, to still approach C/Cython performance. I haven't yet had experience with potential Numba-specific packaging/deploy issues, but they might be no worse than those with the Cython DLLs.

Re: 6 - I'd hope better use of threads would be sufficient, but there may be levels of full utilization, especially on the sorts of 64- or 128-core machines more people are renting, that can't be achieved with the GIL. But training, especially, can work with just big shared arrays between multiple processes... so depending on the results of earlier optimizations, I could see multi-processing options appearing on the roadmap. (And certain refactorings or training modes could make that easier or harder.)

piskvorky commented 7 years ago

4: I have no practical experience with Numba myself, but my worry would be that it's hairier to deploy and distribute than plain C (C compilers are everywhere vs. lock-in to the Continuum ecosystem), with greatly reduced performance and ability to optimize code. The pro, on the other hand, is simpler code and (much) easier maintenance and extensibility. Worth exploring / benchmarking.

gojomo commented 7 years ago

Just after creating this issue, we received notice of a new FB research library, 'StarSpace', which tries to be a generic model for many text-embedding needs. The rationale, design, and features for any refactoring we do should take into account what StarSpace's choices & capabilities reveal.

gojomo commented 7 years ago

An idea prompted by the 'trans-gram' (#1629) request:

This refactor could include a sort of 'inversion of control', where instead of a single model controlling its own training loop, there's an external training-driver, that pushes examples/batches into the model. Then creating a hybrid model – alternating skip-gram and trans-gram, as in #1629 – would be an alternate driver, that pushes interleaved (batches of) examples to two internal models, which may also happen to share input (keyed-vector) and hidden weights, just not the format of examples or construction of input-contexts.
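
As a rough sketch of that inversion of control (all class and method names here are hypothetical, not a proposed gensim API), an external driver owns the training loop and alternates batches between two models that share input weights but build their own examples/contexts:

```python
class InterleavedTrainer:
    """Hypothetical external driver: pushes batches into models instead of
    each model running its own internal training loop."""

    def __init__(self, models):
        self.models = models  # e.g. [skipgram_model, transgram_model]

    def train(self, batch_streams, epochs=1):
        for _ in range(epochs):
            # one batch stream per model; alternate between the models
            for batches in zip(*batch_streams):
                for model, batch in zip(self.models, batches):
                    model.train_batch(batch)  # each model applies its own update
```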

Are there any other ML libraries that are known to have a good API for arbitrarily combined training objectives like this?

souravsingh commented 7 years ago

Regarding Numba, there is a benchmark done here: https://github.com/wcbeard/word2vec and a full report written here: http://wcbeard.github.io/blog/2016/05/03/word2vec/ comparing Numba against Gensim's Cython code for word2vec. It mentions that Numba is 5 times slower than Gensim's Cython code, but faster than NumPy code.

piskvorky commented 7 years ago

@souravsingh nice find. That's 5x slower than an (old) gensim version in single-worker mode. How well Numba can parallelize the computation is another question (in addition to the deployment & lock-in).

I like the idea of writing and maintaining only "pure Python" code, but C is not that hard, and gives much more flexibility and ubiquity. So we have to be careful with the trade-offs.

@gojomo I tweeted about StarSpace two days ago -- it looks super cool, worth benchmarking/evaluating. It's great for inspiration, but C++ (with Boost!) is not something we'd like to maintain directly.

gojomo commented 7 years ago

I'm a little suspicious of that 5x benchmark for a few reasons: (1) the timing includes the pure-Python vocabulary-discovery steps, so it's not a Cython-training-only vs. Numba-training-only comparison; (2) there's a grab-bag of optimizations applied to the 'numba' version, each of which is reported to help a lot, but no clear indication that every potential avenue for such optimizations has been pursued; (3) suspiciously, gensim's word2vec needs 2 epochs to match the quality score his Numba code gets in only 1, indicating some other unacknowledged difference in implementation.

Other Numba vs. Cython microbenchmarks can show a Numba advantage (such as http://gouthamanbalaraman.com/blog/optimizing-python-numba-vs-cython.html), and Numba code (by allowing the GIL to be released similarly often) will likely be capable of multithreading speedups similar to Cython's. So a truer comparison might simply reimplement just the inner function where we switch to Cython, such as train_batch_sg() in the skip-gram case.
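
As a sketch of what such a reimplementation might look like (illustrative skip-gram/negative-sampling math only, not gensim's actual kernels; the array and parameter names are made up), a Numba-compiled inner function can release the GIL just as the Cython version does:

```python
import numpy as np
from numba import njit

@njit(nogil=True)
def train_pair_sg_neg(syn0, syn1neg, word_idx, context_idx, neg_idxs, alpha):
    """One skip-gram/negative-sampling update for a single (word, context) pair."""
    l1 = syn0[context_idx]              # input (context) vector
    work = np.zeros_like(l1)            # accumulated gradient for l1
    for i in range(neg_idxs.shape[0] + 1):
        target = word_idx if i == 0 else neg_idxs[i - 1]
        label = 1.0 if i == 0 else 0.0
        l2 = syn1neg[target]
        f = 1.0 / (1.0 + np.exp(-np.dot(l1, l2)))   # sigmoid of the score
        g = (label - f) * alpha
        work += g * l2                  # gradient w.r.t. the input vector
        syn1neg[target] += g * l1       # update the output vector in place
    syn0[context_idx] += work
```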

souravsingh commented 7 years ago

While I like Numba for the code optimizations, another problem we could potentially face is keeping up with the project.

Numba pushes releases quite frequently, which is good for obtaining latest features and fixes, but could be bad if we are aiming to maintain code.

gojomo commented 7 years ago

Often a dependency's rapid releases (for improvements or bug-fixing) are a good thing! Or does Numba have a history of breaking backward-compatibility? (If so, isn't the simple and sufficient fix just to pin to known-working versions?)

souravsingh commented 7 years ago

@gojomo It's not so much breaking backward-compatibility as occasional performance regressions when using @jit or similar decorators. We should be fine regarding this issue as long as we don't use too many of the decorators and take advantage of vectorization. I am in favor of using Numba not only for speeding up Word2Vec, but also other algorithms like LDA and HDP. What do you think?

menshikh-iv commented 7 years ago

@gojomo

Re (1), we need to organize a flexible hierarchy of classes (many classes with 2-3 methods each). I think this allows us to reuse code efficiently, and the final model class will be very clear and simple. This approach also gives modularity for point (2).

Re (2), it would be very nice if our models worked like a construction set: many blocks that can be combined. A good example is BigARTM, though it is slightly complicated. We need to think about how to combine "flexibility" and "ease of use".

Re (3), besides this, we have another problem: the non-trivial usage of gensim.corpora. I can't pass an arbitrary corpus to an arbitrary model (this is not only about *2vec, it applies to all models). I would like to have a universal corpus that I can pass to any model.

Re (4), completely agree. I think we should drop the pure-Python fallbacks for the cythonized parts (we already have wheels for win/mac, so we can do it; besides, the "pure Python" models are very slow -> they don't work for any big dataset, i.e. they need a lot of time and are useless for real tasks). Numba is a good alternative for clear and fast code, but we can't "hack" everything for performance. This is a trade-off (simple & readable vs. high performance). Also, it would be nice to have the very same code (without copy-paste) for the multiprocess/distributed versions (all of these have very similar logic) + easy-to-implement distributed learning.

Re (5), my student @anotherbugmaster and I are starting to make a new API reference, and after that we'll work on the notebooks (organization, structure, examples). I hope this will be helpful for our users (well-documented methods with examples from the notebooks are easy to understand).

Re (6), what variants do you see to avoid this problem?

Re (7), a very nice option for all models (something like "gensim.corpora.Dictionary" on steroids)

menshikh-iv commented 6 years ago

Do we consider it resolved @gojomo?

gojomo commented 6 years ago

I haven't had time to watch that PR closely; is there a write-up of any new capabilities, or a comparison of how new APIs are simpler or enable other new modes/possibilities compared to the old arrangement?

How many of the original ideas & 7 points have been addressed?

Are FastSent, Sent2Vec, Doc2VecWithCorruption and similar small variations now possibilities just by changing the parameters of some overarching flexible class, rather than writing a lot of new code?

Re: (1) Is more code shared, and less cut & paste duplication, evident across all these related modes?

Re: (2) Are any of the creative variations listed in point (2) above now easier, perhaps as illustrated by a new example?

Re: (3) Are there any examples of how new classes work better for non-natural-language data?

Re: (4) Has the dual-path Python & Cython implementation been unified to a single path?

Re: (5) Have methods & parameters been renamed for clarity, and key API processes separated into different classes, with updated notebooks, to make common operations easier & less error-prone?

Re: (6) Has throughput on typical datasets been improved? Can the bottleneck of the single reader-thread be avoided with the right kind of shardable corpuses?

Re: (7) Is vocabulary-management moved to a separate, reusable, less-coupled class/object, perhaps well-integrated with improved n-gram/phrasing options?

From a look over PR #1777, it may be doing valuable work, but it doesn't seem very strongly addressed towards the many possibilities that this issue was created to discuss.

menshikh-iv commented 6 years ago

@gojomo

Good overview available here

(1) Yes, copy-paste remains only for the "backward compatibility" stuff (old model code); it will be removed in a major release

(2) I think yes, but I'm not sure (can't show an example right now, CC: @manneshiva)

(3) We now have a common structure both for Word2Vec (text data) and for Poincare (graph data)

(4) In the major release, we'll drop all the "pure Python" stuff to avoid "dual support" (it makes no sense right now because the Cython version is significantly faster + we already have wheels for all platforms #1731 -> no need to compile from sources when installing from pip).

(5) Partially yes.

(6) This isn't part of #1777, but it is planned for GSoC 2018 https://github.com/RaRe-Technologies/gensim/wiki/GSoC-2018-project-ideas#multiple-stream-api

(7) Yes; for example, see how this is done for word2vec - https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1133

About your conclusion, I agree; let's leave this issue open.

gojomo commented 6 years ago

I'm not so sure of (1), as the patch resulted in a net increase of over 8000 lines of code.

I believe any big wins in usability or generalizability will require discarding backward-compatibility, probably via a parallel, coexisting set of implementations. (That could also make the changes easier to evaluate in a single notebook, placing old and new examples alongside each other for style/naming/flexibility/performance contrasts.)

When I see a promising idea for reusability in the summary like a new separate "Vocabulary Class", that seems only very roughly achieved in the work so far. For example, I see a Word2VecVocab class, very tightly coupled to Word2Vec (to the extent it needs Word2Vec mode-parameters like hs and negative passed-in), and with a somewhat confusing name because Vocab is also the abbreviation used for the individual entries in the WordEmbeddingsKeyedVectors.vocab. But the object inside a WordEmbeddingsKeyedVectors.vocab property is not a Word2VecVocab, but a plain dict. This looks like the same code/model, just spread among more classes/inheritance-levels, without any new clarity or obviously-useful extension-points.
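
For contrast, a more decoupled vocabulary helper might look something like this rough sketch (names hypothetical, not gensim code): it only counts, filters, and indexes tokens, and knows nothing about model-specific parameters such as hs or negative:

```python
from collections import Counter

class Vocabulary:
    """Hypothetical standalone vocabulary: counts, filters, and indexes tokens,
    independent of any model's training parameters."""

    def __init__(self, min_count=5):
        self.min_count = min_count
        self.counts = Counter()
        self.index = {}                 # token -> int id

    def scan(self, corpus):
        for tokens in corpus:
            self.counts.update(tokens)
        return self

    def finalize(self):
        kept = [t for t, c in self.counts.items() if c >= self.min_count]
        self.index = {t: i for i, t in enumerate(sorted(kept))}
        return self
```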

piskvorky commented 4 years ago

Ticket now outdated & mostly fixed in 4.0.0. For any outstanding specific suggestions, let's open separate tickets.