Closed gojomo closed 4 years ago
Re. 4: I agree. I feel we can drop the pure-Python compatibility in general. Running many of the algorithms (word2vec included) in pure Python is rarely what the users need; it's more often a bug, a problem with their setup.
A better approach is to make the installation more modular (optional installs). If somebody doesn't need module X (word2vec), don't force them to install a C compiler or Fortran or whatnot.
Re. 6: threads are fine (we're already in C-land), but we should extend the single-corpus-reader API to multiple-corpus-readers to increase throughput. There has already been some discussion around this, although the issue link escapes me now.
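To make the multiple-corpus-readers idea concrete, here is a minimal sketch using only the standard library. The names `iter_shards` and `_read_shard` are hypothetical and not part of gensim; the sketch assumes each shard is an independent iterable of texts and that ordering across shards does not matter.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of one shard


def _read_shard(shard, out_queue):
    """Push every item of one shard onto the shared queue, then a sentinel."""
    for item in shard:
        out_queue.put(item)
    out_queue.put(_DONE)


def iter_shards(shards, maxsize=1000):
    """Yield items from several iterables, each read by its own thread.

    Hypothetical helper: ordering across shards is not preserved.
    """
    out_queue = queue.Queue(maxsize=maxsize)
    threads = [
        threading.Thread(target=_read_shard, args=(shard, out_queue), daemon=True)
        for shard in shards
    ]
    for thread in threads:
        thread.start()
    remaining = len(threads)
    while remaining:
        item = out_queue.get()
        if item is _DONE:
            remaining -= 1  # one reader has finished its shard
        else:
            yield item
```

Reader threads written in pure Python still contend for the GIL, so a layout like this is only likely to help when reading is dominated by I/O or by code that releases the GIL.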
Somewhat related to 7: we're releasing Bounter now (practically speaking, already released, but needs some final cleanup & promo). The Phrases in gensim will need a corresponding update too by @isamaru .
Re: 4 - The pure Python has been nice as a welcoming illustration of the algorithms for beginners, or as an initial testbed... but maintaining both is a lot of overhead, and the performance is so non-competitive that anyone still using it really needs to fix that first. The Cython, though, is way uglier (for those who don't love C/C++) and ornery to debug... hence the hope that maybe Numba would allow more Pythonic code, with some restrictions, to still approach C/Cython performance. I haven't yet had experience with potential Numba-specific packaging/deploy issues, but they might be no worse than those with the Cython DLLs.
Re: 6 - I'd hope better use of threads would be sufficient, but there may be levels of full utilization, especially on the sorts of 64- or 128- core machines more people are renting, that can't be achieved with the GIL. But training, especially, can work with just big shared arrays between multiple processes... so depending on the results of earlier optimizations, I could see multi-processing options appearing on the roadmap. (And certain refactorings or training modes could make that easier or harder.)
4: I have no practical experience with Numba myself, but my worry would be that it's hairier to deploy and distribute than plain C (C compilers are everywhere vs lock-in in the Continuum ecosystem), and with greatly reduced performance and ability to optimize code. The pro, on the other hand, is simpler code and (much) easier maintenance and extensibility. Worth exploring / benchmarking.
Just after creating this issue, we received notice of a new FB research library, 'StarSpace', which tries to be a generic model for many text-embedding needs. The rationale, design, and features for any refactoring we do should take into account what StarSpace's choices & capabilities reveal.
An idea prompted by the 'trans-gram' (#1629) request:
This refactor could include a sort of 'inversion of control', where instead of a single model controlling its own training loop, there's an external training-driver, that pushes examples/batches into the model. Then creating a hybrid model – alternating skip-gram and trans-gram, as in #1629 – would be an alternate driver, that pushes interleaved (batches of) examples to two internal models, which may also happen to share input (keyed-vector) and hidden weights, just not the format of examples or construction of input-contexts.
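A rough sketch of that inversion of control, under stated assumptions: `ToyModel`, `train_batch`, and `interleaving_driver` are hypothetical names, the "update" is a placeholder rather than a real skip-gram or trans-gram step, and the two models share weights simply by holding the same NumPy array.

```python
import numpy as np


class ToyModel:
    """Hypothetical model that only knows how to consume one batch."""

    def __init__(self, name, shared_vectors):
        self.name = name
        self.vectors = shared_vectors  # the same array object can be shared
        self.batches_seen = 0

    def train_batch(self, batch, alpha):
        # Placeholder update: nudge the rows of the indexes in this batch.
        for idx in batch:
            self.vectors[idx] += alpha
        self.batches_seen += 1


def interleaving_driver(models_and_batches, alpha=0.1):
    """External driver: round-robin one batch at a time to each model."""
    active = [(model, iter(batches)) for model, batches in models_and_batches]
    while active:
        still_active = []
        for model, batch_iter in active:
            batch = next(batch_iter, None)
            if batch is None:
                continue  # this model's batch stream is exhausted
            model.train_batch(batch, alpha)
            still_active.append((model, batch_iter))
        active = still_active


shared = np.zeros((4, 2), dtype=np.float32)
sg = ToyModel("skip-gram", shared)
tg = ToyModel("trans-gram", shared)
interleaving_driver([(sg, [[0, 1], [2]]), (tg, [[0]])])
```

The point of the design is that the models never own a training loop; a different driver (different interleaving, different example sources) can be swapped in without touching the model classes.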
Are there any other ML libraries that are known to have a good API for arbitrarily combined training objectives like this?
Regarding Numba, there is a benchmark here: https://github.com/wcbeard/word2vec and a full report here: http://wcbeard.github.io/blog/2016/05/03/word2vec/ comparing Numba with Gensim's Cython code for word2vec. It reports that Numba is 5 times slower than Gensim's Cython code, but faster than NumPy code.
@souravsingh nice find. That's 5x slower than an (old) gensim version in single worker mode. How well Numba can parallelize the computation is another question (in addition to the deployment & lock-in).
I like the idea of writing and maintaining only "pure Python" code, but C is not that hard, and gives much more flexibility and ubiquity. So we have to be careful with the trade-offs.
@gojomo StarSpace I tweeted out two days ago -- it looks super cool, worth benchmarking/evaluating. It's great for inspiration but C++ (with Boost!) is not something we'd like to maintain directly.
I'm a little suspicious of that 5X benchmark for a few reasons: (1) the timing includes the pure-python vocabulary-discovery steps, so it's not a cython-training-only to numba-training-only comparison; (2) there's a grab-bag of optimizations applied to the 'numba' version, each of which is reported to help a lot, but not a clear indication that every potential avenue for such optimizations has been pursued; (3) suspiciously gensim word2vec needs 2 epochs to match the quality score his numba code gets in only 1, indicating some other unacknowledged difference in implementation.
Other Numba vs Cython microbenchmarks can show a Numba advantage (such as http://gouthamanbalaraman.com/blog/optimizing-python-numba-vs-cython.html), and the Numba code (by allowing the GIL to be released similarly frequently) will likely be capable of similar multithreading speedups as Cython. So a truer comparison might simply reimplement just the inner function where we switch to Cython, such as train_batch_sg() in the skip-gram case.
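For a sense of how small that inner step is, here is a plain NumPy sketch of one skip-gram negative-sampling update for a single (center, context) pair. It is not gensim's implementation: `sg_neg_update` is a hypothetical name, the array names `syn0`/`syn1neg` follow the convention of the original word2vec code, and details such as skipping a negative sample that equals the context word are left out. A like-for-like benchmark would time only a kernel of this kind in each technology.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sg_neg_update(syn0, syn1neg, center, context, negatives, alpha):
    """One SGD step for a (center, context) pair with negative sampling.

    syn0 holds input vectors and syn1neg output vectors, both of shape
    (vocab_size, dim). Both arrays are modified in place.
    """
    h = syn0[center]            # view of the input vector; updated last
    grad_h = np.zeros_like(h)   # accumulated gradient for the input vector
    targets = [(context, 1.0)] + [(neg, 0.0) for neg in negatives]
    for word, label in targets:
        out = syn1neg[word]
        g = (label - sigmoid(np.dot(h, out))) * alpha
        grad_h += g * out       # uses the output vector before its update
        syn1neg[word] += g * h
    syn0[center] += grad_h
```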
While I like Numba for the code optimizations, another problem we could potentially face is keeping up with the project.
Numba pushes releases quite frequently, which is good for obtaining latest features and fixes, but could be bad if we are aiming to maintain code.
Often a dependency's rapid releases (for improvements or bug-fixing) are a good thing! Or does Numba have a history of breaking backward-compatibility? (If so, isn't the simple and sufficient fix just to pin to known-working versions?)
@gojomo It's not breaking backward-compatibility as such, but there are occasional performance regressions when using @jit or similar decorators. We should be fine regarding this issue as long as we don't use too many of the decorators and take advantage of the vectorization. I am in favor of using Numba not only for speeding up Word2Vec, but also other algorithms like LDA and HDP. What do you think?
@gojomo
Re (1), we need to organize a flexible hierarchy of classes (many classes with 2-3 methods each). I think this would let us reuse code efficiently, and the final model class would be very clear and simple. This approach also gives the modularity needed for point (2).
Re (2), it would be very nice if our models worked like a construction kit: many blocks that can be combined. A good example is BigARTM, but it is slightly complicated. We need to think about how to combine "flexibility" and "easy to use".
Re (3), besides this, we have another problem: non-trivial usage of gensim.corpora. I can't pass any corpus to any model (it's not only about *2vec, it's about all models). I would like to have a universal corpus that I can pass to any model.
Re (4), completely agree. I think we should drop pure-Python support for the cythonized parts: we have wheels for win/mac, so we can afford to, and the "pure Python" models run very slowly, so they don't work for any big dataset (they need a lot of time and are useless for real tasks). Numba is a good alternative for clear and fast code, but with it we can't "hack" everything for performance. This is a tradeoff (simple & readable vs. high performance). Also, it would be nice to have the very same code (without copy-paste) for the multiprocess/distributed versions (because they all have very similar logic), plus easy-to-set-up distributed learning.
Re (5), my student @anotherbugmaster and I are starting to make a new API reference, and after that we'll work on the notebooks (organization, structure, examples). I hope this will be helpful for our users (well-documented methods with examples from notebooks are simple to understand).
Re (6), what variants do you see for avoiding this problem?
Re (7), a very nice option for all models (something like "gensim.corpora.Dictionary" on steroids).
Do we consider it resolved @gojomo?
I haven't had time to watch that PR closely; is there a write-up of any new capabilities, or a comparison of how new APIs are simpler or enable other new modes/possibilities compared to the old arrangement?
How many of the original ideas & 7 points have been addressed?
Are FastSent, Sent2Vec, Doc2VecWithCorruption and similar small variations now possibilities just by changing the parameters of some overarching flexible class, rather than writing a lot of new code?
Re: (1) Is more code shared, and less cut & paste duplication, evident across all these related modes?
Re: (2) Are any of the following now easier, perhaps as illustrated by a new example?
Re: (3) Are there any examples of how new classes work better for non-natural-language data?
Re: (4) Has the dual-path Python & Cython implementation been unified to a single path?
Re: (5) Have methods & parameters been renamed for clarity, and key API processes separated to different classes, with updated notebooks, to make common operations easier & less error-prone?
Re: (6) Has throughput on typical datasets been improved? Can the bottleneck of the single reader-thread be avoided with the right kind of shardable corpuses?
Re: (7) Is vocabulary-management moved to a separate, reusable, less-coupled class/object, perhaps well-integrated with improved n-gram/phrasing options?
From a look over PR #1777, it may be doing valuable work, but it doesn't seem very strongly addressed towards the many possibilities that this issue was created to discuss.
@gojomo
Good overview available here
(1) Yes, copy-paste remains only for "backward compatibility" stuff (old model code), and will be removed in a major release.
(2) I think yes, but I'm not sure (can't show an example right now, CC: @manneshiva)
(3) Now we have a common structure for both Word2Vec (text data) and Poincare (graph data).
(4) In the major release, we'll drop all "pure-python" stuff to avoid "dual support" (it makes no sense right now because the Cython version is significantly faster, and we already have wheels for all platforms (#1731), so there is no need to compile from source when you install from pip).
(5) Partially yes.
(6) This isn't part of #1777, but it is planned for GSoC 2018: https://github.com/RaRe-Technologies/gensim/wiki/GSoC-2018-project-ideas#multiple-stream-api
(7) Yes, for example, as is done for word2vec: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1133
About your conclusion, I agree: let's leave this issue open.
I'm not so sure of (1), as the patch resulted in a net-increase of over 8000 lines-of-code.
I believe any big wins in usability or generalizability will require discarding backward-compatibility, probably via a parallel, coexisting set of implementations. (That could also make the changes easier to evaluate in a single notebook, placing old and new examples alongside each other for style/naming/flexibility/performance contrasts.)
When I see a promising idea for reusability in the summary like a new separate "Vocabulary Class", that seems only very roughly achieved in the work so far. For example, I see a Word2VecVocab class, very tightly coupled to Word2Vec (to the extent it needs Word2Vec mode-parameters like hs and negative passed-in), and with a somewhat confusing name because Vocab is also the abbreviation used for the individual entries in the WordEmbeddingsKeyedVectors.vocab. But the object inside a WordEmbeddingsKeyedVectors.vocab property is not a Word2VecVocab, but a plain dict. This looks like the same code/model, just spread among more classes/inheritance-levels, without any new clarity or obviously-useful extension-points.
Ticket now outdated & mostly fixed in 4.0.0. For any outstanding specific suggestions, let's open separate tickets.
Word2Vec, Doc2Vec, FastText, FastSent (#612), Sent2Vec (#1376), 'Doc2VecWithCorruption' (#1159) and others are variants on the same core technique. They should share more code, and perhaps even be implemented as alternate parameter-choices on the same refactored core functions.
A big refactoring (including from-scratch API design) could potentially offer some or all of the following:
1. sharing more code between different modes (SG/CBOW/DBOW/DM/FastText-classification/other), by discovering the ways they're parameterized variants of a shared process
2. making other creative variations possible, even if just experimentally (different kinds of context-windows, dropout strategies, alternate learning-optimizations like AdaGrad/etc, re-weightings of individual examples/vectors, separate input/output vocabularies, 'bloom embeddings', more kinds of 'inference', etc)
3. making it easier to use non-natural-language datasets, perhaps by providing the ability to supply examples in an interim (raw-int-indexes) format (other than string tokens), and example transformer/caching classes that turn either texts or other corpuses into the right format
4. eliminating the hard-to-maintain dual-path pure-Python & Cython implementations - perhaps by going to something like Numba-only, or removing the (performance-non-competitive) pure-Python paths whenever Cython code is clean enough
5. avoiding common user errors & sources of confusion - by renaming parameters/methods, updating defaults, separating logically distinct steps into independent code/classes - then providing updated demo notebooks showing the new modes of operation
6. throughput optimizations, including getting away from the 'master single-corpus-reader thread', or using processes rather than threads if that's the only way to avoid GIL contention bottlenecks
7. separating vocabulary-management into explicitly different classes/objects, for more control/customization, perhaps including closer integration with new n-gram (phrasing) options
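As an illustration of how the raw-int-indexes format and a separate vocabulary object could fit together, here is a minimal sketch. `Vocabulary` and its methods are hypothetical and are not an existing gensim class; the sketch assumes tokens are any hashable values, so the same object would serve non-natural-language data, and a model would then consume only lists of ints.

```python
from collections import Counter


class Vocabulary:
    """Hypothetical standalone token <-> index mapping, independent of any model."""

    def __init__(self, min_count=1):
        self.min_count = min_count
        self.counts = Counter()
        self.tokens = []      # index -> token
        self.index_of = {}    # token -> index

    def scan(self, corpus):
        """Count tokens over an iterable of token sequences."""
        for doc in corpus:
            self.counts.update(doc)
        return self

    def finalize(self):
        """Assign indexes, most frequent first; ties are broken by str(token)."""
        kept = [t for t, c in self.counts.items() if c >= self.min_count]
        kept.sort(key=lambda t: (-self.counts[t], str(t)))
        self.tokens = kept
        self.index_of = {t: i for i, t in enumerate(kept)}
        return self

    def encode(self, doc):
        """Turn one token sequence into raw int indexes, dropping unknown tokens."""
        return [self.index_of[t] for t in doc if t in self.index_of]
```

Nothing here knows about training-mode parameters, so the same object could be reused across model types, or bypassed entirely when a dataset already arrives as int indexes.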