Further focus/slim keyedvectors.py module

gojomo commented 4 years ago

Pre-#2698, keyedvectors.py was 2500+ lines, including functionality over-specific to other models, & redundant classes. Post-#2698, with some added generic functionality, it's still over 1800 lines.

It should shed some other grab-bag utility functions that have accumulated, & don't logically fit inside the KeyedVectors class.

In particular, the evaluation (analogies, word_ranks) helpers could move to their own module that takes a KV instance as an argument. (If other more-sophisticated evaluations can be contributed, as would be welcome, they should also live alongside those, rather than bloating KeyedVectors.)

The get_keras_embedding method, as its utilit is narrow to very specific uses, and is conditional on a not-necessarily install package, could go elsewhere too – either a kera-focused utilities module, or even just documentation/example code about how to convert to/from keras from `KeyedVectors.

Some of the more advanced word-vector-using calculations, like 'Word Mover's Distance' or 'Soft Cosine SImilarity', could move to method-specific modules that are then better documented/self-contained/optimized, without bloating the generic 'set of vectors' module. (They might be more discoverable, there, as well.)

And finally, some of the existing calculations could be unified/streamlined (especially the two variants of most_similar(), and some of the steps shared by multiple operations). My hope would be the module is eventually <1000 lines.

piskvorky commented 4 years ago

@gojomo do you see this as essential for 4.0.0 = API breaking?

Or can we leave it for a later release?

gojomo commented 4 years ago

This is low-risk and not-hard, but also low-priority.

Making the decision that some things should be relocated would be nicer to do in 4.0.0, along with other "update your imports/code/function-names" changes, but could wait.

piskvorky commented 4 years ago

@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?

I already resolved the get_keras_embedding part in #2937.

gojomo commented 4 years ago

My 1st thoughts would be:

move evaluation code to its own module
move WMD calcs to its own module (even if a passthrough method remains here indefinitely); the options/optimizations for WMD might improve more beneficially in its own space (and evolve in parallel with other extensions like 'soft cosine')

Next but lower-priority, some of the other utility methods may be amenable to more reuse. (Perhaps, the cosmul as an option to most_similar, or some methods being redefined in terms of fewer more-central operations.)

Finally, after everything else has settled, the methods should be reordered by importance & grouped by role, so the autogenerated documentation has the most-used stuff up top, and a casual top-to-bottom read makes more sense.

mpenkov commented 4 years ago

@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?

@piskvorky Sorry I missed this. I'm in the middle of a house move, so I'd rather not get involved until things have settled down.

If this can wait a couple of weeks, then I'd be happy to pick it up then. It looks like my sort of thing, and I've done it a couple of times with gensim already.

piskvorky commented 4 years ago

If this can wait a couple of weeks

Yes, it can. In order of urgency:

Release 4.0.0beta – hopefully this coming week (only TODO left are release/migration docs = my task). I'd like you to do the release process though (or together).
A few weeks to tie up the other loose ends in https://github.com/RaRe-Technologies/gensim/milestone/3 – hopefully including this one.
Release 4.0.0rc1 / full 4.0.0.

mpenkov commented 3 years ago

How about moving the high-level methods from keyedvectors.py to a separate wordtasks.py submodule? They could be pure functions there. For example:

keyedvectors.KeyedVectors.most_similar(self, ...) -> wordtasks.most_similar(model, ...)
keyedvectors.KeyedVectors.similar_by_word(self, ...) -> wordtasks.similar_by_word(model, ...)
keyedvectors.KeyedVectors.similar_by_key -> and so on...
keyedvectors.KeyedVectors.similar_by_vector
keyedvectors.KeyedVectors.wmdistance
keyedvectors.KeyedVectors.most_similar_cosmul
keyedvectors.KeyedVectors.rank_by_centrality
keyedvectors.KeyedVectors.doesnt_match
keyedvectors.KeyedVectors.cosine_similarities
keyedvectors.KeyedVectors.distances
keyedvectors.KeyedVectors.evaluate_word_pairs
keyedvectors.KeyedVectors.evaluate_word_analogies

All the above operations do not modify the keyedvectors model, they are read-only.

This would leave the lower-level IO stuff in keyedvectors.py, so serialization (loading/saving) shouldn't be affected, as far as I understand.

The name wordtasks comes from the keyedvectors docstring:

What can I do with word vectors? You can perform various syntactic/semantic NLP word tasks with the trained vectors.

WDYT?

mpenkov commented 3 years ago

@gojomo Let's remove this from the 4.0 milestone and deal with it later

piskvorky / gensim

Further focus/slim keyedvectors.py module #2873