Open gojomo opened 4 years ago
@gojomo do you see this as essential for 4.0.0 = API breaking?
Or can we leave it for a later release?
This is low-risk and not-hard, but also low-priority.
Making the decision that some things should be relocated would be nicer to do in 4.0.0, along with other "update your imports/code/function-names" changes, but could wait.
@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?
I already resolved the get_keras_embedding
part in #2937.
My 1st thoughts would be:
Next but lower-priority, some of the other utility methods may be amenable to more reuse. (Perhaps, the cosmul
as an option to most_similar
, or some methods being redefined in terms of fewer more-central operations.)
Finally, after everything else has settled, the methods should be reordered by importance & grouped by role, so the autogenerated documentation has the most-used stuff up top, and a casual top-to-bottom read makes more sense.
@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?
@piskvorky Sorry I missed this. I'm in the middle of a house move, so I'd rather not get involved until things have settled down.
If this can wait a couple of weeks, then I'd be happy to pick it up then. It looks like my sort of thing, and I've done it a couple of times with gensim already.
If this can wait a couple of weeks
Yes, it can. In order of urgency:
How about moving the high-level methods from keyedvectors.py to a separate wordtasks.py submodule? They could be pure functions there. For example:
All the above operations do not modify the keyedvectors model, they are read-only.
This would leave the lower-level IO stuff in keyedvectors.py, so serialization (loading/saving) shouldn't be affected, as far as I understand.
The name wordtasks comes from the keyedvectors docstring:
What can I do with word vectors? You can perform various syntactic/semantic NLP word tasks with the trained vectors.
WDYT?
@gojomo Let's remove this from the 4.0 milestone and deal with it later
Pre-#2698,
keyedvectors.py
was 2500+ lines, including functionality over-specific to other models, & redundant classes. Post-#2698, with some added generic functionality, it's still over 1800 lines.It should shed some other grab-bag utility functions that have accumulated, & don't logically fit inside the
KeyedVectors
class.In particular, the evaluation (analogies, word_ranks) helpers could move to their own module that takes a KV instance as an argument. (If other more-sophisticated evaluations can be contributed, as would be welcome, they should also live alongside those, rather than bloating
KeyedVectors
.)The
get_keras_embedding
method, as its utilit is narrow to very specific uses, and is conditional on a not-necessarily install package, could go elsewhere too – either a kera-focused utilities module, or even just documentation/example code about how to convert to/from keras from `KeyedVectors.Some of the more advanced word-vector-using calculations, like 'Word Mover's Distance' or 'Soft Cosine SImilarity', could move to method-specific modules that are then better documented/self-contained/optimized, without bloating the generic 'set of vectors' module. (They might be more discoverable, there, as well.)
And finally, some of the existing calculations could be unified/streamlined (especially the two variants of
most_similar()
, and some of the steps shared by multiple operations). My hope would be the module is eventually <1000 lines.