Remove/refactor useless subpackages

menshikh-iv commented 7 years ago

In gensim, we have many sub-packages, but several of this should be a part of another subpackage, another part is broken/useless and should be removed.

Candidates:

[x] /examples - Old broken code + confused users with name, should be removed.
[x] ~/parsing - Not relevant for gensim (all of this already implemented in NLTK/etc.), should be removed (but before, need to check that this isn't used).~
[x] /scripts - Many scripts for Wikipedia + symlinks + some conversions + w2v_standalone. Need to look into all wiki scripts and understand why this needed, remove all that no needed (w2v_standalone too).
[ ] ~/summarization - need to refactor code and create one model, no need distinct subpackage for this.~
[x] ~/topic_coherence - same as summarization~
[x] nose.py - unused runner for nose, should be removed.

souravsingh commented 7 years ago

@menshikh-iv Should all the scripts in the examples directory be removed?

menshikh-iv commented 7 years ago

@souravsingh Yes, but I personally will works with this, because we need to do everything very carefully (I mean all issues from project)

menshikh-iv commented 7 years ago

Update

We discussed it with @macks22 in #1607 and I thinks we shouldn't miss good hierarchy, but we should do it more carefully and deeply, for this reason

[x] gensim.utils should be split to several "subsubpackages":
- gensim.utils.utils for old gensim.utils
- gensim.utils.text_utils for gensim.parsing
- etc
[x] gensim.topic_coherence should be moved with renaming to gensim.models._coherence
[x] gensim.summarization should be moved to gensim.models.summarization (without creation distinct model)
[x] gensim.parsing should be cleaned and used as text_utils in gensim.utils.text_utils

СС @piskvorky @macks22 wdyt?

piskvorky commented 7 years ago

-1 on gensim.utils.utils -- doesn't look like a good naming scheme.

gensim.utils.parsing - what is the point? Why not simply remove it?

gensim.models.coherence_inner - what is this about? Why inner?

summarization: OK to refactor to a single file under models. Not sure what you mean by (without creation distrinct model).

menshikh-iv commented 7 years ago

gensim.utils.parsing - what is the point? Why not simply remove it?

Because most part of this functions used in different places (mainly in notebooks), I'll remove part of this (that never used, but not all)

gensim.models.coherence_inner - what is this about? Why inner?

Because for CoherenceModel we need a lot of additional well-structured functions, inner because it's for internal usage (in CoherenceModel)

summarization: OK to refactor to a single file under models. Not sure what you mean by (without creation distinct model).

I means I'll move with renaeming gensim.summarization to gensim.models.summarization_inner, and in __init__.py by default will be imported summarize, summarize_corpus and keywords (not moeving all to one file)

-1 on gensim.utils.utils -- doesn't look like a good naming scheme.

I agree, but If we import all functions from utils in gensim/utils/__init__.py, it will not make any difference for users (or move current gensim/utils.py to gensim/utils/__init__.py but I don't think that this approach is better)

piskvorky commented 7 years ago

Aha, so topic_coherence are some internal functions? And the actual model will live in another module?

I still don't understand the summarization part: where will the models/utility functions be? Same module/different modules? What is the package structure?

utils.utils is too generic. We could perhaps do utils.text_utils and utils.math_utils (etc). Still, we definitely want to ensure backward compatibility (import the functions from utils.__init__ so the "old way" continues to work).

parsing: OK. Better a part of text_utils? Does this need a separate module?

menshikh-iv commented 7 years ago

Aha, so topic_coherence are some internal functions? And the actual model will live in another module?

Yes, CoherenceModel in gensim.models

I still don't understand the summarization part: where will the models/utility functions be? Same module/different modules? What is the package structure?

Now structure here and used as from gensim.summarization import summarize, keywords, summarize_corpus

I suggest move gesnim/summarization -> gensim.models.summarization with imports in gensim/models/summarization_inner/__init__.py. New imports will look like

from gensim.models import summarize, keywords, summarize_corpus
from gensim.models.summarization import summarize, keywords, summarize_corpus

Split current utils is a good idea, but I don't think that I should make it now (I'll make it later in distinct PR). For now, I think gensim/utils/utils.py with imports in __init__.py is the best choice

parsing: OK. Better a part of text_utils? Does this need a separate module?

text_utils is OK for parsing, I agree.

piskvorky commented 7 years ago

I'm not an expert in packaging but that _inner stuff looks weird. To me, it is an indication the whole thing should be a (sub)package, gensim.models.summarization.

Why not simply:

 📂 gensim
    └📁 ...
    └📂 models
        └📂 summarization
            └📁 __init__.py
            └📁 some_utils_module.py
            └📁 class_module.py
            └📁 ...

?

menshikh-iv commented 7 years ago

@piskvorky For summarization - completely agree for coherence - _inner is better, because this functions for internal use only (it's like an explicit marker)

piskvorky commented 7 years ago

Internal use is marked with a _ prefix in Python, not _inner suffix. I still don't see why we should do that. CC @gojomo @jayantj thoughts?

menshikh-iv commented 7 years ago

Ok, let's use _

piskvorky commented 7 years ago

How do other projects denote "internal modules and packages"?

menshikh-iv commented 7 years ago

_ is default variant. I asked about _inner because we already have similar things (like word2vec_inner.*)

piskvorky commented 7 years ago

That's a file though (and a special one, not Python), not a package.

I don't feel strongly either way; I think we should follow the common practice here, the path of least surprise for our users.

menshikh-iv commented 7 years ago

Agree with you @piskvorky, let's continue my PR.

gojomo commented 7 years ago

re: _inner – I have not noticed other Python projects grouping "internal-only" functionality into its own distinctly-named module, though I may have just not noticed. Unless it's really longwinded & arcane stuff, I'd keep it alongside the more public classes/functions that it supports, and group/name by describable functionality (either exact functions, or roles like '_support' or '_extras' or '_utils') rather than just the more-aesthetic (and not really enforceable) 'outer'/surface vs 'inner' distinction.

piskvorky / gensim

Remove/refactor useless subpackages #1584