Open menshikh-iv opened 7 years ago
@menshikh-iv Should all the scripts in the examples
directory be removed?
@souravsingh Yes, but I personally will works with this, because we need to do everything very carefully (I mean all issues from project)
Update
We discussed it with @macks22 in #1607 and I thinks we shouldn't miss good hierarchy, but we should do it more carefully and deeply, for this reason
[x] gensim.utils
should be split to several "subsubpackages":
gensim.utils.utils
for old gensim.utils
gensim.utils.text_utils
for gensim.parsing
[x] gensim.topic_coherence
should be moved with renaming to gensim.models._coherence
[x] gensim.summarization
should be moved to gensim.models.summarization
(without creation distinct model)
[x] gensim.parsing
should be cleaned and used as text_utils
in gensim.utils.text_utils
ะกะก @piskvorky @macks22 wdyt?
-1 on gensim.utils.utils
-- doesn't look like a good naming scheme.
gensim.utils.parsing
- what is the point? Why not simply remove it?
gensim.models.coherence_inner
- what is this about? Why inner
?
summarization
: OK to refactor to a single file under models
. Not sure what you mean by (without creation distrinct model)
.
gensim.utils.parsing
- what is the point? Why not simply remove it?
Because most part of this functions used in different places (mainly in notebooks), I'll remove part of this (that never used, but not all)
gensim.models.coherence_inner
- what is this about? Whyinner
?
Because for CoherenceModel
we need a lot of additional well-structured functions, inner
because it's for internal usage (in CoherenceModel
)
summarization
: OK to refactor to a single file undermodels
. Not sure what you mean by(without creation distinct model)
.
I means I'll move with renaeming gensim.summarization
to gensim.models.summarization_inner
, and in __init__.py
by default will be imported summarize
, summarize_corpus
and keywords
(not moeving all to one file)
-1 on gensim.utils.utils -- doesn't look like a good naming scheme.
I agree, but If we import all functions from utils in gensim/utils/__init__.py
, it will not make any difference for users (or move current gensim/utils.py
to gensim/utils/__init__.py
but I don't think that this approach is better)
Aha, so topic_coherence
are some internal functions? And the actual model will live in another module?
I still don't understand the summarization
part: where will the models/utility functions be? Same module/different modules? What is the package structure?
utils.utils
is too generic. We could perhaps do utils.text_utils
and utils.math_utils
(etc). Still, we definitely want to ensure backward compatibility (import the functions from utils.__init__
so the "old way" continues to work).
parsing
: OK. Better a part of text_utils
? Does this need a separate module?
Aha, so topic_coherence are some internal functions? And the actual model will live in another module?
Yes, CoherenceModel
in gensim.models
I still don't understand the summarization part: where will the models/utility functions be? Same module/different modules? What is the package structure?
Now structure here and used as from gensim.summarization import summarize, keywords, summarize_corpus
I suggest move gesnim/summarization
-> gensim.models.summarization
with imports in gensim/models/summarization_inner/__init__.py
.
New imports will look like
from gensim.models import summarize, keywords, summarize_corpus
from gensim.models.summarization import summarize, keywords, summarize_corpus
Split current utils
is a good idea, but I don't think that I should make it now (I'll make it later in distinct PR). For now, I think gensim/utils/utils.py
with imports in __init__.py
is the best choice
parsing: OK. Better a part of text_utils? Does this need a separate module?
text_utils
is OK for parsing
, I agree.
I'm not an expert in packaging but that _inner
stuff looks weird. To me, it is an indication the whole thing should be a (sub)package, gensim.models.summarization
.
Why not simply:
๐ gensim
โ๐ ...
โ๐ models
โ๐ summarization
โ๐ __init__.py
โ๐ some_utils_module.py
โ๐ class_module.py
โ๐ ...
?
@piskvorky
For summarization
- completely agree
for coherence
- _inner
is better, because this functions for internal use only (it's like an explicit marker)
Internal use is marked with a _
prefix in Python, not _inner
suffix. I still don't see why we should do that. CC @gojomo @jayantj thoughts?
Ok, let's use _
How do other projects denote "internal modules and packages"?
_
is default variant. I asked about _inner
because we already have similar things (like word2vec_inner.*
)
That's a file though (and a special one, not Python), not a package.
I don't feel strongly either way; I think we should follow the common practice here, the path of least surprise for our users.
Agree with you @piskvorky, let's continue my PR.
re: _inner
โ I have not noticed other Python projects grouping "internal-only" functionality into its own distinctly-named module, though I may have just not noticed. Unless it's really longwinded & arcane stuff, I'd keep it alongside the more public classes/functions that it supports, and group/name by describable functionality (either exact functions, or roles like '_support' or '_extras' or '_utils') rather than just the more-aesthetic (and not really enforceable) 'outer'/surface vs 'inner' distinction.
In gensim, we have many sub-packages, but several of this should be a part of another subpackage, another part is broken/useless and should be removed.
Candidates:
/examples
- Old broken code + confused users with name, should be removed./parsing
- Not relevant for gensim (all of this already implemented in NLTK/etc.), should be removed (but before, need to check that this isn't used).~/scripts
- Many scripts for Wikipedia + symlinks + some conversions + w2v_standalone. Need to look into all wiki scripts and understand why this needed, remove all that no needed (w2v_standalone too)./summarization
- need to refactor code and create one model, no need distinct subpackage for this.~/topic_coherence
- same assummarization
~nose.py
- unused runner for nose, should be removed.