Open fcharras opened 6 years ago
Well, the documentation pretty much states that -
no_above : float, optional
-- Keep tokens which are contained in no more than no_above
documents
So, there should be a "greater or equal" sign rather than "strictly greater".If this's what you really want to know.
I see, then I am probably wronged by the default value to 0.5
, I didn't think that None
is a valid value.
Sorry, reopening the issue because None
is not a valid value, so it's not optional ?
Of course ,None
is not a valid value. no_above
being optional states only that you may or may not provide this value(default being0.5
). Specifying None should give TypeError
.
Ok, I thought optional
means that filtering tokens that are contained in more than no_above
documents is optional, but it means that the function will run without error if it's not provided, while still filtering words. IMO it would be appreciated to have an option to disable either of the filters (min_count
or no_above
) as it is possible for keep_n
:
(or keep all if None)
I didn't seem obvious to me :thinking:
@fcharras all works fine, if you want to disable one (or both), you should use something like dct.filter_extremes(no_below=0, no_above=1.)
I agree, it is non-obvious. In what way is that parameter optional?
If optional is meant only as a synonym for "keyword argument with default value", then that's unnecessary information. It's already obvious from the method signature (including the actual default value -- and that value is "non-optional").
@menshikh-iv if the new documentation tries to use optional
in place of has some default value
, I'd suggest we get rid of optional
everywhere.
For me, "optional" is superfluous information, yes, or I didn't understand something? If yes - what's optional
means?
Then we agree. Let's remove that confusing unnecessary information (here and elsewhere).
@piskvorky wait, first of all, we need to understand, what this means (before deleting something).
I don't know who wrote the documentation -- best ask them.
To me, that parameter has a default value (which is clear from the docs) and is not "optional" in any way worth explicitly mentioning.
@piskvorky I ask you, in your opinion, what means "optional"?
It means different things in different context. Here, as a user, I'd expect "optional" to mean that if I don't supply it, no clipping is done (the clipping is optional). I wouldn't expect it to mean "has a default value that clips anyway even if not supplied".
Which is how @fcharras understood it too, if I'm reading this correctly.
Not a big deal, but that word "optional" is just adding potential confusion while adding nothing of value: the "defaultness" is already clear from the method signature, in fact even more clearly because you can see the actual default value there.
That was my point indeed thank you very much for clarifying.
If we agree on that, I'd like to highlight again why I opened the ticket: the option no_above=1
still filter some words (the words that appear at least once in each document), and that added to my confusion, that's why I was suggesting to use a strict inequality, so that no_above=x
keep only words that appear in stricly less than no_above
documents. Currently the inequality is soft.
(also, I thought that setting no_above=2
was still triggering this issue, but after the code it shouldn't be the case so that was probably a mistake on my side).
1.0 should not filter anything IMO (no upper limit). If it does, then that is a bug.
@fcharras can you show me the small example? I can't reproduce your problem.
@piskvorky
It means different things in different context. Here, as a user, I'd expect "optional" to mean that if I don't supply it, no clipping is done (the clipping is optional). I wouldn't expect it to mean "has a default value that clips anyway even if not supplied".
In this case, this isn't clear for "common" case, where optional
should be mentioned? This is same as "parameter can be None"?
Parameter can be None.
is not a meaningful message. The documentation is primarily there to describe what a parameter does (semantics), why it's there, what are its effects and reasonable values.
As a user, seeing Parameter can be None.
tells me nothing (though it's not as bad as a misleading Optional
).
Hello everybody. We documented the function using numpydoc style, so here's the link to the styleguide
Here's the part concerning the optional
keyword:
I think that the problem here is not with the optional keyword, but with the absence of the clear notion about how to disable filtration altogether. I personally think that default values should be left as-is because it's easier for an inexperienced user to just use some near-optimal values instead of trying to find their own.
I suggest we just mention how to disable the filtration in the docs.
Default values will certainly be left as-is. This discussion is only about the documentation.
Repeating the same information twice (method and docstring) violates DRY, and for this reason I don't like it. It is a recipe for maintenance headaches.
If we really want to repeat the same information twice in different places, we should at least make the manual description sufficiently useful = like you say, explain what the default is and why, and not just reiterate that a "default exists and it equals X" (or even worse, "default exists").
Docstrings are meant to be read by humans, not computers, so explaining the semantics of the parameters and offering insight into their values are more relevant than syntax.
@piskvorky I mostly agree, though optional keyword next to an argument is IMHO more readable than this
As for DRY, yes, in a perfect world those optional
keywords and default values would be generated automatically by Sphinx or something like that, but here we are. It's a little bit ugly, but improves readability quite a bit. By the way, scikit-learn's RandomForest contains information about what default params mean, just like you suggest.
Description
filter_extremes
whenno_above = 1
still filters the words that appear at least one time in all the documents. It should filter no word at all (that means replacing a "greater or equal" with a "strictly greater" somewhere).