piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.56k stars 4.37k forks source link

dictionnary.filter_extremes with no_above = 1 still filter words that appear in every documents. #1938

Open fcharras opened 6 years ago

fcharras commented 6 years ago

Description

filter_extremes when no_above = 1 still filters the words that appear at least one time in all the documents. It should filter no word at all (that means replacing a "greater or equal" with a "strictly greater" somewhere).

morsdor commented 6 years ago

Well, the documentation pretty much states that - no_above : float, optional -- Keep tokens which are contained in no more than no_above documents So, there should be a "greater or equal" sign rather than "strictly greater".If this's what you really want to know.

fcharras commented 6 years ago

I see, then I am probably wronged by the default value to 0.5, I didn't think that None is a valid value.

fcharras commented 6 years ago

Sorry, reopening the issue because None is not a valid value, so it's not optional ?

morsdor commented 6 years ago

Of course ,None is not a valid value. no_above being optional states only that you may or may not provide this value(default being0.5). Specifying None should give TypeError.

fcharras commented 6 years ago

Ok, I thought optional means that filtering tokens that are contained in more than no_above documents is optional, but it means that the function will run without error if it's not provided, while still filtering words. IMO it would be appreciated to have an option to disable either of the filters (min_count or no_above) as it is possible for keep_n :

(or keep all if None)

I didn't seem obvious to me :thinking:

menshikh-iv commented 6 years ago

@fcharras all works fine, if you want to disable one (or both), you should use something like dct.filter_extremes(no_below=0, no_above=1.)

piskvorky commented 6 years ago

I agree, it is non-obvious. In what way is that parameter optional?

If optional is meant only as a synonym for "keyword argument with default value", then that's unnecessary information. It's already obvious from the method signature (including the actual default value -- and that value is "non-optional").

@menshikh-iv if the new documentation tries to use optional in place of has some default value, I'd suggest we get rid of optional everywhere.

menshikh-iv commented 6 years ago

For me, "optional" is superfluous information, yes, or I didn't understand something? If yes - what's optional means?

piskvorky commented 6 years ago

Then we agree. Let's remove that confusing unnecessary information (here and elsewhere).

menshikh-iv commented 6 years ago

@piskvorky wait, first of all, we need to understand, what this means (before deleting something).

piskvorky commented 6 years ago

I don't know who wrote the documentation -- best ask them.

To me, that parameter has a default value (which is clear from the docs) and is not "optional" in any way worth explicitly mentioning.

menshikh-iv commented 6 years ago

@piskvorky I ask you, in your opinion, what means "optional"?

piskvorky commented 6 years ago

It means different things in different context. Here, as a user, I'd expect "optional" to mean that if I don't supply it, no clipping is done (the clipping is optional). I wouldn't expect it to mean "has a default value that clips anyway even if not supplied".

Which is how @fcharras understood it too, if I'm reading this correctly.

Not a big deal, but that word "optional" is just adding potential confusion while adding nothing of value: the "defaultness" is already clear from the method signature, in fact even more clearly because you can see the actual default value there.

fcharras commented 6 years ago

That was my point indeed thank you very much for clarifying.

If we agree on that, I'd like to highlight again why I opened the ticket: the option no_above=1 still filter some words (the words that appear at least once in each document), and that added to my confusion, that's why I was suggesting to use a strict inequality, so that no_above=x keep only words that appear in stricly less than no_above documents. Currently the inequality is soft.

(also, I thought that setting no_above=2 was still triggering this issue, but after the code it shouldn't be the case so that was probably a mistake on my side).

piskvorky commented 6 years ago

1.0 should not filter anything IMO (no upper limit). If it does, then that is a bug.

menshikh-iv commented 6 years ago

@fcharras can you show me the small example? I can't reproduce your problem.

@piskvorky

It means different things in different context. Here, as a user, I'd expect "optional" to mean that if I don't supply it, no clipping is done (the clipping is optional). I wouldn't expect it to mean "has a default value that clips anyway even if not supplied".

In this case, this isn't clear for "common" case, where optional should be mentioned? This is same as "parameter can be None"?

piskvorky commented 6 years ago

Parameter can be None. is not a meaningful message. The documentation is primarily there to describe what a parameter does (semantics), why it's there, what are its effects and reasonable values.

As a user, seeing Parameter can be None. tells me nothing (though it's not as bad as a misleading Optional).

anotherbugmaster commented 6 years ago

Hello everybody. We documented the function using numpydoc style, so here's the link to the styleguide

Here's the part concerning the optional keyword:

image

I think that the problem here is not with the optional keyword, but with the absence of the clear notion about how to disable filtration altogether. I personally think that default values should be left as-is because it's easier for an inexperienced user to just use some near-optimal values instead of trying to find their own.

I suggest we just mention how to disable the filtration in the docs.

piskvorky commented 6 years ago

Default values will certainly be left as-is. This discussion is only about the documentation.

Repeating the same information twice (method and docstring) violates DRY, and for this reason I don't like it. It is a recipe for maintenance headaches.

If we really want to repeat the same information twice in different places, we should at least make the manual description sufficiently useful = like you say, explain what the default is and why, and not just reiterate that a "default exists and it equals X" (or even worse, "default exists").

Docstrings are meant to be read by humans, not computers, so explaining the semantics of the parameters and offering insight into their values are more relevant than syntax.

anotherbugmaster commented 6 years ago

@piskvorky I mostly agree, though optional keyword next to an argument is IMHO more readable than this

image

As for DRY, yes, in a perfect world those optional keywords and default values would be generated automatically by Sphinx or something like that, but here we are. It's a little bit ugly, but improves readability quite a bit. By the way, scikit-learn's RandomForest contains information about what default params mean, just like you suggest.