Open GoogleCodeExporter opened 9 years ago
Are you suggesting that if a 1-gram is part of a 2-gram, then the 1-gram should
not be listed at all?
Original comment by richard.eckart
on 29 Oct 2014 at 3:38
No.
I am suggesting that - due to the strong dependencies between counts - counts
should be made independently for each ngram level.
Otherwise, if you want e.g. 100 trigrams in your features, you will also have
something like 800 bigrams and 5000 unigrams as features, which is not what
most people would expect here.
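
To illustrate the difference, here is a minimal Python sketch. It is not the
actual feature extractor code: the function names (`count_ngrams`,
`top_k_pooled`, `top_k_per_level`) and the toy token list are made up for this
example, and it assumes that "top k" simply means the k most frequent ngrams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous ngrams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokens, min_n, max_n):
    """Count ngrams separately for each level from min_n to max_n."""
    return {n: Counter(ngrams(tokens, n)) for n in range(min_n, max_n + 1)}


def top_k_pooled(counts_by_level, k):
    """Pooled selection: all levels compete for the same k slots."""
    pooled = Counter()
    for counts in counts_by_level.values():
        pooled.update(counts)
    return [gram for gram, _ in pooled.most_common(k)]


def top_k_per_level(counts_by_level, k):
    """Per-level selection: each ngram level gets its own k slots."""
    return {
        n: [gram for gram, _ in counts.most_common(k)]
        for n, counts in counts_by_level.items()
    }


# Hypothetical toy input, chosen so that frequent unigrams and bigrams
# crowd out every trigram in the pooled ranking.
tokens = "a b a b a b c".split()
counts = count_ngrams(tokens, 1, 3)

pooled = top_k_pooled(counts, 3)        # contains no trigram at all
per_level = top_k_per_level(counts, 2)  # two ngrams for each of n = 1, 2, 3
```

With pooled selection, a trigram only becomes a feature if it is as frequent as
the unigrams and bigrams it competes with. With per-level selection, asking for
k trigrams yields k trigrams regardless of the lower levels.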
Original comment by torsten....@gmail.com
on 29 Oct 2014 at 3:43
This is an understood design feature. I think the idea was that higher-N ngrams
are so sparse that we don't want them to occupy feature space unless they are
actually frequent enough to compete with lower-N ngrams. If someone wants lots
of trigrams, they can constrain the extractor to minN=3, maxN=3.
Of course, it's always nice to have more control.
We could (1) rewrite all the ngram classes, or (2) tackle Issue 39, allowing
multiple copies of the same FE in a single experiment, which would also solve
the problem and actually permit much finer control (and would not change
previous experiments, and would not risk overengineering the ngram FE classes).
I vote for option #2.
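
A rough Python sketch of what running several copies of the same FE could look
like. Everything here is hypothetical and only meant to show the idea:
`NGramExtractorConfig`, `select_features` and `run_extractors` are invented
names, not existing classes, and each copy is assumed to pool counts over its
own minN..maxN range before keeping its own top k.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NGramExtractorConfig:
    """Hypothetical settings for one copy of an ngram extractor."""
    min_n: int
    max_n: int
    top_k: int


def select_features(tokens, config):
    """Pool counts over the config's n range and keep the top_k ngrams."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(config.min_n, config.max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(config.top_k)]


def run_extractors(tokens, configs):
    """Run each configured copy independently and merge the results."""
    features = []
    for config in configs:
        for gram in select_features(tokens, config):
            if gram not in features:  # keep first occurrence only
                features.append(gram)
    return features


tokens = "a b a b a b c".split()
configs = [
    NGramExtractorConfig(min_n=1, max_n=2, top_k=3),  # pooled uni/bigrams
    NGramExtractorConfig(min_n=3, max_n=3, top_k=2),  # trigrams on their own
]
features = run_extractors(tokens, configs)
```

An experiment that keeps using a single copy with minN=1, maxN=3 would behave
exactly as before, so old results stay reproducible.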
Original comment by EmilyKJa...@gmail.com
on 29 Oct 2014 at 4:55
I agree with Emily's proposal: solution #2 would allow what Torsten suggested
while keeping old experiments reproducible.
Original comment by daxenber...@gmail.com
on 29 Oct 2014 at 5:23
Original comment by daxenber...@gmail.com
on 11 Dec 2014 at 3:48
Original issue reported on code.google.com by
torsten....@gmail.com
on 29 Oct 2014 at 3:25