When runAllTermsWithDistributionsDocumentTermVectors runs, it filters on minRawFreq twice. First, within each document, it removes every word whose count in that document does not reach minRawFreq. Then it calculates the statistics. Finally, it removes all words that do not reach minRawFreq at the corpus level.
I find this behavior confusing, because words are not filtered at the document level when runAllTermsWithoutDistributions is run. As a result, the two methods report conflicting numbers.
The fix I propose is simple: remove the document-level filter in runAllTermsWithDistributionsDocumentTermVectors and leave everything else as is.
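
To make the discrepancy concrete, here is a minimal sketch in Python. It is not the project's actual code: the toy corpus, the `corpus_counts` helper, and its `filter_per_document` flag are all hypothetical names invented for illustration, and the only thing taken from the description above is the order of the two minRawFreq filters.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
DOCS = [
    ["cat", "cat", "dog"],
    ["cat", "dog", "bird"],
    ["dog", "bird"],
]
MIN_RAW_FREQ = 2  # stand-in for minRawFreq


def corpus_counts(docs, min_raw_freq, filter_per_document):
    """Sum raw term counts over all documents, then filter at corpus level.

    When filter_per_document is True, a term's count in a document is only
    added if it reaches min_raw_freq in that document (the extra filter this
    PR proposes to remove). This is an illustrative model, not the real code.
    """
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            if filter_per_document and count < min_raw_freq:
                continue  # document-level filter drops this occurrence
            totals[term] += count
    # Corpus-level filter, applied in both variants.
    return {t: n for t, n in totals.items() if n >= min_raw_freq}


# Models the current "with distributions" path: filtered twice.
double_filtered = corpus_counts(DOCS, MIN_RAW_FREQ, filter_per_document=True)

# Models the "without distributions" path, and the proposed fix.
single_filtered = corpus_counts(DOCS, MIN_RAW_FREQ, filter_per_document=False)

print(double_filtered)  # {'cat': 2}
print(single_filtered)  # {'cat': 3, 'dog': 3, 'bird': 2}
```

In this toy corpus, "dog" occurs three times overall but never twice in a single document, so the double filter drops it entirely, and "cat" is reported as 2 instead of 3. With the document-level filter removed, both paths agree.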
Resolves #38.