voyanttools / trombone

Stop filtering on minRawFreq before distribution statistics are calcu… #44

Closed recrm closed 1 month ago

recrm commented 1 month ago

Resolves #38.

When runAllTermsWithDistributionsDocumentTermVectors runs, it filters on minRawFreq twice. First, within each document, it drops every word whose count in that document doesn't reach minRawFreq. Then it calculates the statistics. Finally, it removes all words whose corpus-level count doesn't reach minRawFreq.

I find this behavior confusing, because words are not filtered at the document level when runAllTermsWithoutDistributions is run. As a result, the two code paths report conflicting numbers for the same corpus.

The fix I propose is simple: remove the document-level filter in runAllTermsWithDistributionsDocumentTermVectors and leave everything else as is.
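
To illustrate the discrepancy, here is a minimal sketch in Python. It is not Trombone code: the function `corpus_freqs`, the `filter_per_document` flag, and the toy corpus are all hypothetical, and it assumes that "hitting" minRawFreq means a count greater than or equal to the threshold.

```python
from collections import Counter


def corpus_freqs(doc_term_counts, min_raw_freq, filter_per_document):
    """Sum per-document raw counts into corpus totals, then apply min_raw_freq.

    Hypothetical illustration only; assumes a term passes when count >= min_raw_freq.
    """
    totals = Counter()
    for counts in doc_term_counts:
        for term, freq in counts.items():
            # Document-level filter: the step this PR proposes to remove.
            if filter_per_document and freq < min_raw_freq:
                continue
            totals[term] += freq
    # Corpus-level filter: kept in both variants.
    return {term: freq for term, freq in totals.items() if freq >= min_raw_freq}


# Toy corpus: raw term counts for three documents.
docs = [
    {"whale": 3, "sea": 1},
    {"whale": 1, "sea": 1},
    {"whale": 2, "sea": 1},
]

# Filtering twice undercounts "whale" and loses "sea" entirely.
print(corpus_freqs(docs, 2, filter_per_document=True))   # {'whale': 5}
# Filtering only at the corpus level keeps the full totals.
print(corpus_freqs(docs, 2, filter_per_document=False))  # {'whale': 6, 'sea': 3}
```

With minRawFreq set to 2, the double filter reports 5 for "whale" and drops "sea", even though "sea" occurs 3 times in the corpus. Filtering only at the corpus level gives 6 and 3, which is what a path without document-level filtering would be expected to report.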