This PR switches from a static list of stopwords provided in a file to a dynamic list derived from the corpus being processed. The stopword list is now generated by selecting a configurable number of terms with the lowest IDF, i.e. the terms that appear in the largest share of documents. Additionally, stopwords are now filtered out when featurizing sentences, rather than in the document segmentation and tokenization step.
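As a rough illustration of the approach described above, here is a minimal sketch, not the code in this PR. The function names `lowest_idf_terms` and `featurize`, the `num_stopwords` parameter, the unsmoothed `log(N / df)` IDF formula, and the bag-of-words featurization are all assumptions made for the example; the actual implementation may differ on each point.

```python
import math
from collections import Counter


def lowest_idf_terms(tokenized_docs, num_stopwords):
    """Return the `num_stopwords` terms with the lowest IDF in the corpus.

    Hypothetical helper: assumes IDF = log(N / document_frequency).
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    # Sort by IDF, breaking ties alphabetically so the result is deterministic.
    ranked = sorted(idf, key=lambda term: (idf[term], term))
    return set(ranked[:num_stopwords])


def featurize(sentence_tokens, stopwords):
    """Bag-of-words counts for one sentence, skipping stopwords.

    Hypothetical featurizer: tokenization leaves stopwords in place,
    and they are dropped only here.
    """
    return Counter(t for t in sentence_tokens if t not in stopwords)


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]
stopwords = lowest_idf_terms(docs, num_stopwords=2)
features = featurize(docs[0], stopwords)
```

Filtering at featurization time means the tokenized documents keep their stopwords, so the document frequencies can be computed over the full token stream before any term is discarded.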
Resolves #31.