Open cwulfman opened 1 year ago
HTRC discussed this and wants to include the following choices:
Apply Stopwords
Make Lowercase
Lemmatize
Stem
Search and Replace
Page Features
POS Tags
Token Count Limits
Clarification on "Page Features": it is a selector among header, body, and footer.
No noise filter at this time.
See: 33, 34, 35, 36, 37
On the dashboard is a checkbox menu of eight choices:
1. Apply Stopwords
2. Make Lowercase
3. Lemmatize
4. Stem
5. Search and Replace
6. Page Features
7. POS Tags
8. Token Count Limits
As for the filtering options, the dashboard does not have an "Apply" button, so, to keep updates idempotent, the front end needs to send a PUT request containing the complete current set of selected cleaners each time one of the buttons is pressed.
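A minimal sketch of the PUT-on-every-click idea, assuming a JSON body with a `cleaners` key and an endpoint URL supplied by the caller. Both the key name and the `build_cleaner_put` helper are hypothetical and only illustrate why sending the whole selection keeps the request idempotent.

```python
import json
from urllib.request import Request

# The eight dashboard choices, in menu order.
CLEANERS = [
    "Apply Stopwords",
    "Make Lowercase",
    "Lemmatize",
    "Stem",
    "Search and Replace",
    "Page Features",
    "POS Tags",
    "Token Count Limits",
]


def build_cleaner_put(url, selected):
    """Build (but do not send) a PUT request carrying the full selection.

    The body always lists every currently selected cleaner in menu order,
    so repeating the same request leaves the server in the same state.
    The "cleaners" key is an assumed payload shape, not a documented API.
    """
    unknown = set(selected) - set(CLEANERS)
    if unknown:
        raise ValueError(f"unknown cleaners: {sorted(unknown)}")
    ordered = [name for name in CLEANERS if name in selected]
    body = json.dumps({"cleaners": ordered}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Because the body is derived from the full checkbox state rather than from the button that was just pressed, two clicks that end in the same state produce byte-identical requests.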
Some of these operations (like Stem and Lemmatize) are standard NLP operations on tokens; others (like POS Tags, Token Count Limits, and Apply Stopwords) seem to be filters (e.g., include in the full token list only tokens that are nouns; include in the token list only tokens that occur more than ten times in the dataset; exclude all tokens that appear in a stop list). Others, like Make Lowercase and Search and Replace, are basic string operations, but they alter the token data.
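The distinction between filters and string operations can be sketched as below. This assumes tokens are held as `(text, pos)` tuples, which is an illustrative representation rather than the project's actual data model; all function names are hypothetical. Stem and Lemmatize are left out because they would need an NLP library.

```python
from collections import Counter


# String operations: these change the token text itself.
def make_lowercase(tokens):
    return [(text.lower(), pos) for text, pos in tokens]


def search_and_replace(tokens, old, new):
    return [(text.replace(old, new), pos) for text, pos in tokens]


# Filters: these keep or drop tokens but never alter them.
def apply_stopwords(tokens, stopwords):
    return [(text, pos) for text, pos in tokens if text not in stopwords]


def keep_pos(tokens, allowed_tags):
    return [(text, pos) for text, pos in tokens if pos in allowed_tags]


def occurs_more_than(tokens, threshold):
    # Count by token text across the whole list, then keep frequent ones.
    counts = Counter(text for text, _ in tokens)
    return [(text, pos) for text, pos in tokens if counts[text] > threshold]
```

Order matters when the two kinds are combined: lowercasing before counting merges "Whale" and "whale" into one entry, whereas counting first would treat them as different tokens.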
"Page Features" refers to the header, footer, and body sections in the EF data; users should be able to choose which of these sections to include in their analyses.
To implement this feature, it will be prudent to break the task down into individual sub-features, by category:
Suggestion: a noise filter would probably be useful, i.e., a filter that uses a dictionary (or dictionaries, in the case of multilingual corpora) to throw out token strings that are not found in the dictionary.
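One way the suggested noise filter might look, assuming each dictionary is a plain set of lowercase word forms and that lookup is case-insensitive. Both are assumptions, and `noise_filter` is a hypothetical name.

```python
def noise_filter(tokens, dictionaries):
    """Keep only tokens found in at least one dictionary.

    `dictionaries` is an iterable of sets of lowercase words, one per
    language; a token survives if its lowercase form is in any of them.
    """
    known = set().union(*dictionaries)
    return [token for token in tokens if token.lower() in known]
```

For example, with an English set `{"whale", "sea"}` and a German set `{"wal", "meer"}`, the token list `["whale", "Meer", "xq7z", "sea"]` would be reduced to `["whale", "Meer", "sea"]`. A real implementation would also have to decide how to treat proper nouns and OCR variants, which a plain dictionary lookup discards.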