htrc / torchlite-app

torchlite-app.vercel.app

As an end user, I want to be able to clean my workset #18

Open cwulfman opened 1 year ago

cwulfman commented 1 year ago

See:

On the dashboard is a checkbox menu of eight choices:

  1. Apply Stopwords
  2. Make Lowercase
  3. Lemmatize
  4. Stem
  5. Search and Replace
  6. Page Features
  7. POS Tags
  8. Token Count Limits

As for the filtering options, the dashboard does not have an "Apply" button, so the front end, to be idempotent, needs to send a PUT request containing the complete current set of selected cleaners each time one of the checkboxes is changed.
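A minimal sketch of why sending the whole selection keeps the request idempotent. The cleaner identifiers, the JSON body shape, and the `handle_put` stand-in for the server are all assumptions made for illustration; none of them is taken from the torchlite code.

```python
import json

# Hypothetical identifiers for the eight dashboard choices (assumed names).
CLEANERS = (
    "stopwords", "lowercase", "lemmatize", "stem",
    "search_replace", "page_features", "pos_tags", "token_count_limits",
)

def build_put_body(selected):
    """Serialize the complete current selection, not just the latest change."""
    unknown = set(selected) - set(CLEANERS)
    if unknown:
        raise ValueError(f"unknown cleaners: {sorted(unknown)}")
    # Emit in menu order so equal selections always serialize identically.
    return json.dumps({"cleaners": [c for c in CLEANERS if c in selected]})

def handle_put(state, body):
    """Stand-in for the server: replace the stored selection wholesale."""
    new_state = dict(state)
    new_state["cleaners"] = json.loads(body)["cleaners"]
    return new_state

# Replaying the same PUT leaves the state unchanged.
body = build_put_body({"stem", "lowercase"})
once = handle_put({"cleaners": []}, body)
twice = handle_put(once, body)
assert once == twice
```

Because each request carries the full set, a retried or duplicated request cannot toggle a cleaner by accident, which a "flip this one checkbox" message could.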

Some of these operations (like Stem and Lemmatize) are standard NLP operations on tokens. Others (like POS Tags and Token Count Limits) seem to be filters: for example, include in the full token list only tokens that are nouns, include only tokens that occur more than ten times in the dataset, or exclude all tokens that appear in a stop list. Still others, like Make Lowercase and Search and Replace, are basic string operations, but they alter the token data.
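To make the distinction concrete, here is a rough sketch that treats the token data as a mapping from token string to count. The function names and the choice of `collections.Counter` are assumptions for illustration only. String operations rewrite the keys (and so can merge counts), while filters only drop entries.

```python
from collections import Counter

def make_lowercase(counts):
    """String operation: rewrites tokens, merging counts that collide."""
    out = Counter()
    for token, n in counts.items():
        out[token.lower()] += n
    return out

def search_and_replace(counts, old, new):
    """String operation: plain substring replacement inside each token."""
    out = Counter()
    for token, n in counts.items():
        out[token.replace(old, new)] += n
    return out

def apply_stopwords(counts, stopwords):
    """Filter: drop every token that appears in the stop list."""
    return Counter({t: n for t, n in counts.items() if t not in stopwords})

def more_than(counts, threshold):
    """Filter: keep only tokens that occur more than `threshold` times."""
    return Counter({t: n for t, n in counts.items() if n > threshold})

counts = Counter({"The": 3, "the": 9, "whale": 12, "Whale": 1})
cleaned = more_than(apply_stopwords(make_lowercase(counts), {"the"}), 10)
```

Order matters here: lowercasing first lets a lowercase stop list catch "The", and the count threshold is applied to the merged counts.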

"Page Features" refers to the header, footer, and body sections in the EF data; users should be able to choose which of these sections to include in their analyses.
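A sketch of how section selection could work, assuming each page is a dict with `header`, `body`, and `footer` entries, each holding a `tokenPosCount` mapping of token to per-POS counts. That page shape is an assumption about the EF data and should be checked against the actual files; the function name and the optional `pos_tags` argument are hypothetical.

```python
from collections import Counter

SECTIONS = ("header", "body", "footer")

def section_token_counts(page, sections=("body",), pos_tags=None):
    """Sum token counts over the chosen page sections.

    If pos_tags is given, only counts for those POS tags are included.
    """
    unknown = set(sections) - set(SECTIONS)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    counts = Counter()
    for name in sections:
        section = page.get(name) or {}  # tolerate missing or null sections
        for token, by_pos in (section.get("tokenPosCount") or {}).items():
            for pos, n in by_pos.items():
                if pos_tags is None or pos in pos_tags:
                    counts[token] += n
    return counts

# Made-up page used only to exercise the function.
page = {
    "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
    "body": {"tokenPosCount": {"whale": {"NN": 2}, "swim": {"VB": 1}}},
    "footer": None,
}
body_only = section_token_counts(page)
nouns = section_token_counts(page, ("header", "body"), pos_tags={"NN"})
```

The same loop covers the POS Tags filter, since the per-POS counts are already separated in this assumed layout.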

To implement this feature, it will be prudent to break the task down into individual sub-features, by category:

Suggestion: a noise filter would probably be useful, that is, a filter that uses a dictionary (or dictionaries, in the case of multilingual corpora) to throw out token strings that are not found in the dictionary.

jswatsch commented 1 year ago

HTRC discussed this and wants to include the following choices:

  1. Apply Stopwords
  2. Make Lowercase
  3. Lemmatize
  4. Stem
  5. Search and Replace
  6. Page Features
  7. POS Tags
  8. Token Count Limits

Clarification on "Page Features": it is a selector between header, body, and footer.

No noise filter at this time.