karst10607 / SyntaxEd

A syntax dial app that measures processing time in a Mad Lib style game, quantifying a player's ability and improvement in particular natural-language syntax

Build up the first corpus for the experiment #3

Open karst10607 opened 8 years ago

karst10607 commented 8 years ago

I figured out that I need to build the prerequisites for the experiment part by part, so the corpus and the Python NLTK-powered parts need to be done first.

I may migrate the corpus along with other corpora later this year. The source text may be quite unbalanced at first, since it is drawn randomly from whatever materials I can find.

Either way, I need to tokenize and lemmatize the texts first. This corpus will be a critical part of building my Mad Lib style syntax game.
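A minimal sketch of the tokenize-then-lemmatize step, using only the standard library so it runs without any downloads. The regex tokenizer and the `TOY_LEMMAS` lookup table are hypothetical stand-ins; with NLTK and its data installed, `nltk.word_tokenize` and `WordNetLemmatizer().lemmatize` would be the likely replacements.

```python
import re

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table.
# A callable such as nltk.stem.WordNetLemmatizer().lemmatize could be
# passed to lemmatize_tokens() instead.
TOY_LEMMAS = {"cats": "cat", "were": "be", "sitting": "sit", "mats": "mat"}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lemmatize_tokens(tokens, lemmatize=None):
    """Map every token to its lemma using the supplied callable."""
    if lemmatize is None:
        # Fall back to the toy table; unknown words are kept unchanged.
        lemmatize = lambda tok: TOY_LEMMAS.get(tok, tok)
    return [lemmatize(tok) for tok in tokens]


tokens = tokenize("The cats were sitting on the mats.")
lemmas = lemmatize_tokens(tokens)
print(lemmas)
```

Keeping the lemmatizer as a parameter means the rest of the pipeline does not change when the toy table is swapped for a real one.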

karst10607 commented 8 years ago

- http://textminingonline.com/deep-learning-for-nlp-resources-from-github

karst10607 commented 8 years ago

This is how I plan to build a corpus for my app and thesis. I need to build a huge list of words in their lemma form. If possible, I need to separate them into different POS categories for future use in the Mad Lib syntax game. I know the words derived from the raw data will contain duplicates or need fixing. So the huge list will be stored in a .txt file and then converted into a frequency list. Eventually it will be divided into several POS category lists, each saved separately. The user can then use the base forms of these lemmas to create syntactically correct sentences. The derived forms or suffixes can be provided later through the app's interface, shown as a turnable ring behind the lemma.
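A rough sketch of the frequency-list and per-POS-file idea described above. The function names, the coarse POS labels, the tab-separated file layout, and the sample pairs are all assumptions made for illustration; duplicates from the raw data simply collapse into counts.

```python
from collections import Counter, defaultdict
from pathlib import Path
import tempfile


def frequency_list(lemmas):
    """Return (lemma, count) pairs, most frequent first."""
    return Counter(lemmas).most_common()


def save_pos_lists(lemma_pos_pairs, out_dir):
    """Write one '<pos>.txt' frequency file per POS category; return the paths."""
    by_pos = defaultdict(Counter)
    for lemma, pos in lemma_pos_pairs:
        by_pos[pos][lemma] += 1

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for pos, counts in by_pos.items():
        path = out_dir / f"{pos}.txt"
        # One "lemma<TAB>count" line per lemma, most frequent first.
        lines = [f"{lemma}\t{n}" for lemma, n in counts.most_common()]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths[pos] = path
    return paths


# Hand-written sample data: (lemma, coarse POS) pairs with duplicates.
pairs = [("cat", "noun"), ("sit", "verb"), ("cat", "noun"),
         ("mat", "noun"), ("sit", "verb"), ("cat", "noun")]

freq = frequency_list([lemma for lemma, _ in pairs])
paths = save_pos_lists(pairs, tempfile.mkdtemp())
print(freq)
print(paths["noun"].read_text(encoding="utf-8"))
```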

Another solution might be more precise but harder to understand. I know NLTK comes with a POS tagger, but it's complicated to use, and it's even harder to group the tagged words into my desired lists.
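Grouping tagged words may be less painful than it looks. The sketch below assumes the input is a list of `(token, tag)` tuples with Penn Treebank style tags, which is the shape `nltk.pos_tag(nltk.word_tokenize(text))` returns; the sample pairs here are hand-written rather than real tagger output, and the `COARSE` mapping and `group_by_pos` helper are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from the first two letters of a Penn Treebank tag
# to the coarse category lists wanted for the game.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def group_by_pos(tagged):
    """Collect lowercased words into coarse POS lists, skipping duplicates."""
    groups = defaultdict(list)
    for word, tag in tagged:
        category = COARSE.get(tag[:2])  # e.g. NNS -> NN -> "noun"
        if category is None:
            continue  # determiners, prepositions, punctuation, etc.
        word = word.lower()
        if word not in groups[category]:
            groups[category].append(word)
    return dict(groups)


# Hand-written sample in the (token, tag) shape a tagger would produce.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("quickly", "RB"), ("over", "IN"), ("dogs", "NNS"), ("fox", "NN")]

groups = group_by_pos(tagged)
print(groups)
```

Collapsing the fine-grained tags by prefix keeps the lists simple, at the cost of losing distinctions such as singular versus plural nouns.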

From https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ to https://pythonprogramming.net/nltk-corpus-corpora-tutorial/

karst10607 commented 8 years ago

I figured out a key fact: I need to build the corpus with the NLTK POS tagger, and there are no other alternatives. Here's the process:

  1. Build the corpus with the NLTK POS tagger. The text source could be the built-in raw data in NLTK or some randomly chosen, recently written news. (Novels may not be a good source.)
  2. Put the words into a Mad Lib game template with advanced POS attributes. (The advanced POS tags generated by Python are really complicated, so the template for a sentence will be complicated, too.)
  3. Generate a sentence randomly based on the above template. Once step 6 is completed, generate another random sentence from the corpus, and repeat until the assigned number of game rounds is over.
  4. The user can now adjust the suffixes and derivational forms to make the sentence grammatical.
  5. Apply the grammar check to the user's adjusted result.
  6. If it passes the test, stop the timer.
  7. If not, keep the timer running and notify the user that the sentence is not yet grammatical.
  8. Collect the task completion times for, say, 20 quizzes, and see whether the user improved during the game (before and after the mnemonic device / synesthesia training).
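The round loop in the steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the template is a plain list of POS tags, the grammar check is a trivial stand-in, the "user" is scripted, and improvement is measured as the drop in mean completion time between the first and second half of the rounds.

```python
import random
import time


def make_sentence(template, lexicon, rng):
    """Fill each POS slot in the template with a random lemma of that POS."""
    return [rng.choice(lexicon[pos]) for pos in template]


def play_round(template, lexicon, get_answer, is_grammatical, rng):
    """Run one round; return (seconds elapsed, number of attempts)."""
    prompt = make_sentence(template, lexicon, rng)
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        answer = get_answer(prompt)
        if is_grammatical(answer):
            return time.monotonic() - start, attempts  # stop the timer
        # Not yet grammatical: the timer keeps running and the user retries.


def mean_improvement(times):
    """Mean time of the first half minus mean time of the second half."""
    half = len(times) // 2
    first, second = times[:half], times[half:]
    return sum(first) / len(first) - sum(second) / len(second)


# Hypothetical lexicon and template keyed by POS tag.
lexicon = {"DT": ["the"], "NN": ["cat", "dog"], "VB": ["sleep", "run"]}
template = ["DT", "NN", "VB"]


def is_grammatical(words):
    """Stand-in grammar check: the verb must carry a third-person -s."""
    return words[-1].endswith("s")


def scripted_user():
    """Fake user: submits the bare lemmas first, then adds the suffix."""
    tries = {"n": 0}

    def get_answer(prompt):
        tries["n"] += 1
        if tries["n"] % 2 == 1:
            return prompt
        return prompt[:-1] + [prompt[-1] + "s"]

    return get_answer


rng = random.Random(0)
get_answer = scripted_user()
results = [play_round(template, lexicon, get_answer, is_grammatical, rng)
           for _ in range(4)]
times = [elapsed for elapsed, _ in results]
print(results)
```

A real version would need a proper grammar checker and some cap on attempts, since this loop never ends if the answer is never accepted.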