This is how I plan to build a corpus for my app and thesis. I need a large list of words in their lemma form, and if possible I want to separate them into different POS categories for later use in a Mad Lib syntax game. I know the words derived from the raw data will contain duplicates and entries that need fixing. So the plan is: store the big word list in a .txt file, convert it into a frequency list, and eventually split it into several POS category lists, saved individually (a sketch of the frequency-list step follows). That way the user can take the base form of these lemmas and build syntactically correct sentences. The derived forms and suffixes can be provided later through the app's interface, shown as a rotatable ring behind the lemma.
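A minimal sketch of the frequency-list step, assuming the lemmas already sit one per line in a text file; the file names `lemmas.txt` and `freq_list.txt` are just placeholders of mine:

```python
# Build a frequency list from a plain-text file of lemmas, one per line.
from collections import Counter

with open("lemmas.txt", encoding="utf-8") as f:
    lemmas = [line.strip().lower() for line in f if line.strip()]

freq = Counter(lemmas)  # collapses the duplicates into counts

# Write "lemma<TAB>count", most frequent first.
with open("freq_list.txt", "w", encoding="utf-8") as out:
    for lemma, count in freq.most_common():
        out.write(f"{lemma}\t{count}\n")
```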
Another approach might be more precise but harder to get right: I know NLTK comes with a POS tagger, but it is complicated to use, and it is even harder to group the tagged words into the lists I want (a rough attempt is sketched below).
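For reference, here is a rough sketch of how that grouping could work with `nltk.pos_tag`, bucketing words by the first letters of their Penn Treebank tags; the category names and output files are my own choices, not anything NLTK prescribes:

```python
# Tag a text and bucket the words into per-POS lists.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from collections import defaultdict
import nltk

text = "The quick brown fox jumps over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Map Penn Treebank tag prefixes (NN/NNS/NNP..., VB/VBD..., etc.)
# to the category lists I want.
prefixes = {"NN": "nouns", "VB": "verbs", "JJ": "adjectives", "RB": "adverbs"}

buckets = defaultdict(set)
for word, tag in tagged:
    for prefix, name in prefixes.items():
        if tag.startswith(prefix):
            buckets[name].add(word.lower())

# One file per POS category, e.g. nouns.txt, verbs.txt ...
for name, words in buckets.items():
    with open(f"{name}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(words)))
```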
After working through https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ and https://pythonprogramming.net/nltk-corpus-corpora-tutorial/, I figured out one key fact: I need to build the corpus on top of the NLTK POS tagger, and there is no real alternative. Here's the process:
I need to build what the experiment requires part by part, so the corpus and the Python/NLTK-powered parts have to be done first.
I may migrate this corpus, along with other corpora, later this year. The source text may be quite unbalanced at first, since it is drawn more or less at random from whatever materials I can find.
Either way, I need to tokenize and lemmatize the texts first (a sketch of that step is below). This corpus will be a critical part of building my Mad Lib style syntax game.
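A rough sketch of the tokenize-then-lemmatize step, assuming NLTK's `WordNetLemmatizer`; the helper name `to_wordnet_pos` is my own. The lemmatizer needs a WordNet POS hint mapped from the Treebank tag, otherwise verbs like "running" are left unchanged:

```python
# Tokenize, POS-tag, then lemmatize with a POS-aware WordNetLemmatizer.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger');
#           nltk.download('wordnet')
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to WordNet's four POS constants.
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # WordNet's default

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [lemmatizer.lemmatize(word.lower(), to_wordnet_pos(tag))
            for word, tag in tagged if word.isalpha()]

print(lemmatize_text("The foxes were running across the fields"))
# -> ['the', 'fox', 'be', 'run', 'across', 'the', 'field']
```

The lemmas this produces can then be written one per line and fed straight into the frequency-list step sketched above.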