jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Generalize QuestionXXX dataset #64

Closed jeff1evesque closed 5 years ago

jeff1evesque commented 5 years ago

We need to generalize the implementation of the QuestionXXX dataset. Specifically, we need to be able to handle an arbitrary (xxx) number of parts of speech. Additionally, we may consider applying a logarithmic function to the arbitrary values assigned in our penn_scale:

penn_scale = {
    'CC': 1,
    'CD': 2,
    'DT': 3, 
    'EX': 4,
    'FW': 5,
    'IN': 6,
    'JJ': 7,
    'JJR': 8,
    'JJS': 9,
    'LS': 10,
    'MD': 11,
    'NN': 12,
    'NNS': 13,
    'NNP': 14,
    'NNPS': 15,
    'PDT': 16,
    'POS': 17,
    'PRP': 18,
    'PRP$': 19,
    'RB': 20,
    'RBR': 21,
    'RBS': 22,
    'RP': 23,
    'SYM': 24,
    'TO': 25,
    'UH': 26,
    'VB': 27,
    'VBD': 28,
    'VBG': 29,
    'VBN': 30,
    'VBP': 31,
    'VBZ': 32,
    'WDT': 33,
    'WP': 34,
    'WP$': 35,
    'WRB': 36
}

Lastly, we may need to abstract the logic required to generate the above classifier into utility functions outside the Jupyter notebook. We can then import them back into the notebook for use.
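
A minimal sketch of such a utility, assuming a hypothetical build_penn_scale helper that accepts any list of tags and an optional logarithmic smoothing flag (neither name exists in the repository yet):

import math

def build_penn_scale(tags, log_scale=False):
    '''Map an arbitrary list of POS tags to scale values.

    Intended to replace the hardcoded 36-entry dictionary above; when
    log_scale is True, the raw index is smoothed with a natural logarithm.
    '''
    return {
        tag: (math.log(i + 1) if log_scale else i)
        for i, tag in enumerate(tags, start=1)
    }

# example usage with a subset of the Penn Treebank tags above
penn_scale = build_penn_scale(['CC', 'CD', 'DT', 'EX', 'FW'], log_scale=True)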

jeff1evesque commented 5 years ago

One approach is to reindex the values, possibly in increments of 2. Parts of speech (pos) most closely related to questions would be assigned the largest values (the step does not have to be an increment of 2), while the least related tags (including stop words) would receive the lowest. Finally, we could smooth out the scale with a logarithmic function, as in the sketch below.
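
For illustration only, a rough sketch of this idea; the tag ordering below is an assumption for the example, not a decided ranking:

import math

# assumed ordering: tags most indicative of questions (WDT, WP, WP$, WRB)
# placed last so they receive the largest raw values; stop-word-like tags first
ordered_tags = ['DT', 'CC', 'IN', 'TO', 'PRP', 'NN', 'VB', 'WDT', 'WP', 'WP$', 'WRB']

# reindex in increments of 2, then smooth with a logarithm
raw = {tag: (i + 1) * 2 for i, tag in enumerate(ordered_tags)}
smoothed = {tag: round(math.log(value), 4) for tag, value in raw.items()}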

jeff1evesque commented 5 years ago

We don't have time to explore the different combinations of pos (their positions relative to one another) with different incremental values, and then check whether a logarithmic function is suitable. Instead, we'll apply the simplest solution, which assumes most sentences are 20-30 words in length. So, we'll allow 40 words: sentences shorter than this limit will be padded with a fixed constant, and sentences with more than 40 words will be truncated after the 40th word. In a future extension, applying a logarithmic function could help ease the effect of the padding constant. However, the current estimate of roughly 74% suffices.
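
A minimal sketch of that padding/truncation step, assuming sentences have already been encoded as lists of penn_scale values and using a hypothetical padding constant of 0:

MAX_WORDS = 40   # word limit described above
PAD_VALUE = 0    # assumed fixed padding constant

def pad_or_truncate(encoded_sentence, max_words=MAX_WORDS, pad_value=PAD_VALUE):
    '''Truncate an encoded sentence after the 40th word, or right-pad a
    shorter one with a fixed constant, so every vector has equal length.'''
    truncated = encoded_sentence[:max_words]
    return truncated + [pad_value] * (max_words - len(truncated))

# example: a 3-word sentence encoded via penn_scale becomes a 40-length vector
vector = pad_or_truncate([12, 27, 3])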