jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Implement syntactic parser to predict questions #62

Closed jeff1evesque closed 5 years ago

jeff1evesque commented 5 years ago

We need to implement a syntactic parser to determine whether a sentence is an inquiry. Additionally, we need to reinstate our mongodb code, which was accidentally removed.

jeff1evesque commented 5 years ago

The current random forest is 79% accurate at classifying sentences based on the Penn Treebank syntactic parse. However, we only used 1/3 of the provided data (i.e. S10). Additionally, we could probably do a better job of normalizing our arbitrary penn_scale, possibly with some logarithmic function (see the sketch after the mapping below):

penn_scale = {
    'CC': 1,
    'CD': 2,
    'DT': 3, 
    'EX': 4,
    'FW': 5,
    'IN': 6,
    'JJ': 7,
    'JJR': 8,
    'JJS': 9,
    'LS': 10,
    'MD': 11,
    'NN': 12,
    'NNS': 13,
    'NNP': 14,
    'NNPS': 15,
    'PDT': 16,
    'POS': 17,
    'PRP': 18,
    'PRP$': 19,
    'RB': 20,
    'RBR': 21,
    'RBS': 22,
    'RP': 23,
    'SYM': 24,
    'TO': 25,
    'UH': 26,
    'VB': 27,
    'VBD': 28,
    'VBG': 29,
    'VBN': 30,
    'VBP': 31,
    'VBZ': 32,
    'WDT': 33,
    'WP': 34,
    'WP$': 35,
    'WRB': 36
}
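
As a rough idea of the normalization mentioned above, a logarithmic transform could compress the arbitrary 1-36 values so the gap between high-valued tags carries less weight. This is only a sketch, not something we settled on; the +1 offset is an assumption to avoid log(1) = 0 collapsing the first tag:

import math

# compress the arbitrary penn_scale values with a logarithmic curve,
# so differences between high-valued tags carry less weight (sketch only)
penn_scale_log = {tag: math.log(value + 1) for tag, value in penn_scale.items()}

print(penn_scale_log['CC'])   # ~0.69
print(penn_scale_log['WRB'])  # ~3.61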

Since training took less than a quarter of a second, I'm not too worried about using a less performant algorithm.
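
For context, a minimal toy sketch of this kind of setup, assuming scikit-learn and fixed-length vectors of penn_scale values as features; the feature layout, padding value, and labels here are illustrative assumptions, not the actual project pipeline:

from sklearn.ensemble import RandomForestClassifier

# hypothetical toy features: each row is a fixed-length vector of penn_scale
# values (padded with 35); labels mark whether the sentence is a question
X = [
    [34, 32, 3, 12, 35, 35],   # e.g. 'WP VBZ DT NN' -> question
    [18, 28, 3, 12, 35, 35],   # e.g. 'PRP VBD DT NN' -> statement
]
y = [1, 0]

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)
print(clf.predict([[36, 27, 18, 27, 35, 35]]))  # e.g. 'WRB VB PRP VB'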

jeff1evesque commented 5 years ago

I actually concatenated the remaining two datasets, then bumped the number of random forest estimators from 10 to 1000. The accuracy dropped from 79% to 72%. One of the biggest differences with this amended dataset was sentence length: previously the largest sentence had 33 parts of speech, while the merged datasets contained a sentence with almost 70 parts of speech. By accounting for more data, we've introduced the potential for more outliers. So, we need to decide whether to implement a cutoff point for the number of parts of speech, or apply some kind of logarithmic function to smooth out the weights (this would decrease the default 35 used for None values); a sketch of the cutoff idea follows below. Reverting to the smaller dataset with our earlier 79% accuracy seems reasonable, but a smaller dataset could more likely correspond to overfitting.
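
To make the cutoff idea concrete, here is a hedged sketch of truncating sentences to a fixed number of parts of speech and padding shorter ones with the default value. It assumes the penn_scale mapping above; the function name is illustrative, and the max length of 33 and fill value of 35 come from the discussion in this thread:

def encode_sentence(tags, max_len=33, fill=35):
    """Map a list of Penn Treebank tags to a fixed-length vector of
    penn_scale values, truncating long sentences and padding short ones."""
    values = [penn_scale.get(tag, fill) for tag in tags[:max_len]]
    values += [fill] * (max_len - len(values))
    return values

# example: a short question becomes a 33-element vector, mostly filled with 35
print(encode_sentence(['WP', 'VBZ', 'DT', 'NN']))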

jeff1evesque commented 5 years ago

By accounting for more data, we've introduced the potential for more outliers.

Additionally, allowing bigger outliers also introduces a bigger gap in the spread of the data. This means the gap will be filled with the value 35 from our arbitrary penn_scale above.