jeff1evesque / ist-652

Syracuse IST-652 Final Project

#38: Reduce sparse feature set #39

Closed jeff1evesque closed 6 years ago

jeff1evesque commented 6 years ago

Resolves #38.

jeff1evesque commented 6 years ago

Unfortunately, all of our chi-square scenarios produced the same result:

[[ 0  0  0  0  0  0 31]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0 26]
 [ 0  0  0  0  0  0 11]
 [ 0  0  0  0  0  0 17]
 [ 0  0  0  0  0  0 39]]
error rate: 0.708955223880597
/usr/local/lib/python3.5/dist-packages/matplotlib/figure.py:448: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
  % get_backend())
[[ 0  0  0  0  0  0 31]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0 26]
 [ 0  0  0  0  0  0 11]
 [ 0  0  0  0  0  0 17]
 [ 0  0  0  0  0  0 39]]
error rate: 0.708955223880597
[[ 0  0  0  0  0  0 31]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0 26]
 [ 0  0  0  0  0  0 11]
 [ 0  0  0  0  0  0 17]
 [ 0  0  0  0  0  0 39]]
error rate: 0.708955223880597
[[ 0  0  0  0  0  0 31]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0 26]
 [ 0  0  0  0  0  0 11]
 [ 0  0  0  0  0  0 17]
 [ 0  0  0  0  0  0 39]]
error rate: 0.708955223880597
[[ 0  0  0  0  0  0 31]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0  5]
 [ 0  0  0  0  0  0 26]
 [ 0  0  0  0  0  0 11]
 [ 0  0  0  0  0  0 17]
 [ 0  0  0  0  0  0 39]]
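
As a quick check on the output above, the reported error rate can be recomputed from the matrix itself. This is only a sketch: it assumes the usual convention that rows are true classes and columns are predicted classes, which the output does not state, and the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true classes, columns are predicted classes.
confusion = [
    [0, 0, 0, 0, 0, 0, 31],
    [0, 0, 0, 0, 0, 0, 5],
    [0, 0, 0, 0, 0, 0, 5],
    [0, 0, 0, 0, 0, 0, 26],
    [0, 0, 0, 0, 0, 0, 11],
    [0, 0, 0, 0, 0, 0, 17],
    [0, 0, 0, 0, 0, 0, 39],
]

# Total number of test articles and the number on the diagonal (correct).
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
error_rate = 1 - correct / total

# Column sums: how many test articles were assigned to each class.
predicted_counts = [sum(col) for col in zip(*confusion)]

print("total:", total, "correct:", correct)
print("error rate:", error_rate)
print("predictions per class:", predicted_counts)
```

Under that assumption, all 134 test articles were assigned to the last class, so only the 39 articles that truly belong to it count as correct, and the error rate is 1 - 39/134, roughly 0.709. That matches the figure printed for every scenario and suggests the feature reduction did not change what the classifier predicts.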

This likely indicates that the category scheme was poorly constructed and that the number of articles was too small. For comparison, the 20 Newsgroups dataset comprises roughly 20,000 newsgroup posts. In our case, we originally harvested 500 articles, then manually labelled each article with a specific category. Any article with "category": "other" was removed. This left fewer than 150 articles for train + test combined, which is not a sufficient amount of data for modeling.
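
For readers unfamiliar with the chi-square scenarios referenced in this thread, here is a minimal sketch of how a chi-square score can rank term-count features against class labels, in the same spirit as scikit-learn's `chi2`. The code from this PR is not shown here, so the function names (`chi2_scores`, `top_k_features`), the toy corpus, and the labels are all hypothetical.

```python
def chi2_scores(X, y):
    """Chi-square score per feature column for non-negative count features.

    X: list of rows (term counts per document); y: list of class labels.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))

    # Observed: total count of each feature within each class.
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            observed[label][j] += value

    # Expected under independence: class frequency * overall feature total.
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    class_prob = {c: y.count(c) / n_samples for c in classes}

    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            expected = class_prob[c] * feature_total[j]
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: 4 documents, 3 term-count features.
X = [
    [3, 0, 1],
    [2, 0, 1],
    [0, 4, 1],
    [0, 3, 1],
]
y = ["sports", "sports", "politics", "politics"]

scores = chi2_scores(X, y)
selected = top_k_features(scores, k=2)
print(scores, selected)
```

In this toy case the third feature occurs equally in both classes and scores zero, so keeping the top two features drops it. Scores like these are computed from per-class counts, so with fewer than 150 labelled articles spread over seven categories they rest on very little evidence.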