The content of this repository was created for an independent study completed during the spring semester of 2017. A complete writeup of this project can be found in the paper written about it: "Semi-Supervised Labeling and Classification of Words by Semantic Subject".
Goal:
(deprecated) Build WordVectors for Classification
Classify words based on subject
Implementation:
source performs web-scraping of websites, tokenization of the scraped data, recording into a database, and exporting lists of tokenized words of any size (for example, 1M or 12M words) and any tokenization type (keep stop words, remove stop words, etc.).
features contains all functionality for converting source exports to embeddings, analyzing embeddings, and labeling words, in order to create features that can be used in classification.
classify contains a classification 'pipeline', including:
###########################
## Add split arguments - the splits will be generated and all will be used for each classification_argument set
###########################
split_arguments.append({
    "sampling": ["SMOTE"],
    "SM": [3, 8, 15],
})
split_arguments.append({
    "sampling": ["over"],
    "SM": [3, 8, 15],
})
###########################
## Add classification arguments
###########################
classification_arguments.append({
    "classifier_choice": ["rf"],
    "rtrue": [1, 5, 10, 20, 30, 40, 50],
})
classification_arguments.append({
    "classifier_choice": ["nn"],
    "epochs": [400],
    "learning_rate": [0.1, 0.025],
    "n_hidden_1": [40, 20, 10, 5, 2],
    "n_hidden_2": [40, 20, 10, 5, 2],
    "rtrue": [1, 10, 30, 50],
})
classification_arguments.append({
    "classifier_choice": ["svm"],
    "kernel": ["linear"],
})
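The argument sets above map each parameter name to a list of candidate values. One plausible reading, and it is only an assumption about how the pipeline consumes them, is that each set is expanded into every combination of its values, and every generated split is then paired with every classifier configuration. The sketch below illustrates that reading with a hypothetical helper, expand_grid, which is not part of this repository, and with a shortened copy of the argument lists.

```python
from itertools import product


def expand_grid(argument_sets):
    """Expand a list of {name: [values]} dicts into one dict per combination.

    Hypothetical helper for illustration; the repository's own expansion
    logic may differ.
    """
    combinations = []
    for argument_set in argument_sets:
        names = list(argument_set)
        # Cartesian product over the value lists of this argument set.
        for values in product(*(argument_set[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations


# Shortened stand-ins for the argument lists shown above.
split_arguments = [
    {"sampling": ["SMOTE"], "SM": [3, 8, 15]},
    {"sampling": ["over"], "SM": [3, 8, 15]},
]
classification_arguments = [
    {"classifier_choice": ["rf"], "rtrue": [1, 5, 10, 20, 30, 40, 50]},
    {"classifier_choice": ["svm"], "kernel": ["linear"]},
]

splits = expand_grid(split_arguments)
classifiers = expand_grid(classification_arguments)

# Assumed pairing: every split is used with every classifier configuration.
runs = [(split, clf) for clf in classifiers for split in splits]

print(len(splits), len(classifiers), len(runs))  # 6 8 48
```

Under this reading the number of runs grows multiplicatively: the "nn" set above would contribute 1 x 2 x 5 x 5 x 4 = 200 configurations on its own, each repeated for every split.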
The work starts with the subject "plants", as in "houseplants", "gardening", etc.