As requested, this PR swaps out authors as the model in our SVM code in favor of novels. (As you'd expected, a wider range of computed scores results when the SVM is weighted heavily: a single novel is going to be a fuzzier thing than a collection of novels from an author.) Overall, though, the results from make_auto_scatterplot.py seem in line with what was expected.
Other notes:
Chapter comparison code was removed from do_svm.py. It shouldn't have been there, and was actually nonsensical. Comparisons are now rightly handled in auto_author_prediction.py.
Consequently, svm.db is much smaller, even for Eltec.
The splits_for_svm chapters are no longer needed and have been removed.
The author-based do_svm.py is now in arch/. Please leave it there.
database_ops.py has a new function to get all the 'novel' names. In this case, a 'novel name' is the splits dir, broken at the hyphen. This is needed for consistent comparisons, and -- as we only allow one novel per dir -- seems to work fine.
Breaking change: explore-svm.py needs refactoring. Please use other means of reading svm.db for now!
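For reviewers, here is a rough sketch of the 'novel name' rule described in the database_ops.py note above. It is not the code in this PR: the function name is hypothetical, the example directory name is made up, and it assumes the name is the part of the splits dir before the first hyphen.

```python
from pathlib import Path


def novel_name_from_splits_dir(splits_dir: str) -> str:
    """Hypothetical helper: derive a 'novel name' from a splits directory.

    Assumption: the novel name is everything before the first hyphen in the
    directory's base name. A name with no hyphen is returned unchanged.
    """
    base_name = Path(splits_dir).name  # drop any parent directories
    name, _sep, _rest = base_name.partition("-")
    return name


if __name__ == "__main__":
    # "somenovel-splits" is an illustrative directory name, not one from the repo.
    print(novel_name_from_splits_dir("data/somenovel-splits"))
```

Because only one novel is allowed per dir, a rule like this gives one stable key per novel to compare on.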
Please test before merging. Also, start with ./begin.sh to ensure a clean run.
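While explore-svm.py is out of action, one possible way to sanity-check svm.db after a test run is a quick look with Python's standard sqlite3 module. This is only a sketch: it assumes svm.db is an SQLite file, the function name is hypothetical, and it makes no assumptions about table names (it just lists whatever tables exist with their row counts).

```python
import sqlite3


def summarize_db(db_path: str) -> dict:
    """Return {table_name: row_count} for every table in an SQLite file.

    Assumption: the file at db_path (e.g. svm.db) is an SQLite database.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in schema catalog.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for table in tables:
            # Quote the table name in case it contains unusual characters.
            quoted = '"' + table.replace('"', '""') + '"'
            counts[table] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    for table, n_rows in summarize_db("svm.db").items():
        print(f"{table}: {n_rows} rows")
```

Note that sqlite3.connect creates an empty file if the path does not exist, so run this from the directory that actually holds svm.db.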