Ok, so, some of this was covered in #4, which is likely worth revisiting, and these changes are in response to #11.
Here is the current state of things (with a few minor functional tweaks):
do_svm.py will do a few related things:
It will ensure that all the necessary folders exist, and exit (with a message) if any are missing. (It creates the missing directories, but you still have to put texts in them; that's why it exits.)
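A minimal sketch of that check; the required list here is a guess beyond splits_for_svm and testset, which are the only folders this write-up names:

```python
import os
import sys

# Folders the script expects; splits_for_svm and testset come from this
# write-up, anything else depends on the project layout.
REQUIRED = ["splits_for_svm", "testset"]

def ensure_folders(project_name):
    missing = []
    for name in REQUIRED:
        path = os.path.join(project_name, name)
        if not os.path.isdir(path):
            os.makedirs(path)  # create it, but it starts empty...
            missing.append(path)
    if missing:
        # ...which is why we exit: the user needs to fill these first.
        sys.exit(f"Created {', '.join(missing)} -- put your texts there and re-run.")
```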
It will create the database (project_name/db/svm.db) to hold the output.
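Something along these lines; the actual table layout isn't documented here, so the schema below is purely illustrative:

```python
import os
import sqlite3

def create_db(project_name):
    db_dir = os.path.join(project_name, "db")
    os.makedirs(db_dir, exist_ok=True)
    con = sqlite3.connect(os.path.join(db_dir, "svm.db"))
    # Hypothetical schema -- one row per (chapter, candidate author) score.
    con.execute("""CREATE TABLE IF NOT EXISTS results (
                       chapter TEXT,
                       author  TEXT,
                       score   REAL,
                       seen    INTEGER)""")
    con.commit()
    return con
```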
It will process the files in splits_for_svm (same structure as splits... splits_for_svm/novel/chapters...). This involves removing the TEI markup and some normalization.
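The exact cleanup isn't spelled out here, but the shape of it is roughly this (the lowercasing is an assumption):

```python
import re

def clean_text(raw):
    text = re.sub(r"<[^>]+>", " ", raw)       # strip the TEI/XML tags
    text = text.lower()                        # assumed normalization: case-fold
    text = re.sub(r"\s+", " ", text).strip()   # ...and collapse whitespace
    return text
```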
Next, it will test itself by dividing the seen texts into training and testing sets and evaluating the results.
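In scikit-learn terms the self-test would look something like this (tf-idf features and a linear SVM are my assumptions, not necessarily what the script uses):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def self_test(texts, authors):
    X_train, X_test, y_train, y_test = train_test_split(
        texts, authors, test_size=0.2, stratify=authors, random_state=0)
    vec = TfidfVectorizer(sublinear_tf=True)
    clf = LinearSVC().fit(vec.fit_transform(X_train), y_train)
    return accuracy_score(y_test, clf.predict(vec.transform(X_test)))
```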
It will assess authorship likelihood for the seen texts and deposit this information in the db.
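SVMs don't give probabilities out of the box, so to land in the 0-1 range you'd typically calibrate; a sketch, reusing the hypothetical results table from above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

def score_seen(con, vec, texts, chapters, authors):
    # Calibration maps the SVM's raw margins into 0-1 probabilities
    # (an assumption; the script may store raw decision values instead).
    X = vec.transform(texts)
    clf = CalibratedClassifierCV(LinearSVC()).fit(X, authors)
    for chapter, row in zip(chapters, clf.predict_proba(X)):
        for author, p in zip(clf.classes_, row):
            con.execute("INSERT INTO results VALUES (?, ?, ?, 1)",
                        (chapter, author, float(p)))
    con.commit()
    return clf
```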
Next, if the db exists (spoiler: it probably does), we ask if the user wants to rebuild.
If not, move on to unseen tests.
If yes, do all the above work and then move on to unseen tests.
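The prompt itself is nothing fancy; roughly:

```python
import os

def wants_rebuild(db_path):
    if not os.path.exists(db_path):
        return True  # no db yet, so there's nothing to ask about
    answer = input("svm.db already exists -- rebuild it? [y/N] ")
    return answer.strip().lower().startswith("y")
```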
We now process all the texts in the testset folder against our model. The results are stored in the db.
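Continuing the same sketch (same cleanup, same calibrated model), the unseen pass could be:

```python
import glob
import os
import re

def score_unseen(con, vec, clf, testset_dir):
    for path in sorted(glob.glob(os.path.join(testset_dir, "*"))):
        with open(path, encoding="utf-8") as fh:
            # same cleanup as the seen texts
            text = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", fh.read())).lower()
        for author, p in zip(clf.classes_, clf.predict_proba(vec.transform([text]))[0]):
            con.execute("INSERT INTO results VALUES (?, ?, ?, 0)",
                        (os.path.basename(path), author, float(p)))
    con.commit()
```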
From here, the best thing to do is `cd explore; python3 explore-svm.py`
Choose the project you want to explore.
For option 1, you can see all the raw results from the seen set. (I'm pretty sure this needs to be massaged for display, as it's ugly/boring/unnecessary? Pick two.)
Honestly, the more interesting option is 2, to explore the unseen texts. This will let you choose a chapter, and then it will give you a score (0 to 1) against each of the seen authors. It will even make a really boring bar graph of those scores (the example was Dracula ch. 10).
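A minimal matplotlib version of that chart, assuming the scores come back as an author-to-probability dict:

```python
import matplotlib.pyplot as plt

def plot_scores(chapter, scores):  # scores: {author: 0-1 probability}
    plt.bar(list(scores), list(scores.values()))
    plt.ylim(0, 1)
    plt.ylabel("score")
    plt.title(chapter)
    plt.show()
```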
As for what's next, I honestly don't know. I get that the desire is to integrate these scores into the broader application, but I'm not sure there's room for an apples-to-apples comparison. (One model is guessing based on a set of features whether or not two texts were made by the same author, and the other is comparing an unknown text against a known author). But if you had to integrate them, then I guess something like this:
Set a threshold for yes or no. (Maybe re-use the one from the rest of the project?)
Score the text. (I think we use the first 8 significant digits elsewhere, but I'd have to check.)
Check against the threshold and the known authors, using the Y or N to derive the YN/NY answers.
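Putting those three steps together, with heavy hedging: THRESHOLD is a placeholder, and the guess-then-truth reading of YN/NY is my interpretation, not an established convention here.

```python
THRESHOLD = 0.5  # placeholder -- presumably reuse the project-wide value

def verdict(score, text_author, model_author):
    rounded = float(f"{score:.8g}")  # 8 significant digits, if that's the convention
    guess = "Y" if rounded >= THRESHOLD else "N"
    truth = "Y" if text_author == model_author else "N"
    return guess + truth             # e.g. "YN" = model says yes, truth says no
```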
Then again, if the unseen texts' authors are not otherwise present in splits_for_svm, then all answers should be N or YN.
For this reason, and for a fair test, it seems like you'd want the author to already be part of the model, right?
From there, I guess you could pretend that this fits in with the Jaccard calculations done elsewhere... but I'm still not convinced it does.
I now know how to do the apples-to-apples comparison: it comes down to fitting a regression curve against the known authors to find the correct weight. I'll start on it once I add this to main.
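If I'm reading the idea right, this is close to Platt scaling: fit a logistic curve to the SVM's raw scores on known-author examples, then use that curve as the weight. A sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weight(raw_scores, is_same_author):
    # raw_scores: SVM decision values for known texts;
    # is_same_author: 0/1 ground truth for each one.
    lr = LogisticRegression()
    lr.fit(np.asarray(raw_scores).reshape(-1, 1), is_same_author)
    return lr  # lr.predict_proba(new_scores)[:, 1] is the calibrated 0-1 score
```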