frankenstein-v2

Acknowledgements and License information

Metadata from Shelley-Godwin Archive: Creative Commons 1.0
Text processing: Datamuse API

About

The main objective of this project is to provide an accessible gold standard text for the authorship attribution of the 19th-century novel Frankenstein. The text is constructed from the hand annotations of the draft in the Shelley-Godwin Archive (SGA). The TEI .xml files used in the SGA are multifunctional and are therefore not optimized for any single purpose. The files provided in this repository have been constructed with authorship attribution in mind. Some features:

- Two .json list objects: one with consecutive stretches of text by the same author and one with the corresponding authors
- Intelligent word parsing using the Datamuse API and dozens of heuristic rules. The SGA annotations do not allow for trivial word parsing; thanks to these rules, the text in this repo contains fewer parsing errors than the reading text on the SGA website.
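
A minimal sketch of how the two parallel lists could be consumed. The inline JSON strings, the placeholder fragment texts, and the hand labels `mws` and `pbs` are assumptions for illustration; the actual file names and label format in this repo may differ. Real files would be read with `json.load` in the same way.

```python
import json
from collections import Counter

# Stand-ins for the two .json lists (hypothetical content, not taken from the repo).
fragments_json = '["the first stretch ", "a second stretch ", "and a third"]'
hands_json = '["mws", "pbs", "mws"]'


def words_per_hand(fragments, hands):
    """Count whitespace-separated words attributed to each hand."""
    if len(fragments) != len(hands):
        raise ValueError("the two lists must have the same number of elements")
    counts = Counter()
    for text, hand in zip(fragments, hands):
        counts[hand] += len(text.split())
    return counts


counts = words_per_hand(json.loads(fragments_json), json.loads(hands_json))
print(dict(counts))
```

Because the lists are index-aligned, `zip` is enough to pair every fragment with its author; no lookup table is needed.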

As a secondary objective, this project presents an initial comparison between the gold standard hand annotation and a stylometric analysis of Frankenstein that is based on other work by Mary and Percy Shelley. The analysis will consist of the following:

- Principal Component Analysis: Influential words in the PCA of other work by Mary and Percy Shelley can be compared to the relative frequency of those words in parts of Frankenstein written by Mary and Percy respectively.
- A rolling classification of Frankenstein can be attempted and compared to the hand annotation.

The tertiary objective of this project relies on the outcome of the rolling classification of Frankenstein. Initial tests show that a rolling classification with a sample size of 1000 words and an overlap of 900 words identifies an authorial shift to Percy towards the end of the novel. This is in line with the hand attribution by Charles Robinson. Interestingly, the rolling classification does not identify this change at a sample size of 5000 words and an overlap of 4500 words. This suggests that a larger sample size may not always be better in authorship attribution of collaborative texts, because resolution decreases as samples grow. In other words, smaller sample sizes may be used to increase resolution at the cost of accuracy.
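
The sampling scheme described above can be sketched as follows. This is an illustration only, not the code in rollingStylo.R; the function name `rolling_windows` and the dummy token list are assumptions.

```python
def rolling_windows(tokens, size=1000, overlap=900):
    """Yield (start_index, window) pairs of `size` tokens.

    Consecutive windows share `overlap` tokens, so each window starts
    `size - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    for start in range(0, len(tokens) - size + 1, step):
        yield start, tokens[start:start + size]


# Dummy text of 1200 tokens, purely to show the window positions.
tokens = ["w"] * 1200
starts = [start for start, _ in rolling_windows(tokens, size=1000, overlap=900)]
print(starts)
```

With a size of 1000 and an overlap of 900, a new window starts every 100 words; with 5000 and 4500, every 500 words. In the larger setting each sample also mixes in more surrounding text, so a short passage by another hand contributes a smaller share of every sample that contains it.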

Notes on composition

The presented text has been composed so as to resemble the 1818 edition of the novel while maintaining insight into the contribution of Percy Shelley. As such, the text is taken from the 1816-1817 draft up until the last few pages of Chapter 18. From that point onwards the text has been taken from the Fair Copy so that Percy's contributions to those final pages are reflected in the final text. As Robinson (2008, p. 29) notes:

As we move from the extant 1816-1817 Draft to the first edition of 1818, we note the following differences: minor changes that Mary Shelley made to the Draft when she fair-copied it; some substantial changes that Percy Shelley made to the Draft when he wrote out the last twelve-and-three-quarter pages of the Fair Copy;

Furthermore, as Robinson notes (2008, p. 41), the following sections are missing from the 1816-1817 draft:

from Volume I, the four introductory letters from Walton to his sister Margaret and the first part of Chapter 1; and from Volume II almost half of Chapter 3 and all of Chapter 4.

I have chosen not to replicate these sections from the 1818 version as we do not know who wrote them.

To do (crucial items in bold):

implement transpositions e.g. c57-0157, c57-0179, c57-0074, c57-0043, c56-0108, c56-0038
- needs more testing; check whether I didn't screw up subzone processing
- ~~transposition list should be made when a zone is first processed~~
- ~~when an element or line is encountered with an xml:id in the transposition list~~
- ~~the contents of that element should not be processed~~
- ~~the contents of the element corresponding to the first id in the transposition list should be processed~~
  - ~~if the element has a spanTo processing follows the subzone routine~~
  - ~~else (element is a line) --> processLine~~
- ~~the first id in the transposition list should be moved~~
- ~~if the original element has a spanTo~~
  - ~~delspan = True~~
  - ~~delspan_id = spanTo id~~
- ~~else (original element is a line) --> nothing~~
~~displacement not processed correctly in c56-0108 (see picture in article)~~
why no revision in 57-0097 he + f + inds
illustrate/plot sampling effect using actual results (proportion of pbs classifications at different sample sizes)
check whether macroF1 is better for smaller sample sizes (1500) than larger sample size (4000)
- should be true if larger sample sizes result in classification that is identical to baseline (majority) classification, because performance is already better than baseline
- note that macroF1 gives equal importance to performance in both classes, which does not flatter the performance of unbalanced datasets
add n-gram features
add POS features
~~implement more intelligent splitting rule of in postProcessing.py and rollingStylo.R~~
morning is separated into mor and ning in c57-0101
~~implement log file~~
~~composition of final text that resembles 1818 edition while maintaining insight in hand shifts: ch 18 of vol II to c57-0180 element 12 --> ch 7 of vol III from c58-0047 element 7~~
implement a check on number of Datamuse API calls that pauses the script for 24 hours
- record and update daily api calls in a file (so the script "remembers")
~~ignore chapter headings, e.g. 56-0081~~
~~remove quotation marks that are within words due to EOL split, e.g. in 56-0068~~
How do we handle notes by the compositor, e.g. 58-0037
- ~~use milestone annotation as delspan~~
- replace all comp annotations with mws annotations in post processing
add exception list for words with deviant spellings and for unique words, e.g. massercring in 57-0039, interspered in 56-0122, precipieces in 56-0116, dissapeared in c56-0083
add 1818 edition lookup to regular join/separate algo?
implement tail text of mod inside of mod (e.g. 57-0039) or mod inside of add. Currently only the order of non-hierarchical tags within a mod is handled correctly. In 57-0039, only the tail of the nested mod gets printed, because all child tags of the nested mod are only checked against the nested mod (the first mod upstream is used)
- ~~include add as a possible upstream parent tag~~
- ~~for each tag that is the final tag in its immediate parent:~~
- ~~print the tail text of the immediate parent (as we do now), but also~~
- ~~check whether the parent is itself nested in add/mod/hi, and if so~~
- ~~check its position within that parent by comparing the children of the nested add/mod/hi with the children of the parent add/mod/hi:~~
  - ~~if the final tags are identical, the tail text of the parent add/mod/hi should also be printed~~
  - ~~if the final tags are not identical, we don't have to do anything as the printing of the tail text will be handled when the final child (a sister to the nested add/mod/hi) is encountered~~
~~implement restoration, see guidelines~~
- ~~text in non-del tags within a del within a restore (implemented but not tested)~~
- ~~text in del tags within a restore (implemented and tested)~~
- ~~text in non-del tags within delSpan? (not encountered/implemented)~~
- ~~pages: '56-0051', '56-0106', '56-0114', '56-0122', '57-0033', '57-0039'(2x), '57-0057', '57-0082', '57-0087'(2x), '57-0096', '57-0104', '57-0116', '57-0145'(2x), '57-0158', '57-0166'~~
add fix for 56-0064?
~~implement cross-linear modifications (not implemented by website), e.g their experience & to feelings one another in 57-0019, see also guidelines~~
- ~~for deletions it doesn't matter, but for additions it does~~
- ~~does this occur across pages? (implemented but not tested)~~
- ~~addspan tags, e.g 56-0056~~
- ~~different zone on same page, e.g 56-0056~~
- ~~implement printing of (tail)text within children elements of <add> --> we need a processElement function which processes mod, add & hi elements~~
- ~~ignore references to another zone, processing of this text is already handled by anchors~~
- needs more testing, pages: '56-0011', '56-0005', '56-0006', '56-0008', '56-0014', '56-0010', '56-0022', '56-0034', '56-0038', '56-0075', '56-0079', '56-0082'(2x), '56-0084', '56-00104', '56-0110', '56-0113', '56-0116'(2x), '56-0128', '57-0013', '57-0015', '57-0019', '57-0031', '57-0030', '57-0039', '57-0043', '57-0045', '57-0070', '57-0077', '57-0111', '57-0146', '57-0147', '57-0148', '57-0168'
~~change join/separate algorithm to product of first and second part 56-0022: in + dulged should be indulged because score of dulged is 0~~
~~change word scoring algorithm: the final score is the product of score and frequency instead of the average. This prevents the selection of non-words such as atthis instead of at and this in 56-0032~~
- ~~fall back on old algorithm if the best score is 0 (due to one word with a score of 0)~~
~~add support for subzones from other pages: e.g. <anchor xml:id="c56-0089.01"/> in 56-0088 refers to <addSpan corresp="#c56-0089.01" spanTo="#c56-0089.02"/> in 56-0089~~
~~get latest page files from SGA repo~~
process text:
- if curline_part is a number and preceded by a non-number insert a space, e.g. in 58-0001 th + 17 --> joined
- if th, rd, st, nd is preceded by a number no space should be inserted
- ~~remove redundant newlines, spaces~~
- ~~handle EOL hyphens~~
- handle EOL + SOL, e.g. in 56-0068, in- + -supportable; and SOL hyphens, e.g. in 56-0115, dis + -turb; in 58-0049, hideous + -ness
- convert & to and in post processing
- handle capitalization, punctuation
- a full stop should be added when the first word of a line starts with an uppercase letter that is not I, a name, or part of initials / a title
- ~~a full stop should also be added before a milestone tag, which represent paragraph breaks~~
- necessary for POS tagging
~~support processing of individual pages (for debugging purposes)~~
maybe we need a double-checking mechanism after two consecutive JOIN/SEPARATE operations where the first operation's curline_part corresponds to the second operation's prevline_part
- ~~e.g. in 56-0111 app-SEP-r-JOI-oached --> app roached, and~~
- ~~e.g. in 56-0111 a-SEP-n-JOI-obscure --> a nobscure we check all possibilities once more~~
- ~~implementation:~~
- ~~register whether previous text also required a JOIN/SEPARATE operation in processText using a global variable~~
- ~~if so, check whether there are any whitespaces in previous_addition~~
- ~~if not, use regex to find the 'word' that precedes previous_addition in print_text~~
- ~~get score for all combos: "a", "b", "c", "ab", "bc", "abc"~~
- needs more testing / optimization / heuristics
- ~~if multiple combos have a word with a 0-score, check whether 0-score words exist in previous print_text/1st edition, e.g. in 58-0001 Lavenz + a + Geneva, Lavenza (i.e. ab_c) should be chosen~~
  - ~~if for none of the 0-score combos, the 0-score word can be found in print_text/1st edition, fall back on the old algorithm~~
- if the abc-option has a score higher than a certain threshold, it should always be chosen, e.g. in 56-0036 on + c + e should become once rather than on ce
- ~~handle punctuation marks or don't consider them in word look-up, e.g. —"Dam + n + "the in 56-0068~~
- maybe if part_a consists of an uppercase letter that is not I, we should not go through revision process, e.g. Lavenza in c58-0001 (names are not well recognized in Datamuse) or M + r + . --> M r. in 56-0068.
- ~~find heuristic that turns have + g + one into have gone rather than have g one in 56-0032~~
  - ~~check for single consonant in "a" and "b" parts~~
  - ~~if single consonant in "a", then "a_b_c" and "a_bc" are no longer an option~~
  - ~~if single consonant in "b" and "a" score is better than "ab" and "abc" score, then "a_bc" is the only option~~
- ~~find heuristic that turns by + the + desire into by the desire rather than bythe desire in 56-0012~~
  - ~~solved through change in scoring algorithm (product instead of average)~~
~~printing of tail text needs further adjustment: Laavenz instead of Lavenza in c58-0001 (nested <hi> tags) or the r of Mr ending up as M Krempe commenced ran eulogy of himself in 56-0068~~
check attribution accuracy
optimize dataMuse calls
- word definitions perhaps use &md=d to only give scores > 0 to words which have a definition
- In 56-0028 they is split up into the and y due to higher score of the compared to they. By incorporating frequency in algorithm this can be prevented: sp=they&md=f > sp=the&md=f & sp=y&md=f
- word context:
- sp=sametime has a higher score than the mean of sp=same and sp=time, but it has a lower score than the mean of sp=same&rc=time and sp=time&lc=same
- maybe even use the word before prevline_part: in 56-0022 the mean score of sp=be and sp=en is higher than the score of sp=been, but when we include the previous word have as context, sp=been&lc=have has a much higher score than sp=be&lc=have. Same with sp=me&lc=given and sp=mean&lc=given when considering me + an in 56-0022
  - ~~first get left context~~
  - ~~make left context an optional parameter in callDatamuse, so that we don't have to implement it in double-checking mechanism yet~~
- ~~if curline_part ends in a punctuation mark, ignore that mark when calling datamuse (this prevents incorrect separations)~~
~~add text from <unclear> and <damage> (e.g. 57-0111)~~ and <retrace> (e.g. 57-0013) tags
- ~~should we process text from <unclear>? See example in guidelines~~
- ~~only if text is 1 character (these are punctuation marks)~~
- ~~words are omitted in 1818 version~~
~~add functionality for references to displaced text within same zone:~~
- ~~scan zone for displacements in processZone,~~
- ~~look for metamark function="displacement" with an xml:id~~
- ~~put displacement IDs into list~~
- ~~check for addSpans with those IDs in both processLine and processZone~~
- ~~if addSpan with displacement ID is found : put ID in delspan_id, delspan = True and break out~~
- ~~if metamark="displacement" with id (in either line or zone) is found, processSubZone~~
implement hand attribution
- ~~who is comp, e.g. in 58-0002 --> TEI ODD: unknown compositor~~
- ~~Types:~~
- ~~<add place="superlinear" hand="#pbs">power</add> --> processLine~~
- ~~<handShift new="#pbs"/> e.g. in 58-0053 --> processLine, processZone~~
- ~~<addSpan hand="#pbs" spanTo="#c56-0026.05"/> --> processLine, processZone (in processed elements, delspan and skipped lines)~~
- output:
- ~~list that consists of consecutive fragments with the same hand~~
- ~~list with same amount of elements and hand labels that correspond to fragments~~
- ~~.json format~~
- ask how hand changes within a word should be handled
  - allow them
  - or:
  - if only 1 hand change occurs within a word --> the whole word is attributed to the later hand
  - if the hand changes back again within a word, e.g. [and ][t][his problem] --> the whole word is attributed to the within-word change, e.g. [and ][this][ problem]
~~Implement limitations in volume files:~~
- ~~finish adding fromLine and toLine attributes in volume files~~
- ~~line counter for use with fromLine and toLine attributes~~
~~text within < hi > should not be printed if in < metamark >~~
~~add support for delspans that are initiated within other delspans~~
~~processLine should probably be refactored so it differentiates between different levels of tags, but then we need a solution for tail text~~
- ~~solution is a processElement function that is called from processLine and itself~~
- ~~processline first creates a list of top-level children~~
- ~~for each allowed element type processElement is called~~
- ~~processElement does the following~~
  - ~~checks whether it is a del, metamark or anchor, if not~~
  - ~~print element text~~
  - ~~create a list of the element's children~~
  - ~~calls itself for each child~~
  - ~~prints element tail text~~
- ~~this way we solve the tail text problem using recursion~~
- ~~and we have an elegant solution for cross-line additions~~
- ~~processLine still handles anchors, delspans etc. providing these are never nested~~
  - ~~anchors and metamarks are sometimes nested; addspan and delspan aren't~~
  - ~~anchors (for all functions) and metamarks need to be handled in processElement~~
- Needs more testing:
- ~~run through all files~~
- hand attribution
  - ~~<add>~~
  - ~~<addSpan>~~
  - ~~<handShift>~~
  - test across pages
- test restore
- ~~cross-linear additions~~
- ~~displacement~~
  - ~~different zone~~
  - ~~different page, 56-0028~~
  - ~~subzone~~
  - ~~on same page, examples: 57-0024, 57-0009~~
  - ~~on different page, examples 56-0088~~
Do we want to correct shortcomings/mistakes of the TEI annotations or do we just follow the SGA reading text?
- using metamarks rather than anchors to reference displacements from another zone e.g. 56-0011 and 57-0103 '56-0012', '56-0025', '56-0031', '56-0039', '56-0045', '56-0048', '56-0058', '56-0059', '56-0060', '56-0063', '56-0069', '56-0071', '56-0071', '56-0076', '56-0077', '56-0079', '56-0082', '56-0083', '56-0087', '56-0088', '56-0093', '56-0099', '56-0111', '56-0112', '56-0113', '56-0115', '57-0005', '57-0010', '57-0012', '57-0021', '57-0022', '57-0037', '57-0037', '57-0038', '57-0040', '57-0041', '57-0041', '57-0042', '57-0049', '57-0059', '57-0074', '57-0098', '57-0159', '57-0161', '57-0169', '57-0183', '57-0183'
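
One of the to-do items above asks whether macro-F1 favours smaller sample sizes when larger samples collapse into the majority-class baseline. The sketch below shows why a majority-only classification scores poorly on macro-F1 even though its accuracy is high. The label counts (90 `mws`, 10 `pbs`) and both prediction lists are made up for illustration; they are not results from this project.

```python
def f1_for_class(gold, pred, label):
    """F1 for one class; defined as 0.0 when the class is never predicted correctly."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, label) for label in labels) / len(labels)


# Hypothetical gold labels: 90 samples by mws followed by 10 by pbs.
gold = ["mws"] * 90 + ["pbs"] * 10
# Baseline: everything is assigned to the majority class.
majority = ["mws"] * 100
# A classifier that finds 6 of the 10 pbs samples but mislabels 5 mws samples.
mixed = ["mws"] * 85 + ["pbs"] * 5 + ["pbs"] * 6 + ["mws"] * 4

print(round(macro_f1(gold, majority), 3))  # 0.474
print(round(macro_f1(gold, mixed), 3))     # 0.761
```

In this made-up case the two classifications differ by a single correct sample in accuracy (90 versus 91 out of 100), while macro-F1 separates them clearly because the minority class counts as much as the majority class.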

timjzee / frankenstein-v2
