Text-level statistics: wishlist

JackWilton1594 commented 3 months ago

Hi team,

As per Martin's request, here's what I'm thinking would be useful for text-level statistics.

- Count of word-types by character
- Percentage of total dialogue spoken by character (calculated by word-token); total dialogue should include prologues, epilogues, and similar
- Visualisations of the same (e.g. bar charts; pie chart for distribution of total dialogue)
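
To make the first two items concrete, here is a minimal Python sketch. It assumes the speeches have already been extracted as (speaker, text) pairs; the sample data, the function names, and the naive tokenizer are all illustrative and not part of LEMDO. "Word-types" is read here as distinct word forms, which is an assumption.

```python
import re
from collections import Counter, defaultdict

# Hypothetical input: (speaker, speech text) pairs, prologue included.
SPEECHES = [
    ("Prologue", "Attend all"),
    ("King", "The crown, the crown is mine"),
    ("Fool", "Mine is the jest"),
]

# Naive tokenizer (an assumption): runs of letters, allowing internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word-tokens."""
    return TOKEN_RE.findall(text.lower())


def character_stats(speeches):
    """Return per-speaker token count, type count, and share of all dialogue."""
    counts_by_speaker = defaultdict(Counter)
    for speaker, text in speeches:
        counts_by_speaker[speaker].update(tokenize(text))

    total_tokens = sum(sum(c.values()) for c in counts_by_speaker.values())
    stats = {}
    for speaker, counts in counts_by_speaker.items():
        tokens = sum(counts.values())
        stats[speaker] = {
            "tokens": tokens,          # every word occurrence
            "types": len(counts),      # distinct word forms
            "share": 100 * tokens / total_tokens if total_tokens else 0.0,
        }
    return stats


if __name__ == "__main__":
    for speaker, row in character_stats(SPEECHES).items():
        print(f"{speaker}: {row['tokens']} tokens, {row['types']} types, "
              f"{row['share']:.1f}% of dialogue")
```

The resulting dictionary could be passed to any plotting library for the bar and pie charts. Spelling variation is not normalized here, so variant spellings of one word would count as separate types.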

I've not checked whether it's already possible, but being able to sort the statistics by column would also be neat.

In the longer term, a good project for an RA might be to explore generating doubling charts for early performances. Depending on the company attribution, this would require additional metadata for characters (i.e., gender and age) to distinguish between boy and adult actors in the case of adult companies (plays for boys' companies won't require this distinction). The chart should capture presence/absence of characters in each scene, whether they speak or not. The user should be able to assign the number of players for distribution of roles.
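
A rough Python sketch of the doubling logic, assuming the presence data is available as a mapping from scene to the set of characters on stage. The scene data and all function names are hypothetical. The greedy assignment is a simple stand-in, not an optimal casting algorithm, and it ignores the boy/adult distinction and time for costume changes.

```python
# Hypothetical scene data: scene label -> characters on stage (speaking or not).
SCENES = {
    "1.1": {"King", "Queen", "Fool"},
    "1.2": {"Fool", "Messenger"},
    "2.1": {"King", "Ghost"},
}


def presence_chart(scenes):
    """Return (sorted character list, {scene: [present? per character]})."""
    characters = sorted(set().union(*scenes.values()))
    chart = {
        scene: [name in onstage for name in characters]
        for scene, onstage in scenes.items()
    }
    return characters, chart


def can_double(scenes, a, b):
    """Two roles can share a player only if they never share a scene."""
    return not any(a in onstage and b in onstage for onstage in scenes.values())


def minimum_players(scenes):
    """Lower bound on cast size: the fullest stage in any one scene."""
    return max((len(onstage) for onstage in scenes.values()), default=0)


def assign_roles(scenes, n_players):
    """Greedily give each role to the first compatible player.

    Returns a list of role lists, or None if this greedy pass cannot fit
    the roles into n_players (a cleverer search might still succeed).
    """
    characters, _ = presence_chart(scenes)
    players = []
    for role in characters:
        for roles in players:
            if all(can_double(scenes, role, other) for other in roles):
                roles.append(role)
                break
        else:
            if len(players) == n_players:
                return None
            players.append([role])
    return players


if __name__ == "__main__":
    print(minimum_players(SCENES))
    print(assign_roles(SCENES, 3))
```

The user-supplied number of players maps onto the `n_players` argument; a fuller version would also need the gender and age metadata to keep boy and adult roles apart.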

Brett

martindholmes commented 3 months ago

@JackWilton1594 Could you define word-types? LEMDO doesn't have natural language parsing of any kind built into it right now, and the only way I know to determine word types is by using NLP tools. If the intention is to categorize words as nouns, verbs, adjectives, etc., then either the editors will have to tag them as such, or every editor will need to install a suite of NLP tools to accomplish this. We would also probably have to customize the NLP tools considerably to take account of poetic language and EME forms. Since this is not likely to be something that many individual editors want, I wonder if it would be better to generate a version of the text optimized for NLP, which any editor who wants to could then feed into the tools of their choice. For example, we could provide, for each individual character, a single text file containing all of that character's speech, and then you could use whatever NLP suite you're most comfortable with to generate the stats you need.
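
As a rough illustration of the per-character export, here is a Python sketch using only the standard library. It assumes speeches are encoded as TEI <sp> elements with a single @who pointer, and that <speaker> and <stage> content should be left out; the sample document and function names are hypothetical and do not reflect how LEMDO actually builds its outputs.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical, heavily simplified TEI sample.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="scene">
<sp who="#king"><speaker>King</speaker><l>The crown is mine.</l></sp>
<sp who="#fool"><speaker>Fool</speaker><p>Mine is the jest.</p></sp>
<sp who="#king"><speaker>King</speaker><l>Peace, <stage>aside</stage> fool.</l></sp>
</div></body></text></TEI>"""

# Elements whose text is not spoken dialogue (an assumption).
SKIP = {"speaker", "stage"}


def _collect(node, parts):
    """Append the node's spoken text, then the text that follows the node."""
    local_name = node.tag.rsplit("}", 1)[-1]
    if local_name not in SKIP:
        parts.append(node.text or "")
        for child in node:
            _collect(child, parts)
    parts.append(node.tail or "")


def spoken_text(sp):
    """Return the dialogue inside one <sp>, with whitespace normalized."""
    parts = [sp.text or ""]
    for child in sp:
        _collect(child, parts)
    return " ".join("".join(parts).split())


def speech_by_character(root):
    """Map each @who id to all of that character's speeches, one per line."""
    speeches = defaultdict(list)
    for sp in root.iter(TEI + "sp"):
        who = sp.get("who", "").lstrip("#") or "unknown"
        speeches[who].append(spoken_text(sp))
    return {who: "\n".join(lines) for who, lines in speeches.items()}


if __name__ == "__main__":
    for who, text in speech_by_character(ET.fromstring(SAMPLE)).items():
        print(f"--- {who} ---\n{text}")
```

Each value could be written to its own plain-text file and fed to whichever NLP suite an editor prefers. Speeches shared between several speakers (multiple pointers in @who) would need extra handling.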

The other stuff seems practical, provided that an editor adds some additional encoding (presence/absence of any given character in any given scene, for instance). One approach to this might be to use TEI's declarable elements mechanism:

https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html#CCAS

where any element such as a scene <div> could use its @decls attribute to point to the roles currently onstage. A problem there is that although <listPerson> is a member of att.declarable, <person> is not, so we couldn't (at present) use this mechanism to point to <person> elements in the role list in the header, which would be ideal. If this seems like a good strategy, we could raise an issue on the TEI repo to add <person> to att.declarable, and in the meantime make that a LEMDO customization.
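
To show what that might look like in practice, here is a small Python sketch that reads @decls on scene <div> elements and resolves the pointers against <person> ids in the header. The sample encoding is hypothetical: it presumes the customization described above (pointing @decls at <person>), and the scene content is omitted for brevity, so it is not a valid TEI document as it stands.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical encoding; relies on <person> being made declarable.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><profileDesc><particDesc><listPerson>
<person xml:id="king"/><person xml:id="fool"/><person xml:id="ghost"/>
</listPerson></particDesc></profileDesc></teiHeader>
<text><body>
<div type="scene" n="1.1" decls="#king #fool"/>
<div type="scene" n="1.2" decls="#king #ghost"/>
</body></text></TEI>"""


def onstage_by_scene(root):
    """Map each scene's @n to the person ids its @decls points to."""
    known = {person.get(XML_ID) for person in root.iter(TEI + "person")}
    result = {}
    for div in root.iter(TEI + "div"):
        if div.get("type") != "scene":
            continue
        ids = [pointer.lstrip("#") for pointer in div.get("decls", "").split()]
        missing = [i for i in ids if i not in known]
        if missing:
            raise ValueError(f"scene {div.get('n')}: unknown person ids {missing}")
        result[div.get("n")] = ids
    return result


if __name__ == "__main__":
    print(onstage_by_scene(ET.fromstring(SAMPLE)))
```

Raising an error on unknown ids is one design choice; a build process might prefer to log a warning instead.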

martindholmes commented 3 months ago

As of rev 17907, I've added the expansion of contractions, as well as counts of distinct terms in addition to all tokens. The stats table is now sortable.

Text-level statistics: wishlist #224