syvwlch / Data-Ignota

A data-driven exploration of Ada Palmer's Terra Ignota series
https://syvwlch.github.io/Data-Ignota/
MIT License

[Feature] Add Scenes to Chapters #23

Closed syvwlch closed 2 years ago

syvwlch commented 2 years ago

Is your feature request related to a problem? Please describe.

Provide additional context about the location and the characters present.

Describe the solution you'd like

Add a new type of milestone node in the digital edition which marks a change in location and/or characters present.

Need to do some research on the best way to encode this using TEI.

Describe alternatives you've considered

Add metadata to the chapter <div>, but the context changes within a chapter, so chapter-level metadata is too coarse.

Additional context

This is similar to how a stage play is broken up.

syvwlch commented 2 years ago

Note: add the unique ID of the scene containing the line of dialog to said.csv, to allow joins.

syvwlch commented 2 years ago

Doing some research on how to encode the scenes within TEI schema.

The canonical way is to use the <milestone type="scene"/> empty node to flag the beginning of each scene, similar to page-breaks.

However, you can't just add attributes for location and persons present to the milestone node. If you want to add that kind of metadata, it looks like you have to move it off to <standOff>, as with <person> or <org>, and point to it with a ref attribute. In the case of scenes that is very inefficient, as there will always be a one-to-one relationship.

syvwlch commented 2 years ago

I have already used the ana attribute as a substitute ref attribute for nodes that don't support it, and since there is no ambiguity about whether a given xml:id points to a person or a place, I could just pack them in as a space-delimited list.

syvwlch commented 2 years ago

The corresp attribute can be used to point to other elements that correspond 'in some way' with the current element... perhaps I can use it to point to the location, and keep ana to point to the persons present... although I wanted to differentiate between physical and remote presence?

syvwlch commented 2 years ago

Perhaps we can dissociate location from presence by using different <milestone> types to encode them. It will require a little more parsing on the analysis side... look for latest milestone of each type for each row, and slap value in a separate column.

That seems pretty clean, and if we decide to add something else to track this way, it can be added independently.
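A minimal sketch of that "latest milestone of each type" parsing, in Python rather than the XQuery used for the real exports. The milestone type names, the corresp/ana usage, and all of the ids in the sample are assumptions made up for illustration, and TEI namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: type names, attribute choices, and ids are
# illustrative assumptions, not the markup actually used in the edition.
SAMPLE = """
<div>
  <milestone unit="scene" type="location" corresp="#place_1"/>
  <milestone unit="scene" type="present" ana="#person_a #person_b"/>
  <said who="#person_a">First line.</said>
  <milestone unit="scene" type="present" ana="#person_a #person_b #person_c"/>
  <said who="#person_c">Second line.</said>
</div>
"""


def rows_with_context(xml_text):
    """Return one row per <said>, with the latest milestone value of each type."""
    root = ET.fromstring(xml_text)
    latest = {}  # milestone type -> most recent value seen so far
    rows = []
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag == "milestone":
            # Take the pointer list from whichever attribute carries it.
            latest[el.get("type")] = el.get("corresp") or el.get("ana") or ""
        elif el.tag == "said":
            row = {"who": el.get("who"), "text": (el.text or "").strip()}
            row.update(latest)  # one extra column per milestone type
            rows.append(row)
    return rows


rows = rows_with_context(SAMPLE)
for row in rows:
    print(row)
```

Because each milestone type is tracked under its own key, a new type added later would show up as one more column without touching the existing ones.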

syvwlch commented 2 years ago

Let's try:

syvwlch commented 2 years ago
cdrigby commented 2 years ago

Seems like a reasonable framework. Do you have a unit="scene" type="depart" for someone leaving the conversation? Or perhaps that is anticipated.

[edit] Now that I think about it, it's probably not necessary. You're analyzing things based on words/conversation. Whether someone leaves or enters mid-scene is irrelevant. The data is captured in just what they say.

syvwlch commented 2 years ago

No, it's a valid question since I want to track presence even for characters without a line of dialog.

The idea is to just list all present characters at each milestone. So when a character departs, the new milestone would list everyone there minus them.

This means that when retrieving data, I only ever have to look at the last milestone before the line of dialog to get the complete roster.
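Since every milestone carries the complete roster, arrivals and departures never need their own markup; they can be derived by comparing consecutive rosters. A small Python sketch of that idea, with made-up ids and a hypothetical helper name:

```python
def roster_changes(rosters):
    """For each roster, return (entered, departed) relative to the previous one."""
    changes = []
    previous = set()
    for roster in rosters:
        current = set(roster.split())  # rosters are space-delimited pointer lists
        changes.append((current - previous, previous - current))
        previous = current
    return changes


# Hypothetical rosters from three consecutive milestones.
rosters = ["#person_a #person_b", "#person_a #person_b #person_c", "#person_a #person_c"]
changes = roster_changes(rosters)
for entered, departed in changes:
    print("entered:", sorted(entered), "departed:", sorted(departed))
```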

cdrigby commented 2 years ago

So a new milestone for any roster change. Got it.

syvwlch commented 2 years ago

Applied to the first four chapters, no major issues. I'll do the same for the other eight that have been marked up so far, and then see what I can pull into the public repo using this new metadata.

syvwlch commented 2 years ago

Applied through chapter 12, no major issues during application, and it mostly seems to make sense. Had to add a few more fine-grained locations to the standOff metadata section.

syvwlch commented 2 years ago

Next step is to add these to the data retrieved via XQuery and available here.

syvwlch commented 2 years ago

I have a first attempt at a scene output system sitting in the private repo. The said.xml output file has an extra column with a unique numeric id for each scene change, and there is a separate scenes.xml output file that lists all of them, with the same id, and groups the location, the present characters, and the remotely present characters for each. I included the page for convenience, even though it is redundant.

I separated scenes to their own file because otherwise they'd get repeated for each line of dialog, which seems excessive, especially since present and remote characters are space-delimited lists liable to contain many values.

The main ugliness left is that some 'scenes' don't actually have content, e.g. when both location and present characters change at the same time, that increments the scene counter twice.
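One possible way around the double increment, sketched in Python as an illustration and not what the XQuery script currently does: only advance the counter when a line of dialog follows one or more milestones, so back-to-back milestones collapse into a single scene. The event tuples and the function name are hypothetical.

```python
def number_scenes(events):
    """Assign a scene id to each line of dialog, ignoring content-free scenes.

    events is a list of (kind, payload) tuples in document order, where kind
    is "milestone" or "said" (a hypothetical flattened view of the text).
    """
    scene_id = 0
    pending = False  # True once a milestone has been seen since the last line
    numbered = []
    for kind, payload in events:
        if kind == "milestone":
            pending = True  # several milestones in a row still mean one change
        elif kind == "said":
            if pending:
                scene_id += 1
                pending = False
            numbered.append((scene_id, payload))
    return numbered


events = [
    ("milestone", "location"),
    ("milestone", "present"),  # same spot in the text as the one above
    ("said", "line 1"),
    ("said", "line 2"),
    ("milestone", "present"),
    ("said", "line 3"),
]
numbered = number_scenes(events)
print(numbered)
```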

A smaller ugliness left is that the first couple scenes in the output file have empty present and/or remote nodes since they don't have a predecessor <milestone> node of the right type. I'm not sure if that will break ingestion in R or if those get populated with NA gracefully. Will just need to try.

syvwlch commented 2 years ago

Note that if empty nodes in the XML output file get ingested as NAs gracefully, I can remove a bunch of if-then statements in all of my XQuery queries, which would be rather nice.

syvwlch commented 2 years ago

The XML package does not crash when it finds an empty node in the file it is reading in. It sets the value in the column to an empty string in that case. So there is no need for if-then in the XQuery script for those cases, and the empty strings can be forced to NA during data cleanup in R.

syvwlch commented 2 years ago

Alright, done. The main ugliness is left... but unsure if I want to fix it. I could use additional <div>s to wrap the scenes, but this seems extreme as they are not present in the text.

Also, I think that unlike with the lines of dialog, I do want a single unique scene identifier (that is, I do not want to number the scenes within a chapter, but globally across the series of books) so that joins are easier. For the same reason, I did not include the page the scene starts on, as it would be redundant during joins.
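To illustrate why a single global id keeps joins simple, here is a rough Python sketch of joining dialog rows to scene rows on one key. The real joins happen in R; the column names, ids, and values below are assumptions for the example.

```python
# Hypothetical rows standing in for the said and scenes exports.
said = [
    {"scene": 1, "who": "#person_a"},
    {"scene": 1, "who": "#person_b"},
    {"scene": 2, "who": "#person_a"},
]
scenes = [
    {"scene": 1, "location": "#place_1", "present": "#person_a #person_b"},
    {"scene": 2, "location": "#place_2", "present": "#person_a"},
]


def join_on_scene(said_rows, scene_rows):
    """Left-join dialog rows to scene rows on the global scene id."""
    by_id = {s["scene"]: s for s in scene_rows}  # one lookup key, no chapter needed
    return [{**row, **by_id.get(row["scene"], {})} for row in said_rows]


joined = join_on_scene(said, scenes)
for row in joined:
    print(row)
```

With per-chapter numbering the key would have to be a (book, chapter, scene) triple instead of a single column.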

cdrigby commented 2 years ago

So to fully separate scenes from rosters perhaps you would also need a separate "roster" element? If you don't need it then it seems fine to just increment the scene counter by two, understanding that a change of roster is a change of scene.

syvwlch commented 2 years ago

The real alternative is to use a single <milestone unit="scene"/> element to denote a scene change (location or characters) in the text, with a ref attribute pointing to something in the standOff metadata section outside the text. I could then have proper lists of persons, dates, etc. there. I decided not to do it that way at first, but I am considering making the switch.

syvwlch commented 2 years ago

It will make for cleaner data exports and will support whatever metadata I want to add inside of standOff, which makes it more flexible and future-proof... so very tempted. It just means some refactoring of both the Digital Edition and the scene.csv logic.

cdrigby commented 2 years ago

In terms of effort, can you do it with a portion of the text to develop a sense of how useful/interesting the external standOff metadata would be? What metadata would you propose to use, other than the scene (location) vs roster consideration?

syvwlch commented 2 years ago

I've only edited 14 chapters so far, and since the scenes for those are done, it really shouldn't be too bad. I'll make a branch in git to isolate it, of course, in case I decide it's a dumb idea. As for extra metadata, things like date or presence of objects come to mind? Dates currently span entire chapters but later in this series this is no longer true.

cdrigby commented 2 years ago

Ah, dates in particular seem significant.

Sounds like it's off to work at the refactory!

syvwlch commented 2 years ago

Ok, I think I have a working attempt at a new system.

In the text, a scene change is marked by a single <milestone> element with a unique xml:id. A list item in the <standOff> section points to that xml:id and contains whatever <date>, <location>, or list of pointers to the characters present applies to that scene.
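A rough Python sketch of how that layout could be read back. The element and attribute choices inside <standOff> (item with corresp, date with when, ptr with target), the ids, and the placeholder date are all assumptions for illustration, not the actual encoding, and TEI namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree expands xml:id to this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: structure and values are illustrative assumptions.
SAMPLE = """
<TEI>
  <standOff>
    <list type="scenes">
      <item corresp="#scene_1">
        <date when="2000-01-01"/>
        <location corresp="#place_1"/>
        <ptr target="#person_a"/>
        <ptr target="#person_b"/>
      </item>
    </list>
  </standOff>
  <text>
    <milestone unit="scene" xml:id="scene_1"/>
    <said who="#person_a">A line of dialog.</said>
  </text>
</TEI>
"""


def scene_index(root):
    """Map each scene id to the metadata held by its standOff list item."""
    index = {}
    for item in root.iter("item"):
        scene_id = item.get("corresp", "").lstrip("#")
        date = item.find("date")
        location = item.find("location")
        index[scene_id] = {
            "date": date.get("when") if date is not None else None,
            "location": location.get("corresp") if location is not None else None,
            "present": [p.get("target") for p in item.findall("ptr")],
        }
    return index


def said_with_scene(root):
    """Attach the metadata of the most recent scene milestone to each <said>."""
    index = scene_index(root)
    current = None
    rows = []
    for el in root.find("text").iter():  # document order within the text
        if el.tag == "milestone":
            current = el.get(XML_ID)
        elif el.tag == "said":
            rows.append({"who": el.get("who"), "scene": current, **index.get(current, {})})
    return rows


rows = said_with_scene(ET.fromstring(SAMPLE))
print(rows)
```

Any extra metadata added to the list item later only needs one more entry in the index dictionary; the milestone in the text stays untouched.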