srophe / caesarea-data

Data repository for Caesarea-Maritima.org

Normalize current, temporary encoding of related-subjects #108

Closed wlpotter closed 2 years ago

wlpotter commented 3 years ago

Currently the related-subjects entries are supposed to be in /TEI/text/body/note/p elements, but they are somewhat messy and not all of them are in <p> elements.

Eventually (see #101) we will encode them in a more explicit fashion than notes and paragraphs. But before that it would be a good idea to clean up and normalize this part of our data.

This is somewhat of an extension of #103 and should wait for that issue to be closed before proceeding.

wlpotter commented 2 years ago

FYI, the difference between this and #101 is that here we are just trying to get the data to be in a uniform format. #101 is for the editorial and technical decisions of how to encode them for release (we will use a script to convert the normalized data to that format). In other words, this issue has priority since we need the data to be uniform before transforming it into its final encoding.

wlpotter commented 2 years ago

I have identified the ways the entries deviate from the desired note/p/text() structure. There are only a few cases that we can't adjust programmatically.

Otherwise, I can write a script that creates a sequence of all the related-subjects entries and puts them in the correct structure, /note/p/text(). It can also normalize white space and order them alphabetically. This should allow us to adjust the way of encoding this whenever we're ready (#101).
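
For illustration only, here is a minimal Python sketch of the kind of normalization described above. It is not the script mentioned in this thread; the language and the function names (collapse, collect_entries, normalize_note) are hypothetical. It also assumes, without confirmation from the data, that a stray entry is either bare text directly inside the note (one entry per line) or an element child holding a single entry, and it leaves it to the caller to locate the right note element.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the TEI namespace


def collapse(text):
    """Collapse runs of white space and trim both ends."""
    return " ".join(text.split())


def collect_entries(note):
    """Return every entry found in the note, cleaned and sorted.

    Assumption: bare text under <note> holds one entry per line, and each
    child element (normally <p>) holds exactly one entry.
    """
    entries = []
    # Text sitting directly in <note>, before or between child elements.
    bare_chunks = [note.text or ""] + [child.tail or "" for child in note]
    for chunk in bare_chunks:
        entries.extend(collapse(line) for line in chunk.splitlines())
    # Text inside each child element, including any nested markup.
    for child in note:
        entries.append(collapse("".join(child.itertext())))
    # Drop empty strings and sort alphabetically, ignoring case.
    return sorted((e for e in entries if e), key=str.casefold)


def normalize_note(note):
    """Rewrite the note in place as a flat sequence of <p> text entries."""
    entries = collect_entries(note)
    for child in list(note):
        note.remove(child)
    note.text = None
    for entry in entries:
        p = ET.SubElement(note, f"{{{TEI_NS}}}p")
        p.text = entry
    return note
```

Calling normalize_note on a note element parsed with ET.parse rewrites it in place. Nested markup inside an entry is flattened to plain text, so the few cases that cannot be adjusted programmatically would still need to be handled by hand before running anything like this.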

wlpotter commented 2 years ago

I have this script ready to go and will create these changes in a separate branch for us to review. (Having trouble figuring out the whitespace and writeback issues