sasansom / sedes

Metrical position in Greek hexameter.

Update TEI parser to handle Hellenistic and Imperial texts #9

Closed: whoopsedesy closed this issue 4 years ago

whoopsedesy commented 4 years ago

#8 added new texts, but their format differs slightly from that of the existing ones, and tei2csv cannot handle them yet.

As an example, some of the new texts use lowercase <div type="book" ...> instead of the uppercase <div type="Book" ...> that tei2csv currently conservatively checks for.

$ make corpus/aratus.csv
src/tei2csv "Aratus" "corpus/aratus.xml" > "corpus/aratus.csv"
Traceback (most recent call last):
  File "src/tei2csv", line 80, in <module>
    process(f)
  File "src/tei2csv", line 45, in process
    for line in doc.lines():
  File "src/tei.py", line 189, in lines
    for x in do_elem(self.soup.find("text").body, Environment()):
  File "src/tei.py", line 168, in do_elem
    assert elem.get("type") in ("Book", "Hymn")
AssertionError
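
To make the failing check concrete, here is a minimal sketch of a case-insensitive test for the div type. The names KNOWN_DIV_TYPES and is_known_div_type are hypothetical and are not taken from src/tei.py; the sketch only illustrates accepting both "Book" and "book" while still rejecting unknown types.

```python
# Hypothetical helper, not the actual code in src/tei.py.
# The set of accepted types mirrors the assertion in the traceback above.
KNOWN_DIV_TYPES = {"book", "hymn"}

def is_known_div_type(div_type):
    """Return True if div_type names a known division, ignoring case."""
    if div_type is None:
        # A <div> without a type attribute is still rejected.
        return False
    return div_type.casefold() in KNOWN_DIV_TYPES
```

A caller could then write `assert is_known_div_type(elem.get("type"))` in place of the exact-match tuple test, which keeps the check conservative without depending on capitalization.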
whoopsedesy commented 4 years ago

Some of the new texts use a peculiar betacode encoding with diacriticals that precede the letter they apply to. (Diacriticals normally follow the letter they apply to, except in the case of capital letters marked with a *, in which case they may precede the letter, but the * must come first in the sequence, as I understand it.) I want to check whether I'm interpreting these correctly or whether they represent transcription errors. These are not all the problematic sequences, just a few representative ones.

Argon. 2.232 =*)a dei/l': I want to interpret this as Α͂̓ δείλ’, i.e. capital alpha with circumflex and smooth breathing. Perseus Hopper does something weird here, leaving the circumflex hanging in the breeze. Note that the initial character is a single left quote that opens a direct quotation.

Argon. 4.757 =*)iri is another instance, again following a left single quote.

Callim. Hymn. 3.1 (ou): I want this to be ὁὐ.

Dion. 1.129 (a)mei/lixe, fei/deo kou/rhs.) Are the outer ( ) meant to be actual parentheses and not rough/smooth breathing marks? The ) follows not a letter but a dot.

Am I on the right track in wanting to apply diacriticals to the letter that follows in these cases, or do they look like more textual errors?

whoopsedesy commented 4 years ago

With #11 fixed, aratus.xml is working. The others are hung up on the beta code issues mentioned in the previous comment.

sasansom commented 4 years ago

These are all errors.

For Argon. 2.232 (which I think should be 2.244): in most other editions of these texts, the diacritics visually precede a capital letter, e.g. Argon. 2.757 in Diogenes (ed. Fraenkel 1961/1970; NB the soft breathing is nestled in the circumflex):

[Image: Screen Shot 2020-03-17 at 7 25 16 PM]

The beta code may be trying to achieve this by preposing the diacritics in the code. The mistake at Argon. 2.244 is that the capital symbol should be first, followed by the breathing and then the accent. Cf. the correctly encoded "*)=wfilai" (Ὦφιλαι Argon. 1.657).

The same goes for Argon. 4.757, which should be *)=iri.
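
The ordering rule described here (capital marker first, then breathing, then accent) could be applied programmatically along the following lines. This is only a sketch under assumptions: the function name normalize_capital is hypothetical, it treats ( and ) as breathings, / \ = as accents, and + as diaeresis, and it assumes that any such marks directly adjacent to a * belong to the capital letter that follows.

```python
import re

# Diacritics on either side of '*', followed by the capital's base letter.
_CAPITAL = re.compile(r"([()/\\=+]*)\*([()/\\=+]*)([a-zA-Z])")

def _rank(mark):
    """Sort key: breathings first, then accents, then anything else."""
    if mark in "()":
        return 0
    if mark in "/\\=":
        return 1
    return 2

def normalize_capital(text):
    """Rewrite sequences such as '=*)a' into the order '*)=a'."""
    def fix(match):
        # Collect marks found before and after '*', then reorder them.
        marks = sorted(match.group(1) + match.group(2), key=_rank)
        return "*" + "".join(marks) + match.group(3)
    return _CAPITAL.sub(fix, text)

print(normalize_capital("=*)a dei/l'"))
print(normalize_capital("=*)iri"))
```

Already well-formed input such as *)=wfilai passes through unchanged. The sketch would misfire if a genuine closing parenthesis ever sat directly before a *, so its output would still want a human check.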

For Callim. Hymn. 3.1, the correct code is "ou)"; the problem is that the beta code is trying to begin a parenthetical statement (i.e. open a parenthesis) before the word. Cf. Diogenes ad loc.:

[Image: Screen Shot 2020-03-17 at 7 46 21 PM]

The mistake continues when it attempts to close the parenthetical remark at the end of the line: laqe/sqai). I think the correct code for parenthesis is "[1" and "]1".

Dion. 1.129 is trying to open and close a quotation embedded within direct speech (i.e. a quotation within a quotation). It appears to be using the diacritic as a single quotation mark.

whoopsedesy commented 4 years ago

It's disheartening to see these errors in the source. It would not be so bad if not for the fact that we cannot detect all of them automatically—only the ones where they happen to trip up some other checking rule. We could attempt to infer, for example, that ( at the beginning of a word is always an open parenthesis and not rough breathing (already a questionable inference given the preposed diacriticals you fixed in b058312), but no such rule is possible for ). Take this for example:

(alfa beta) gamma)

There's no simple way (short of having a human expert look at it) to know whether it is supposed to represent (αλφα βετἀ γαμμα) or (αλφα βετα) γαμμἀ.
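
To illustrate what can and cannot be detected, here is a rough sketch that flags a ( or ) only when it does not directly follow a letter, a *, or another diacritic. The function name suspicious_breathings is hypothetical, and the character classes are my assumption about what may legitimately precede a breathing mark. It is not part of the existing parser.

```python
import re

# A breathing mark is "suspicious" when the character before it is not a
# beta code letter, '*', or another diacritic (or when it starts the line).
_SUSPECT = re.compile(r"(?<![A-Za-z*()/\\=+|])[()]")

def suspicious_breathings(line):
    """Return the string offsets of ( or ) that cannot be breathing marks."""
    return [m.start() for m in _SUSPECT.finditer(line)]

# Dion. 1.129: flags the opening '(' and the ')' that follows the period.
print(suspicious_breathings("(a)mei/lixe, fei/deo kou/rhs.)"))
# The ambiguous example: only the opening '(' is flagged.
print(suspicious_breathings("(alfa beta) gamma)"))
```

On the ambiguous example the check reports only the opening (, because both ) characters follow a letter. That is exactly the limitation described above: the unmatched close still needs a human reader.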

Anyway, here's the current complete list of lines that cause beta code errors: https://gist.github.com/whoopsedesy/164df709a7d7d3d782ea2dd6f89fa389 It's pretty big: 1 in Argon., 4 in Callim. Hymns, 34 in Dion., 10 in Quint. Smyrn., 312 in Theoc. There may be some that you can easily fix manually, but for the bulk of them it may be best to identify some common patterns that we can replace programmatically, either in the source text or in the beta code parser.

sasansom commented 4 years ago

Agreed. A lot of the errors in Dion. arise from an attempt to encode quotations within quotations. It seems the encoder used parentheses to make single quotation marks that indicate direct speech within direct speech, e.g. Dion. 1.129. I wasn't sure how to properly encode this; the TLG Beta Code Manual (http://stephanus.tlg.uci.edu/encoding/BCM.pdf) seems to suggest code that Perseus Beta Code does not use (e.g. the bigram "3 encodes a single left quotation). It appears I have two options: 1) use , which I think would confuse things since it is within a quote; 2) replace "(" and ")" with an apostrophe (cf. Dion. 48.559 h)io/nes na/coio, boh/sate: 'numfi/e qhseu=,). I went with option 2. Ideally there would be specific code for direct quotations embedded within direct quotations; such information would be valuable for other research interests, especially in narratology.

Most if not all of the errors in Theocritus were incorrect sequences of capitalization, breathing, and accent; apparently it was encoded with a different notion of the correct order.

whoopsedesy commented 4 years ago

> Ideally there would be specific code for direct quotations embedded within direct quotations

Quotations don't have to be represented with beta code only. TEI has a q element for marking up quotations. It's already used in many places, including in Dion. Oh, but I see you have already marked it up with q, so I suppose we're all good.
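
As a rough illustration of how nested q markup could serve that narratological use, the sketch below reports the nesting depth of each q element. The function name quote_depths is hypothetical, and it assumes un-namespaced element names and uses the standard library's ElementTree rather than whatever parser the project uses. The sample fragment is invented for the example.

```python
import xml.etree.ElementTree as ET

def quote_depths(xml_text):
    """Return (depth, text) for every <q>, where depth 2 means a quote in a quote."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(elem, depth):
        if elem.tag == "q":
            depth += 1
            # itertext() gathers the text of the element and its descendants.
            found.append((depth, "".join(elem.itertext())))
        for child in elem:
            walk(child, depth)

    walk(root, 0)
    return found

# Invented sample: an outer speech containing an embedded quotation.
sample = "<l>a <q>b <q>c</q> d</q></l>"
print(quote_depths(sample))
```

Any entry with depth 2 or more marks direct speech embedded within direct speech, with no special beta code needed.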

whoopsedesy commented 4 years ago

Every file can be processed now. In eb9c9d8 I added the new texts to Makefile and make.sh.

Here are the line number warnings from processing the new files. For whatever reason a lot of lines are out of order in the XML file compared to their stated numbers. (It's not just the files that are new in this ticket; the ones in the older corpus have it as well.)

src/tei2csv "Argon." "corpus/argonautica.xml" > "corpus/argonautica.csv"
warning: after line '2.383', expected '2.384', got '2.382'
warning: after line '3.738', expected '3.739', got '3.740'
warning: after line '4.543', expected '4.544', got '4.546'

src/tei2csv "Dion." "corpus/nonnusdionysiaca.xml" > "corpus/nonnusdionysiaca.csv"
warning: after line '4.364', expected '4.365', got '4.377'
warning: after line '4.388', expected '4.389', got '4.365'
warning: after line '4.376', expected '4.377', got '4.389'
warning: after line '5.117', expected '5.118', got '5.115a'
warning: after line '5.115a', expected '5.116', got '5.118'
warning: after line '5.248', expected '5.249', got '5.255'
warning: after line '5.255', expected '5.256', got '5.249'
warning: after line '5.254', expected '5.255', got '5.256'
warning: after line '5.429', expected '5.430', got '5.431'
warning: after line '5.431', expected '5.432', got '5.430'
warning: after line '5.430', expected '5.431', got '5.432'
warning: after line '6.162', expected '6.163', got '6.164'
warning: after line '6.164', expected '6.165', got '6.163'
warning: after line '6.163', expected '6.164', got '6.165'
warning: after line '6.275', expected '6.276', got '6.277'
warning: after line '6.277', expected '6.278', got '6.276'
warning: after line '6.276', expected '6.277', got '6.278'
warning: after line '7.174', expected '7.175', got '7.180'
warning: after line '7.189', expected '7.190', got '7.175'
warning: after line '7.179', expected '7.180', got '7.190'
warning: after line '9.132', expected '9.133', got '9.132'
warning: after line '11.442', expected '11.443', got '11.446'
warning: after line '11.477', expected '11.478', got '11.443'
warning: after line '11.445', expected '11.446', got '11.478'
warning: after line '14.239', expected '14.240', got '14.246'
warning: after line '14.246', expected '14.247', got '14.240'
warning: after line '14.245', expected '14.246', got '14.247'
warning: after line '14.350', expected '14.351', got '14.352'
warning: after line '14.352', expected '14.353', got '14.351'
warning: after line '14.351', expected '14.352', got '14.353'
warning: after line '17.19', expected '17.20', got '17.21'
warning: after line '17.21', expected '17.22', got '17.20'
warning: after line '17.20', expected '17.21', got '17.22'
warning: after line '17.72', expected '17.73', got '17.74'
warning: after line '17.374', expected '17.375', got '17.377'
warning: after line '17.377', expected '17.378', got '17.375'
warning: after line '17.376', expected '17.377', got '17.378'
warning: after line '18.66', expected '18.67', got '18.68'
warning: after line '18.68', expected '18.69', got '18.67'
warning: after line '18.67', expected '18.68', got '18.69'
warning: after line '18.222', expected '18.223', got '18.229'
warning: after line '18.229', expected '18.230', got '18.224'
warning: after line '18.224', expected '18.225', got '18.223'
warning: after line '18.223', expected '18.224', got '18.225'
warning: after line '18.225', expected '18.226', got '18.227'
warning: after line '18.228', expected '18.229', got '18.226'
warning: after line '18.226', expected '18.227', got '18.230'
warning: after line '21.73', expected '21.74', got '21.74-75'
warning: after line '21.74-75', expected '21.75', got '21.75-74'
warning: after line '21.221', expected '21.222', got '21.227'
warning: after line '21.247', expected '21.248', got '21.222'
warning: after line '21.226', expected '21.227', got '21.248'
warning: after line '22.4', expected '22.5', got '22.6'
warning: after line '22.6', expected '22.7', got '22.5'
warning: after line '22.5', expected '22.6', got '22.7'
warning: after line '22.118', expected '22.119', got '22.120'
warning: after line '23.132', expected '23.133', got '23.134'
warning: after line '23.134', expected '23.135', got '23.133'
warning: after line '23.133', expected '23.134', got '23.135'
warning: after line '25.304', expected '25.305', got '25.307'
warning: after line '25.308', expected '25.309', got '25.305'
warning: after line '25.306', expected '25.307', got '25.309'
warning: after line '25.353', expected '25.354', got '25.355'
warning: after line '25.528', expected '25.529', got '25.528'
warning: after line '26.56', expected '26.57', got '26.59'
warning: after line '26.59', expected '26.60', got '26.57'
warning: after line '26.58', expected '26.59', got '26.60'
warning: after line '26.206', expected '26.207', got '26.212'
warning: after line '26.214', expected '26.215', got '26.207'
warning: after line '26.211', expected '26.212', got '26.215'
warning: after line '27.152', expected '27.153', got '27.157'
warning: after line '27.158', expected '27.159', got '27.153'
warning: after line '27.156', expected '27.157', got '27.161'
warning: after line '27.161', expected '27.162', got '27.159'
warning: after line '27.160', expected '27.161', got '27.162'
warning: after line '27.227', expected '27.228', got '27.231'
warning: after line '27.236', expected '27.237', got '27.228'
warning: after line '27.230', expected '27.231', got '27.237'
warning: after line '28.59', expected '28.60', got '28.61'
warning: after line '28.62', expected '28.63', got '28.60'
warning: after line '28.60', expected '28.61', got '28.63'
warning: after line '28.84', expected '28.85', got '28.83'
warning: after line '28.87', expected '28.88', got '28.90'
warning: after line '28.94', expected '28.95', got '28.93'
warning: after line '28.97', expected '28.98', got '28.100'
warning: after line '28.250', expected '28.251', got '28.257'
warning: after line '28.259', expected '28.260', got '28.360'
warning: after line '28.364', expected '28.365', got '28.265'
warning: after line '28.277', expected '28.278', got '28.251'
warning: after line '28.256', expected '28.257', got '28.278'
warning: after line '28.302', expected '28.303', got '28.309'
warning: after line '28.318', expected '28.319', got '28.322'
warning: after line '28.323', expected '28.324', got '28.303'
warning: after line '28.308', expected '28.309', got '28.319'
warning: after line '28.321', expected '28.322', got '28.324'
warning: after line '29.157', expected '29.158', got '29.160'
warning: after line '29.160', expected '29.161', got '29.158'
warning: after line '29.159', expected '29.160', got '29.161'
warning: after line '29.204', expected '29.205', got '29.211'
warning: after line '29.212', expected '29.213', got '29.205'
warning: after line '29.210', expected '29.211', got '29.213'
warning: after line '29.271', expected '29.272', got '29.273'
warning: after line '29.273', expected '29.274', got '29.272'
warning: after line '29.272', expected '29.273', got '29.274'
warning: after line '30.6', expected '30.7', got '30.8'
warning: after line '30.8', expected '30.9', got '30.7'
warning: after line '30.7', expected '30.8', got '30.9'
warning: after line '30.57', expected '30.58', got '30.60'
warning: after line '30.60', expected '30.61', got '30.58'
warning: after line '30.59', expected '30.60', got '30.61'
warning: after line '31.153', expected '31.154', got '31.155'
warning: after line '31.155', expected '31.156', got '31.154'
warning: after line '31.154', expected '31.155', got '31.156'
warning: after line '31.235', expected '31.236', got '31.238'
warning: after line '31.250', expected '31.251', got '31.252'
warning: after line '31.252', expected '31.253', got '31.251'
warning: after line '31.251', expected '31.252', got '31.253'
warning: after line '31.271', expected '31.272', got '31.273'
warning: after line '31.273', expected '31.274', got '31.272'
warning: after line '31.272', expected '31.273', got '31.274'
warning: after line '31.274', expected '31.275', got '31.236'
warning: after line '31.237', expected '31.238', got '31.275'
warning: after line '32.13', expected '32.14', got '32.16'
warning: after line '32.35', expected '32.36', got '32.14'
warning: after line '32.15', expected '32.16', got '32.36'
warning: after line '32.86', expected '32.87', got '32.88'
warning: after line '32.90', expected '32.91', got '32.87'
warning: after line '32.87', expected '32.88', got '32.90'
warning: after line '35.66', expected '35.67', got '35.68'
warning: after line '35.68', expected '35.69', got '35.67'
warning: after line '35.67', expected '35.68', got '35.69'
warning: after line '35.293', expected '35.294', got '35.295'
warning: after line '35.296', expected '35.297', got '35.294'
warning: after line '35.294', expected '35.295', got '35.297'
warning: after line '37.94', expected '37.95', got '37.96'
warning: after line '37.96', expected '37.97', got '37.95-97'
warning: after line '37.95-97', expected '37.96', got '37.97-95'
warning: after line '37.299', expected '37.300', got '37.303'
warning: after line '37.303', expected '37.304', got '37.300'
warning: after line '37.302', expected '37.303', got '37.304'
warning: after line '37.568', expected '37.569', got '37.572'
warning: after line '37.573', expected '37.574', got '37.575'
warning: after line '37.599', expected '37.600', got '37.601'
warning: after line '37.601', expected '37.602', got '37.568'
warning: after line '37.571', expected '37.572', got '37.602'
warning: after line '37.625', expected '37.626', got '37.625'
warning: after line '37.655', expected '37.656', got '37.658'
warning: after line '37.659', expected '37.660', got '37.656'
warning: after line '37.657', expected '37.658', got '37.660'
warning: after line '39.279', expected '39.280', got '39.281'
warning: after line '39.281', expected '39.282', got '39.280'
warning: after line '39.280', expected '39.281', got '39.284'
warning: after line '39.284', expected '39.285', got '39.282'
warning: after line '39.283', expected '39.284', got '39.285'
warning: after line '39.379', expected '39.380', got '39.381'
warning: after line '39.381', expected '39.382', got '39.380'
warning: after line '39.380', expected '39.381', got '39.382'
warning: after line '39.385', expected '39.386', got '39.385'
warning: after line '39.393', expected '39.394', got '39.395'
warning: after line '40.105', expected '40.106', got '40.108'
warning: after line '40.108', expected '40.109', got '40.106'
warning: after line '40.107', expected '40.108', got '40.109'
warning: after line '40.133', expected '40.134', got '40.135'
warning: after line '40.136', expected '40.137', got '40.134'
warning: after line '40.134', expected '40.135', got '40.137'
warning: after line '40.143', expected '40.144', got '40.145'
warning: after line '40.150', expected '40.151', got '40.154'
warning: after line '40.154', expected '40.155', got '40.151'
warning: after line '40.153', expected '40.154', got '40.155'
warning: after line '40.489', expected '40.490', got '40.492'
warning: after line '40.492', expected '40.493', got '40.490'
warning: after line '40.491', expected '40.492', got '40.493'
warning: after line '40.564', expected '40.565', got '40.566'
warning: after line '40.566', expected '40.567', got '40.565'
warning: after line '40.568', expected '40.569', got '40.570'
warning: after line '41.21', expected '41.22', got '41.50'
warning: after line '41.50', expected '41.51', got '41.22'
warning: after line '41.49', expected '41.50', got '41.51'
warning: after line '41.224', expected '41.225', got '41.228'
warning: after line '41.229', expected '41.230', got '41.225'
warning: after line '41.227', expected '41.228', got '41.230'
warning: after line '41.278', expected '41.279', got '41.280'
warning: after line '42.64', expected '42.65', got '42.71'
warning: after line '42.169', expected '42.170', got '42.171'
warning: after line '42.171', expected '42.172', got '42.170'
warning: after line '42.170', expected '42.171', got '42.172'
warning: after line '42.187', expected '42.188', got '42.189'
warning: after line '42.189', expected '42.190', got '42.188'
warning: after line '42.188', expected '42.189', got '42.190'
warning: after line '42.198', expected '42.199', got '42.200'
warning: after line '42.221', expected '42.222', got '42.224'
warning: after line '42.232', expected '42.233', got '42.222'
warning: after line '42.223', expected '42.224', got '42.233'
warning: after line '42.274', expected '42.275', got '42.65'
warning: after line '42.70', expected '42.71', got '42.275'
warning: after line '42.294', expected '42.295', got '42.301'
warning: after line '42.302', expected '42.303', got '42.295'
warning: after line '42.300', expected '42.301', got '42.303'
warning: after line '42.371', expected '42.372', got '42.374'
warning: after line '42.375', expected '42.376', got '42.372'
warning: after line '42.373', expected '42.374', got '42.376'
warning: after line '43.26', expected '43.27', got '43.28'
warning: after line '43.28', expected '43.29', got '43.27'
warning: after line '43.27', expected '43.28', got '43.29'
warning: after line '43.82', expected '43.83', got '43.85'
warning: after line '43.85', expected '43.86', got '43.83'
warning: after line '43.84', expected '43.85', got '43.86'
warning: after line '43.146', expected '43.147', got '43.163'
warning: after line '43.164', expected '43.165', got '43.147'
warning: after line '43.162', expected '43.163', got '43.165'
warning: after line '43.281', expected '43.282', got '43.283'
warning: after line '43.283', expected '43.284', got '43.282'
warning: after line '43.282', expected '43.283', got '43.284'
warning: after line '44.29', expected '44.30', got '44.33'
warning: after line '44.33', expected '44.34', got '44.30'
warning: after line '44.32', expected '44.33', got '44.34'
warning: after line '44.118', expected '44.119', got '44.121'
warning: after line '44.122', expected '44.123', got '44.119'
warning: after line '44.120', expected '44.121', got '44.123'
warning: after line '44.136', expected '44.137', got '44.138'
warning: after line '44.138', expected '44.139', got '44.147'
warning: after line '44.147', expected '44.148', got '44.139'
warning: after line '44.145', expected '44.146', got '44.145'
warning: after line '44.146', expected '44.147', got '44.148'
warning: after line '46.14', expected '46.15', got '46.16'
warning: after line '46.17', expected '46.18', got '46.15'
warning: after line '46.15', expected '46.16', got '46.18'
warning: after line '46.245', expected '46.246', got '46.247'
warning: after line '46.247', expected '46.248', got '46.246'
warning: after line '46.246', expected '46.247', got '46.248'
warning: after line '46.250', expected '46.251', got '46.252'
warning: after line '46.252', expected '46.253', got '46.251'
warning: after line '46.251', expected '46.252', got '46.253'
warning: after line '47.224', expected '47.225', got '47.226'
warning: after line '47.226', expected '47.227', got '47.225'
warning: after line '47.225', expected '47.226', got '47.227'
warning: after line '48.290', expected '48.291', got '48.292'
warning: after line '48.298', expected '48.299', got '48.291'
warning: after line '48.291', expected '48.292', got '48.299'
warning: after line '48.337', expected '48.338', got '48.339'
warning: after line '48.339', expected '48.340', got '48.338'
warning: after line '48.338', expected '48.339', got '48.340'
warning: after line '48.351', expected '48.352', got '48.353'
warning: after line '48.353', expected '48.354', got '48.352'
warning: after line '48.352', expected '48.353', got '48.354'
warning: after line '48.587', expected '48.588', got '48.589'
warning: after line '48.589', expected '48.590', got '48.588'
warning: after line '48.588', expected '48.589', got '48.590'
warning: after line '48.590', expected '48.591', got '48.592'
warning: after line '48.593', expected '48.594', got '48.591'
warning: after line '48.591', expected '48.592', got '48.594'
warning: after line '48.908', expected '48.909', got '48.910'

src/tei2csv "Quint. Smyrn." "corpus/quintussmyrnaeus.xml" > "corpus/quintussmyrnaeus.csv"
warning: after line '2.613', expected '2.614', got '2.612a'
warning: after line '2.612a', expected '2.613', got '2.614'
warning: after line '7.299', expected '7.300', got '7.319a'
warning: after line '7.319a', expected '7.320', got '7.300'
warning: after line '13.432', expected '13.433', got '13.432b'

src/tei2csv "Theoc." "corpus/theocritus.xml" > "corpus/theocritus.csv"
warning: after line '1.108', expected '1.109', got '1.110'
warning: after line '2.63', expected '2.64', got '2.65'
warning: after line '5.42', expected '5.43', got '5.45'
warning: after line '5.71', expected '5.72', got '5.70'
warning: after line '10.20', expected '10.21', got '10.20'
warning: after line '14.13', expected '14.14', got '14.10'
warning: after line '15.7', expected '15.8', got '15.5'
warning: after line '15.15', expected '15.16', got '15.15'
warning: after line '15.25', expected '15.26', got '15.25'
warning: after line '15.31', expected '15.32', got '15.31'
warning: after line '15.41', expected '15.42', got '15.41'
warning: after line '15.60', expected '15.61', got '15.60'
warning: after line '15.67', expected '15.68', got '15.65'
warning: after line '15.76', expected '15.77', got '15.75'
warning: after line '21.40', expected '21.41', got '21.40'
warning: after line '21.65', expected '21.66', got '21.65'
warning: after line '22.70', expected '22.71', got '22.70'
warning: after line '27.11', expected '27.12', got '27.10'
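
For readers unfamiliar with how warnings of this shape arise, here is a minimal sketch of a sequence check that produces messages in the same format. It is not the code in tei2csv. The names expected_next and check_order are hypothetical, and the rule that the expected label is the leading integer of the line part plus one is inferred from the output above (e.g. '5.115a' is followed by an expected '5.116').

```python
import re

def expected_next(label):
    """Given 'book.line', return the label expected to follow it, or None."""
    book, _, line = label.rpartition(".")
    match = re.match(r"\d+", line)
    if match is None:
        return None  # no leading number, so nothing to predict
    prefix = book + "." if book else ""
    return f"{prefix}{int(match.group()) + 1}"

def check_order(labels):
    """Return a warning for each label that does not follow its predecessor."""
    warnings = []
    prev = None
    for label in labels:
        if prev is not None:
            want = expected_next(prev)
            # Assumption: numbering restarts with each book, so only compare
            # labels within the same book.
            same_book = prev.rpartition(".")[0] == label.rpartition(".")[0]
            if same_book and want is not None and label != want:
                warnings.append(
                    f"warning: after line '{prev}', expected '{want}', got '{label}'"
                )
        prev = label
    return warnings

for w in check_order(["2.613", "2.612a", "2.614"]):
    print(w)
```

A single displaced line typically yields two or three consecutive warnings under this scheme, which fits the clusters in the Dion. output.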