scaife-viewer / beyond-translation-site

Site used to iterate on translation alignments within the Scaife Viewer ecosystem

Ingest and display LSJ #62

Closed jacobwegner closed 2 years ago

jacobwegner commented 2 years ago

Partial ingestion is visible on beyond-translation-gagdt-dev.

Need to flesh out concrete steps to properly ingest entries, improve formatting, handle nested senses, etc.

jacobwegner commented 2 years ago

@jtauber and I took a look at some of the underlying XML for Cunliffe and LSJ and have pivoted a bit on this.

There is variance between the two dictionaries in how bibl and cit elements are nested. There are also parts of the LSJ which seem to be missing sense elements from the markup.

(I can try and come back to this comment later and excerpt them).

We think a good "middle of the road" approach for LSJ (which will also benefit parts of Cunliffe too) is to:

jacobwegner commented 2 years ago

Been working with a sort of "playground" locally to experiment with the XSL transformations for LSJ.

I'm working to approximate the HTML from Logeion; Logeion has a lot of markup differences that we know we won't have in https://github.com/PerseusDL/lexica.

Logeion:

image

My WIP extraction from lexica:

image

Those links resolve to "work-level" URNs on catalog.perseus.org.

https://gist.github.com/jacobwegner/7c82e85201ea99365cb1528ae8b506bf#file-lsj-aeido-html

jacobwegner commented 2 years ago

Still have a few white-space issues to resolve.

I am also going to punt on sense extraction for this first pass.

1) headwords and blobs
2) headword, definition blob, sense blob
3) citations from definition and sense (allowing expansion)

jacobwegner commented 2 years ago

I have the blob extraction (two entries only) deployed now:

https://beyond-translation-gagdt-dev.herokuapp.com/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1?entryUrn=urn%3Acite2%3Ascafife-viewer%3Adictionary-entries.atlas_v1%3Alsj-1

image

jacobwegner commented 2 years ago

@jtauber and I did a good first pass over markup / betacode issues. We do want to discuss a couple of underlying XML issues on today's call. This Gist has entries for ἄειδε:

https://gist.github.com/jacobwegner/0937195b81f6e13a31cde473987b936c

jacobwegner commented 2 years ago

@gregorycrane I think I've made enough progress on normalizing the headwords for LSJ to consider it "good enough" for release to the wider site.

Here's an example below for urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1@θεά:

The headword defined in the Cunliffe XML is θεά:

https://beyond-translation-dev.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1?entryUrn=urn%3Acite2%3AexploreHomer%3Aentries.atlas_v1%3A1.4731

image

But in LSJ the headword is θεά1:

https://beyond-translation-dev.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1?entryUrn=urn%3Acite2%3Ascafife-viewer%3Adictionary-entries.atlas_v1%3Alsj-47947

image

We normalize the differing alpha variants (and strip the 1 from the LSJ entry) so that both entries are resolved for that lemma.
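A rough sketch of what that normalization could look like in Python. This is not the code from the repo. It assumes the "alpha variants" are the two Unicode encodings of accented alpha (oxia U+1F71 vs. tonos U+03AC), which NFC folds together, and that the homograph number is always a trailing run of digits. `normalize_headword` is a made-up name for illustration.

```python
import re
import unicodedata

# Assumption: homograph numbers (the "1" in "θεά1") only appear as trailing digits.
_TRAILING_DIGITS = re.compile(r"\d+$")


def normalize_headword(headword: str) -> str:
    """Return a lookup key shared by dictionaries that spell a lemma differently."""
    # NFC maps alpha with oxia (U+1F71) onto alpha with tonos (U+03AC),
    # so both encodings of "ά" end up as the same code point.
    composed = unicodedata.normalize("NFC", headword.strip())
    # Drop the homograph number so "θεά1" and "θεά" share one key.
    return _TRAILING_DIGITS.sub("", composed)


# Hypothetical inputs: one headword with oxia plus a homograph number,
# one with tonos and no number.
lsj_key = normalize_headword("θε\u1f711")
cunliffe_key = normalize_headword("θε\u03ac")
print(lsj_key == cunliffe_key)
```

Keying both dictionaries on the normalized form means a lemma lookup can return the Cunliffe and LSJ entries together, while the original headword can still be shown as-is in the entry itself.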

jacobwegner commented 2 years ago

@gregorycrane One last thing I'll leave here (as it is LSJ-specific)

From an LSJ entry for μῆνις: https://beyond-translation-dev.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7?entryUrn=urn%3Acite2%3Ascafife-viewer%3Adictionary-entries.atlas_v1%3Alsj-67481

image

If a user clicks an Iliad or Odyssey reference (e.g. Il. 5.34), we'll load the passage in beyond translation in a new window / tab:

https://beyond-translation.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:5.34

image

For all other references (e.g. Or. 22.265d), we'll load the work entry on catalog.perseus.org:

https://catalog.perseus.org/catalog/urn:cts:greekLit:tlg2001.tlg022

image

jacobwegner commented 2 years ago

(with the idea that we can improve this "resolution" down the line to resolve to scaife.perseus.org or Perseus 4 if we have a matching edition)
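For illustration, the current resolution rule could be sketched like this in Python. It is a hedged approximation, not the site's implementation: `resolve_citation` and `READER_EDITIONS` are hypothetical names, it assumes the citation has already been parsed into a work URN and a passage reference, and the Odyssey edition URN is assumed by analogy with the Iliad one used above.

```python
READER_BASE = "https://beyond-translation.perseus.org/reader/"
CATALOG_BASE = "https://catalog.perseus.org/catalog/"

# Work URN -> edition URN that the reader can load.
# The Iliad edition appears in this thread; the Odyssey edition is an assumption.
READER_EDITIONS = {
    "urn:cts:greekLit:tlg0012.tlg001": "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2",
    "urn:cts:greekLit:tlg0012.tlg002": "urn:cts:greekLit:tlg0012.tlg002.perseus-grc2",
}


def resolve_citation(work_urn: str, passage: str) -> str:
    """Pick a link target for a citation inside a dictionary entry."""
    edition = READER_EDITIONS.get(work_urn)
    if edition is not None:
        # Iliad / Odyssey: open the exact passage in the reader.
        return f"{READER_BASE}{edition}:{passage}"
    # Everything else: fall back to the work-level catalog entry.
    return f"{CATALOG_BASE}{work_urn}"


print(resolve_citation("urn:cts:greekLit:tlg0012.tlg001", "5.34"))
print(resolve_citation("urn:cts:greekLit:tlg2001.tlg022", "22.265d"))
```

Keeping the supported editions in a single mapping would make the later improvement (resolving to scaife.perseus.org or Perseus 4 when a matching edition exists) a matter of adding entries rather than changing the branching.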

jacobwegner commented 2 years ago

This has been released to production via v2022-05-18-001