Generate triples from TEI XML documents

kuhumcst / glossematics

The life of Louis Hjelmslev.

https://glossematics.dk


Generate triples from TEI XML documents #18

Closed simongray closed 2 years ago

simongray commented 3 years ago

Regardless of which database is eventually going to be used (see: #4), it is 99% likely that I will be using a triplestore of some kind. There is functionality available in Cuphic (see: https://github.com/kuhumcst/cuphic/issues/1) to facilitate this, although it may have to be tweaked further.

I now have access to the university's "N drive" where it should be possible to find sample data. The task is now to go through each document in a list of documents and return valid metadata triples. These triples should be derived from three sources: the actual metadata in the TEI header, metadata embedded in the contents, and possibly implied metadata that can be derived from the content, e.g. the presence of certain words or some other feature.
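To make the three sources concrete, here is a minimal sketch in Python using only the standard library. It is not the Cuphic-based implementation and says nothing about how Cuphic works. The sample document, the predicate names (`title`, `author`, `mentions`, `has-keyword`), the keyword list, and the function names `triples_from_tei` and `triples_from_documents` are all assumptions made up for illustration; the actual elements to extract would follow the TEI manual for the project.

```python
import xml.etree.ElementTree as ET

# The standard TEI namespace; the "tei" prefix is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample document (hypothetical, not taken from the real data set).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample letter</title>
        <author>Example Author</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>A note on glossematics sent to <persName>Example Person</persName>.</p>
    </body>
  </text>
</TEI>"""


def triples_from_tei(doc_id, xml_string, keywords=("glossematics",)):
    """Return a list of (subject, predicate, object) tuples for one document."""
    root = ET.fromstring(xml_string)
    triples = []

    # 1. Explicit metadata from the TEI header.
    for tag in ("title", "author"):
        for el in root.findall(f".//tei:teiHeader//tei:{tag}", TEI_NS):
            if el.text and el.text.strip():
                triples.append((doc_id, tag, el.text.strip()))

    # 2. Metadata embedded in the contents (here: tagged person names).
    for el in root.findall(".//tei:text//tei:persName", TEI_NS):
        name = "".join(el.itertext()).strip()
        if name:
            triples.append((doc_id, "mentions", name))

    # 3. Implied metadata derived from the content (here: keyword presence).
    text_el = root.find(".//tei:text", TEI_NS)
    body = "".join(text_el.itertext()).lower() if text_el is not None else ""
    for kw in keywords:
        if kw.lower() in body:
            triples.append((doc_id, "has-keyword", kw))

    return triples


def triples_from_documents(documents):
    """Collect triples for a mapping of document id -> TEI XML string."""
    result = []
    for doc_id, xml_string in documents.items():
        result.extend(triples_from_tei(doc_id, xml_string))
    return result


if __name__ == "__main__":
    for triple in triples_from_documents({"doc1": SAMPLE}):
        print(triple)
```

Plain tuples keep the sketch independent of whichever triplestore ends up being chosen in #4; each tuple could later be mapped onto that store's own transaction format.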

simongray commented 3 years ago

Currently I need to gain access to the drive where the Infrastrukturisme data is
- ... so that I can get the latest sample data
- ... as well as the TEI manual, which I will formally base most of the extraction code on.

simongray commented 3 years ago

New unified XML parser in Cuphic: https://github.com/kuhumcst/cuphic/commit/40a32dd302fbde9e4b6334c7c120ebad32b83922

This is now the common platform from which both the frontend UI is generated and the backend database metadata is to be extracted.