swerik-project / riksdagen-records


Create IDs for speeches #13

Open ninpnin opened 2 months ago

ninpnin commented 2 months ago

Current options

BobBorges commented 2 months ago

I think wrapping speeches in divs is short-sighted for a "living" resource: we don't know how many things like this will be tagged, or whether all potential things to tag will fit the hierarchical structure required by XML divs. Metadata blocks would allow tagging any feature independently of other tagged features, without creating a bottomless pit of divs.

MansMeg commented 2 months ago

Yes. This is a really good point; it's a more future-proof approach.

ninpnin commented 2 months ago

I just realized we could use the n attributes that are available for all elements. From the documentation,

n (number) gives a number (or other label) for an element, which is not necessarily unique within the document.

Then, we would just include the ID in all u elements that belong to the speech. E.g., for the following speech with the ID i-AzXa4EUmTu6mz8YQsCpizb:

<note type="speaker" xml:id="i-G36fJpDJFVqwFFQjbknRq2">
  Herr ERIKSSON i Bäeckmora (cp):
</note>
<u xml:id="i-3KxGSd288AdTa9bfy9BtMv" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" next="i-QzTu4nNrn4q8kU1N1u4xZC" who="i-7CXHDen9y2qKcYDisT3zjQ">
  <seg xml:id="i-EV3wMeu3xQ8QuzNwmvWbjM">
    Herr talman! I det som statsrådet Palme sade nu fanns väl egentligen
    [...]
    som en stor del av svenska folket bestämt önskar få ändring i.
  </seg>
</u>
<u xml:id="i-QzTu4nNrn4q8kU1N1u4xZC" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" prev="i-3KxGSd288AdTa9bfy9BtMv" who="i-7CXHDen9y2qKcYDisT3zjQ" next="i-CUrwEDJ9XoTNrw9wfqWrYb">
  <seg xml:id="i-4mZ33Z1km8JtDLieMQPm5Q">
    Jag anser att statsrådet Palme på denna punkt också skulle uppta
    allvar-
  </seg>
</u>
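
Reading the speeches back out would then be a matter of grouping u elements by that attribute. A minimal sketch, assuming the attribute is written as in the example above (xml:n); the sample IDs (u1, speech-A) and the function name group_by_speech are made up for illustration, and the sample omits the TEI namespace that the real files declare, so the tag lookup would need adjusting there.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The xml: prefix is predefined, so ElementTree expands it to this namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, shortened stand-in for a protocol fragment (no TEI namespace).
SAMPLE = """<div>
  <u xml:id="u1" xml:n="speech-A" next="u2"><seg xml:id="s1">First part.</seg></u>
  <u xml:id="u2" xml:n="speech-A" prev="u1"><seg xml:id="s2">Second part.</seg></u>
  <u xml:id="u3" xml:n="speech-B"><seg xml:id="s3">Another speech.</seg></u>
</div>"""


def group_by_speech(root):
    """Map each speech ID (the n value) to the xml:id of its u elements, in document order."""
    speeches = defaultdict(list)
    for u in root.iter("u"):
        speech_id = u.get(f"{{{XML_NS}}}n")
        if speech_id is not None:  # skip u elements that carry no speech ID
            speeches[speech_id].append(u.get(f"{{{XML_NS}}}id"))
    return dict(speeches)


speeches = group_by_speech(ET.fromstring(SAMPLE))
```

Note that this only works while each u carries exactly one value in the attribute.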

BobBorges commented 2 months ago

What happens when this same fragment gets tagged as multiple things? Do we have multiple n attribs, or multiple IDs in the n?

ninpnin commented 2 months ago

Which fragment are you referring to?

MansMeg commented 2 months ago

We should follow the TEI guidelines. n should be used for page number. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/TS.html#TSBAUT https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-u.html

ninpnin commented 2 months ago

Where does it say that? Here it says

(number) gives a number (or other label) for an element, which is not necessarily unique within the document.

It might be a page number for pb elements, but for u elements I find no such description.

BobBorges commented 2 months ago

@ninpnin I refer to the fragment you posted as an example. It's tagged as a speech with ID, but down the line it may be tagged with other things... an interpellation debate, or some other type of sectioning that may or may not coincide exactly with the speech itself. So how does the approach you describe handle multiple possible xml:n values?

ninpnin commented 2 months ago

@BobBorges Debates are more suited for div-wrapping, along with any non-overlapping sectioning. But if we have other possibly overlapping things, I unfortunately have no solution for that.
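
To make the "non-overlapping" condition concrete: div-wrapping only works if any two sections are either disjoint or one sits fully inside the other. A rough sketch of that check, assuming each section is described by the (first, last) positions of the u elements it covers; the function name can_wrap_in_divs and the span representation are hypothetical, not something in the codebase.

```python
def can_wrap_in_divs(spans):
    """Return True if every pair of inclusive (first, last) spans is disjoint or nested.

    Only such sets of spans can be expressed as a hierarchy of divs.
    """
    for i, (a_first, a_last) in enumerate(spans):
        for b_first, b_last in spans[i + 1:]:
            disjoint = a_last < b_first or b_last < a_first
            a_inside_b = b_first <= a_first and a_last <= b_last
            b_inside_a = a_first <= b_first and b_last <= a_last
            if not (disjoint or a_inside_b or b_inside_a):
                return False  # partial overlap: no valid div hierarchy
    return True


# A debate covering utterances 0-5 that contains two speeches: fine.
nested_ok = can_wrap_in_divs([(0, 5), (1, 2), (3, 5)])
# A speech (0-3) and some other section (2-5) that cuts across it: not fine.
crossing_ok = can_wrap_in_divs([(0, 3), (2, 5)])
```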

BobBorges commented 2 months ago

It seems like putting these things as element lists in the TEI header would be the most flexible, and the cleanest in the case when a human has to look at the XML.

ninpnin commented 2 months ago

Does the schema allow for that?

MansMeg commented 2 months ago

I agree with Bob that, for now, the solution "List speeches in the metadata block" sounds like the best one. Would that work with the TEI schema?

I also added a third option: make each speech a single u block and use paragraph breaks within each utterance instead. It is semantically closer to the TEI schema than how we solve it now (and the ID of the u block would be the speech ID). But still, I think the first solution is best.
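
For what the third option would involve, here is a rough sketch that follows the next/prev chain and folds the follow-up u elements into the first one, keeping each seg as a paragraph. It assumes next holds a bare xml:id as in the example earlier in the thread (a leading # is tolerated); the sample IDs and the function name merge_chained_utterances are made up, and the TEI namespace is left out.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: u1 -> u2 -> u3 form one speech, u4 is another.
SAMPLE = """<div>
  <u xml:id="u1" next="u2"><seg xml:id="s1">First paragraph.</seg></u>
  <u xml:id="u2" prev="u1" next="u3"><seg xml:id="s2">Second paragraph.</seg></u>
  <u xml:id="u3" prev="u2"><seg xml:id="s3">Third paragraph.</seg></u>
  <u xml:id="u4"><seg xml:id="s4">A different speech.</seg></u>
</div>"""


def merge_chained_utterances(parent):
    """Fold every next-chained u into the u that starts the chain."""
    by_id = {u.get(XML_ID): u for u in parent.findall("u")}
    for head in list(by_id.values()):
        if head.get("prev") is not None:
            continue  # not the start of a chain; it gets merged via its head
        next_id = head.attrib.pop("next", None)
        while next_id is not None:
            follower = by_id[next_id.lstrip("#")]
            head.extend(list(follower))  # move the seg children over
            parent.remove(follower)
            next_id = follower.get("next")
    return parent


merged = merge_chained_utterances(ET.fromstring(SAMPLE))
```

After this, the xml:id of the remaining u is the only ID the speech needs.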

BobBorges commented 2 months ago

There are a couple of options (parlaclarin):
• <TEI><standOff> contains all kinds of stuff that could be useful here
• <teiHeader><profileDesc><textDesc> has domain, interaction, purpose
• <teiHeader> has a \ elem which takes any kind of metadata in whatever format (dangerously flexible :) )

From the examples parlaclarin gives, it looks like standOff is the closest to what we want, but we could also consider something like listGrp with a type attrib, an id, and sub elems that contain a referring ID for the segs we want to label.
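
As a rough idea of what a stand-off listing could look like when generated, here is a sketch that builds one entry per speech pointing at its first and last u. The choice of spanGrp/span with from/to inside standOff is an assumption for illustration, not something checked against our schema or the parlaclarin one, and build_speech_index and the sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_speech_index(speeches):
    """Build a standOff element listing each speech as a span over its u elements.

    speeches maps a speech ID to the ordered xml:id values of its u elements.
    """
    stand_off = ET.Element("standOff")
    # One group per kind of annotation, so other (possibly overlapping)
    # sectionings could live in sibling groups with a different type.
    group = ET.SubElement(stand_off, "spanGrp", {"type": "speech"})
    for speech_id, u_ids in speeches.items():
        ET.SubElement(
            group,
            "span",
            {XML_ID: speech_id, "from": f"#{u_ids[0]}", "to": f"#{u_ids[-1]}"},
        )
    return stand_off


index = build_speech_index({"speech-A": ["u1", "u2"], "speech-B": ["u3"]})
xml_text = ET.tostring(index, encoding="unicode")
```

Because each kind of label sits in its own group, nothing in the u elements themselves has to change when a new kind of tagging is added.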

MansMeg commented 1 month ago