quadrama / DramaNLP

UIMA NLP components for dramatic texts
Apache License 2.0

Find out why we get empty cells in column Speaker.Figure_id #78

nilsreiter closed this issue 4 years ago

nilsreiter commented 5 years ago

Related to https://github.com/quadrama/DramaAnalysis/issues/157.

pagelj commented 4 years ago

This happens when a speaker ID referenced in the text (the who attribute of <sp>) does not match any xml:id in <listPerson>.

For example, in Faust II (11d11.0), there is a chorus at the beginning:

 <sp who="#chor">
     <speaker>CHOR</speaker>
     <stage>
         <hi>einzeln, zu zweien und vielen, abwechselnd und gesammelt.</hi>
     </stage>
     <lg>
         <l>Wenn sich lau die Lüfte füllen</l>
         <l>Um den grünumschränkten Plan,</l>
         <l>Süße Düfte, Nebelhüllen</l>
         <l>Senkt die Dämmerung heran.</l>
     </lg>

while the character is declared as

 <person xml:id="chor_anmutige_gegend">
     <persName>CHOR (ANMUTIGE GEGEND)</persName>
 </person>

in <listPerson>, so the reference #chor does not resolve to any declared person.
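
To find such cases across a play, the mismatch can be checked mechanically. The following is a minimal Python sketch, not the DramaNLP reader's actual code: it assumes the file uses the standard TEI namespace, and the function name unresolved_speakers and the reduced sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unresolved_speakers(tei_xml):
    """Return the speaker IDs used in <sp who="..."> that no <person> declares.

    Hypothetical helper for illustration; only <person> entries are
    considered here, which mirrors the situation described in this issue.
    """
    root = ET.fromstring(tei_xml)
    declared = set()
    for person in root.iter(f"{{{TEI_NS}}}person"):
        xml_id = person.get(XML_ID)
        if xml_id:
            declared.add(xml_id)
    referenced = set()
    for sp in root.iter(f"{{{TEI_NS}}}sp"):
        # The who attribute may hold several space-separated "#id" pointers.
        for ref in sp.get("who", "").split():
            referenced.add(ref.lstrip("#"))
    return referenced - declared


# Reduced stand-in for the Faust II example above (not the full file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="chor_anmutige_gegend">
      <persName>CHOR (ANMUTIGE GEGEND)</persName>
    </person>
  </listPerson>
  <sp who="#chor"><speaker>CHOR</speaker></sp>
</TEI>"""

print(unresolved_speakers(SAMPLE))
```

Every ID reported this way is a speaker that would probably end up with an empty Speaker.Figure_id cell.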

pagelj commented 4 years ago

After checking some more examples, this primarily affects group characters such as "alle" (all).

pagelj commented 4 years ago

This has been fixed in the current GerDraCor (GDC) version for the example above:

 <sp who="#chor_anmutige_gegend">
     <speaker>CHOR</speaker>
     <stage>einzeln, zu zweien und vielen, abwechselnd und gesammelt.</stage>

pagelj commented 4 years ago

So upgrading the qd corpus to the current GDC version should fix the problem.

pagelj commented 4 years ago

Also, the GerDraCor reader needs to capture <personGrp> tags, since group characters are declared there rather than in <person>.
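
The effect of ignoring <personGrp> can be illustrated with a short Python sketch. This is not the reader's actual (Java/UIMA) implementation; the helper declared_ids, the sample cast list, and the IDs faust and alle are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def declared_ids(root, tags=("person", "personGrp")):
    """Collect the xml:id of every element with one of the given TEI tags.

    Hypothetical helper: by default it looks at both individual characters
    (<person>) and group characters (<personGrp>).
    """
    ids = set()
    for tag in tags:
        for elem in root.iter(f"{{{TEI_NS}}}{tag}"):
            xml_id = elem.get(XML_ID)
            if xml_id:
                ids.add(xml_id)
    return ids


# Illustrative cast list; the IDs are invented, not taken from a corpus file.
CAST = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="faust"><persName>FAUST</persName></person>
    <personGrp xml:id="alle"><name>ALLE</name></personGrp>
  </listPerson>
</TEI>"""

root = ET.fromstring(CAST)
persons_only = declared_ids(root, tags=("person",))  # misses the group
with_groups = declared_ids(root)                     # includes the group
print(sorted(persons_only), sorted(with_groups))
```

With only <person> collected, a speech tagged who="#alle" has nothing to resolve against, which matches the empty cells seen for group characters.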

pagelj commented 4 years ago

Fixed in 2b9f6239b0037af83019735e9beae5c6057aea35