snikproject / snik-tag

Tagging Tool for the SNIK Project. See
https://snikproject.github.io/snik-tag/
MIT License

Entities found in DOCX document with no highlighting #39

Closed by KonradHoeffner 3 years ago

KonradHoeffner commented 3 years ago

@AlfredWinter procedure in January version:

KonradHoeffner commented 3 years ago

The XPath expression seems to be wrong for that DOCX format.

There are w:i elements, but their w:val is set to false, which we currently don't account for: the expression //w:r[w:rPr/w:i] matches every run that has a w:i element, regardless of its value.

<w:r>
<w:rPr>
<w:b w:val="false"/>
<w:bCs w:val="false"/>
<w:i w:val="false"/>
<w:iCs w:val="false"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t xml:space="preserve">
In chapter 3, we have discussed the technological perspective of health information systems. We will now examine how health information systems have to be managed so that they will fulfil the requirements of the stakeholders as presented in chapter .
</w:t>
</w:r>

The XPath expression needs to be rewritten so that it skips elements set to false. The same applies to the expressions for bold and underlined.

KonradHoeffner commented 3 years ago

//w:r[w:rPr/w:i[not(@w:val='false')]]
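To illustrate what the corrected expression selects, here is a minimal sketch in Python using only the standard library. Since `xml.etree.ElementTree` does not support the XPath `not()` function, the predicate is expressed in Python instead of being evaluated as XPath. The sample XML and the names `active_runs`, `run_text` and the `tag` parameter are made up for this illustration and are not taken from the snik-tag parser.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in DOCX files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample with four runs: deactivated italic, italic without
# w:val, italic with w:val="true", and a run without any run properties.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:r><w:rPr><w:i w:val="false"/></w:rPr><w:t>plain</w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>italic</w:t></w:r>
  <w:r><w:rPr><w:i w:val="true"/></w:rPr><w:t>also italic</w:t></w:r>
  <w:r><w:t>no properties</w:t></w:r>
</w:body>"""


def active_runs(root, tag="i"):
    """Return the w:r runs whose w:rPr has a w:<tag> child not set to 'false'.

    Mirrors //w:r[w:rPr/w:<tag>[not(@w:val='false')]]: a missing w:val
    attribute counts as active, just like in the XPath expression.
    """
    runs = []
    for run in root.iter(f"{{{W}}}r"):
        toggles = run.findall(f"w:rPr/w:{tag}", NS)
        if any(t.get(f"{{{W}}}val") != "false" for t in toggles):
            runs.append(run)
    return runs


def run_text(run):
    """Concatenate the text of all w:t children of a run."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


root = ET.fromstring(SAMPLE)
italic_texts = [run_text(r) for r in active_runs(root)]
print(italic_texts)  # ['italic', 'also italic']
```

Passing `tag="b"` applies the same check to bold. Like the expression above, this sketch only treats the literal value `false` as deactivated. OOXML may also encode a disabled toggle as `0` or `off`, which would need an extended check if such files turn up.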

KonradHoeffner commented 3 years ago

There were two problems:

  1. That DOCX file contained formatting elements that were present but deactivated (w:val="false"). The parser was modified to account for that.
  2. That DOCX file also has highlighting in the headings, which needs to be changed in the DOCX file itself.