Closed by albbas 17 years ago
Date: 2006-09-05 16:12:15 +0200
From: Saara Huhmarniemi
Some of the characters in the xml-documents are (or at least should be) coded as xml-entities: &amp; for &, &quot; for ", &lt; for <, &gt; for >, etc.
Some of the characters cannot be unescaped because of the requirements in xmltwig. ccat does not handle these entities correctly: &lt; becomes <^@t; (or actually something different which looks like this). See e.g. ccat -l sme -r zcorp/bound/sme/facta/callinravvagat.pdf.xml |less
This affects the analysis and also the conversion to the ims corpus database.
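For reference, the expected mapping between the characters and their entities can be sketched as a round trip. This is only an illustration in Python using the standard library's xml.sax.saxutils, not the ccat code; the helper names to_entities and from_entities are made up for the example.

```python
from xml.sax.saxutils import escape, unescape

# escape()/unescape() only handle &amp;, &lt; and &gt; by default,
# so the quote entity has to be passed in explicitly.
QUOTE_TO_ENTITY = {'"': "&quot;"}
ENTITY_TO_QUOTE = {"&quot;": '"'}


def to_entities(text: str) -> str:
    """Replace &, <, > and " with their XML entities (hypothetical helper)."""
    return escape(text, QUOTE_TO_ENTITY)


def from_entities(text: str) -> str:
    """Turn the four entities back into plain characters (hypothetical helper)."""
    return unescape(text, ENTITY_TO_QUOTE)


raw = 'a < b & "c" > d'
encoded = to_entities(raw)
decoded = from_entities(encoded)
print(encoded)
print(decoded)
```

A correct tool should give back the original text after the round trip; the output described above for ccat does not.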
Date: 2006-11-20 12:39:11 +0100
From: Saara Huhmarniemi
This bug is fixed by reading the files directly into the xml-structure. This means that ccat is no longer used when the corpus is analyzed. There were also other reasons to do that, e.g. the requirement to access the metainformation in the header.
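A minimal sketch of the idea, assuming Python and its standard xml.etree.ElementTree parser rather than the tools actually used for the fix. The document layout (document, header, title, body, p) and the function read_document are hypothetical and only stand in for the real corpus format. The parser decodes the entities itself, and the header is available from the same tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout; the real corpus files may look different.
SAMPLE = """<document>
  <header><title>Example &amp; title</title></header>
  <body><p>1 &lt; 2 &amp; 3 &gt; 2</p></body>
</document>"""


def read_document(xml_text: str):
    """Parse the XML directly and return the header title and paragraph texts."""
    root = ET.fromstring(xml_text)
    # Metainformation comes from the same parsed tree as the text.
    title = root.findtext("header/title")
    # itertext() yields text with entities already decoded by the parser.
    paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
    return title, paragraphs


title, paragraphs = read_document(SAMPLE)
print(title)
print(paragraphs)
```

With this approach no separate unescaping step is needed, since the text never passes through an intermediate plain-text dump.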
This issue was created automatically with bugzilla2github
Bugzilla Bug 340
Date: 2006-09-05T16:12:15+02:00 From: Saara Huhmarniemi
To: Tomi Pieski
CC: trond.trosterud
Last updated: 2006-11-20T12:39:11+01:00