Bugzilla Bug 340

Date: 2006-09-05T16:12:15+02:00 From: Saara Huhmarniemi <> To: Tomi Pieski <> CC: trond.trosterud

Last updated: 2006-11-20T12:39:11+01:00

albbas commented 18 years ago

Comment 1128

Date: 2006-09-05 16:12:15 +0200 From: Saara Huhmarniemi <>

Some of the characters in the xml-documents are (or at least should be) coded as xml-entities: &amp; for &, &quot; for ", &lt; for <, &gt; for >, etc.
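For illustration only, the mapping between these characters and their entities can be sketched in Python with the standard library. This is not part of ccat or xmltwig, and the sample string is made up.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical sample text; not taken from the corpus.
raw = 'a < b & "c" > d'

# escape() handles &, < and > by default; the quote mapping is passed explicitly.
escaped = escape(raw, {'"': "&quot;"})

# unescape() reverses the mapping when given the inverse table.
restored = unescape(escaped, {"&quot;": '"'})
```

A round trip through both functions should give back the original string, which is the behaviour one would expect from a tool that prints corpus text.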

Some of the characters cannot be unescaped because of the requirements in xmltwig. ccat does not handle these entities correctly: &lt; becomes <^@t; (or actually something different which looks like this). See e.g. ccat -l sme -r zcorp/bound/sme/facta/callinravvagat.pdf.xml |less

This has consequences for the analysis and also for the conversion to the ims corpus database.

albbas commented 17 years ago

Comment 1200

Date: 2006-11-20 12:39:11 +0100 From: Saara Huhmarniemi <>

This bug is fixed by reading the files straight into the XML structure. This means that ccat is not used when the corpus is analyzed. There were also other reasons to do that, e.g. the requirement to access the metainformation in the header.
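As a rough sketch of the general idea (not the actual fix, which is not shown in this report), reading a file directly with an XML parser resolves entity references and gives access to the header at the same time. The element names and the sample document below are assumptions made for illustration, not the corpus schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are assumptions, not the corpus schema.
SAMPLE = """<document>
  <header><title>Test &amp; example</title></header>
  <body><p>1 &lt; 2</p></body>
</document>"""

root = ET.fromstring(SAMPLE)

# The parser resolves entity references, so the text arrives unescaped.
title = root.findtext("header/title")

# Body text is collected from the parsed tree instead of from printed output.
paragraphs = [p.text for p in root.iter("p")]
```

Because the text is taken from the parsed tree rather than from an intermediate text dump, no separate unescaping step is needed in this sketch.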

giellalt / bugzilla-dummy

xml entity references not converted in ccat (Bugzilla Bug 340) #863
