giellalt / bugzilla-dummy

xml entity references not converted in ccat (Bugzilla Bug 340) #863

Closed albbas closed 17 years ago

albbas commented 18 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 340

Date: 2006-09-05T16:12:15+02:00 From: Saara Huhmarniemi <> To: Tomi Pieski <> CC: trond.trosterud

Last updated: 2006-11-20T12:39:11+01:00

albbas commented 18 years ago

Comment 1128

Date: 2006-09-05 16:12:15 +0200 From: Saara Huhmarniemi <>

Some of the characters in the XML documents are (or at least should be) coded as XML entities: &amp; for &, &quot; for ", &lt; for <, &gt; for >, etc.

Some of these characters cannot be unescaped because of requirements in xmltwig. ccat does not handle these entities correctly: &lt; becomes <^@t; (or actually something different which looks like this). See e.g. ccat -l sme -r zcorp/bound/sme/facta/callinravvagat.pdf.xml | less

This has consequences for the analysis and also for the conversion to the ims corpus database.
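For reference, the entity round trip described above can be sketched in Python. This is only an illustration of what correct escaping and unescaping of the listed entities looks like; it is not how ccat works. The helper names `to_entities` and `from_entities` are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# escape()/unescape() handle &, < and > by default;
# the double quote has to be supplied explicitly.
QUOTE = {'"': "&quot;"}
UNQUOTE = {"&quot;": '"'}


def to_entities(text):
    """Replace &, <, > and " with their XML entity references."""
    return escape(text, QUOTE)


def from_entities(text):
    """Turn the entity references back into plain characters."""
    return unescape(text, UNQUOTE)


sample = 'a < b & "c" > d'
encoded = to_entities(sample)
decoded = from_entities(encoded)

print(encoded)
print(decoded)
```

A tool that prints corpus text for analysis would be expected to produce the decoded form, without stray bytes inside the entity names.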

albbas commented 17 years ago

Comment 1200

Date: 2006-11-20 12:39:11 +0100 From: Saara Huhmarniemi <>

This bug is fixed by reading the files straight into the XML structure. This means that ccat is not used when the corpus is analyzed. There were also other reasons to do that, e.g. the requirement to access the metainformation in the header.
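A minimal sketch of the approach, assuming a Python XML parser rather than the actual tools used in the project: when the file is parsed into a tree, the parser resolves entity references itself, and the header is available alongside the text. The element names (`document`, `header`, `title`, `body`, `p`) and the function `read_document` are assumptions made for illustration, not the real corpus format.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus document; the element names are illustrative only.
SAMPLE = """<document>
  <header><title>Example &amp; test</title></header>
  <body><p>a &lt; b &quot;quoted&quot; c &gt; d</p></body>
</document>"""


def read_document(xml_text):
    """Parse the document and return (title, list of paragraph texts)."""
    root = ET.fromstring(xml_text)
    # Metainformation from the header is directly accessible in the tree.
    title = root.findtext("header/title")
    # Entity references are already resolved by the parser.
    paragraphs = [p.text or "" for p in root.iter("p")]
    return title, paragraphs


title, paragraphs = read_document(SAMPLE)
print(title)
print(paragraphs)
```

Because the parser does the unescaping, no separate text-extraction step has to deal with entity references.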