albbas commented 10 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1802

Date: 2014-01-24T16:25:17+01:00 From: Ciprian Gerstenberger <> To: Børre Gaup <> CC: ciprian.gerstenberger, lene.antonsen, sjur.n.moshagen, trond.trosterud

Last updated: 2014-02-03T20:36:25+01:00

albbas commented 10 years ago

Comment 9008

Date: 2014-01-24 16:25:17 +0100 From: Ciprian Gerstenberger <>

In its current state, the output from the analysis is used both for compiling the corpus for KORP and for linguists' work with cat, ccat, etc. Yet the XML special characters in the elements holding the analysis output are escaped as entities, i.e., &amp; instead of &, &lt; instead of <, etc.

XSLtemplate 1.19 ; file-specific xsl $Revision: 1.1 $; common.xsl $Revision: 68074 $; "<Sámi_Radio>" "Sámi_Radio" MWE N <sme> Prop Sem/Org Sg Gen "<ođđasat>" :

This is not useful.

The content of these elements should be output as CDATA.

Google with "output CDATA xsl xml" for more info. For instance: http://www.w3schools.com/xsl/el_output.asp

This way, Børre doesn't need to maintain two different pipelines, and the old data format can easily be extracted from the XML files enriched with metadata.
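To make the difference concrete, here is a minimal Python sketch (standard library only) that serialises the same analysis string once as an ordinary text node and once as a CDATA section. The element name `analysis` and the helper `build` are assumptions made for illustration; this is not the project's XSLT. In XSLT itself, the corresponding mechanism is the `cdata-section-elements` attribute of `xsl:output`.

```python
from xml.dom.minidom import Document

# Sample analysis text taken from the example in this report.
ANALYSIS = '"<Sámi_Radio>"\n\t"Sámi_Radio" MWE N <sme> Prop Sem/Org Sg Gen'


def build(as_cdata: bool) -> str:
    """Serialise one element, either as escaped text or as a CDATA section."""
    doc = Document()
    elem = doc.createElement("analysis")  # element name is an assumption
    if as_cdata:
        node = doc.createCDATASection(ANALYSIS)
    else:
        node = doc.createTextNode(ANALYSIS)
    elem.appendChild(node)
    return elem.toxml()


escaped = build(as_cdata=False)  # angle brackets are written as entities
cdata = build(as_cdata=True)     # angle brackets are written literally

print(escaped)
print(cdata)
```

Viewed with a plain-text tool such as cat, only the CDATA variant shows the angle brackets of the analysis format unchanged.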

albbas commented 10 years ago

Comment 9050

Date: 2014-01-31 11:48:45 +0100 From: Børre Gaup <>

I think the best way to get at this data is to extract the text of the element. ccat does that now.

Something like this should be possible?

albbas commented 10 years ago

Comment 9055

Date: 2014-01-31 15:44:17 +0100 From: Trond Trosterud <>

And why do you think that is better than using CDATA?

albbas commented 10 years ago

Comment 9056

Date: 2014-01-31 16:18:49 +0100 From: Ciprian Gerstenberger <>

No, it is not. I have talked with Børre and explained the problem. The solution is actually already in this script:

Author: ciprian Date: 2014-01-25 09:05:51 +0100 (láv, 25 ođđj 2014) New Revision: 87530 Added: trunk/gt/script/corpus/correct_cdata.xsl Log: script to correct CDATA of the current analysis output

It is just a matter of changing the pipeline that follows the analysis, a very minimal change.

albbas commented 10 years ago

Comment 9070

Date: 2014-02-03 19:47:48 +0100 From: Børre Gaup <>

I think we'll just keep it the way it is, the debugging is not hampered by the fact that the content of the analysis element contains text instead of cdata.

albbas commented 10 years ago

Comment 9076

Date: 2014-02-03 20:27:00 +0100 From: Trond Trosterud <>

To repeat the issue:

Cip: we have xml entities instead of < and >: "<Sámi_Radio>" "Sámi_Radio" MWE N <sme> Prop Sem/Org Sg Gen

Cip: this will be rendered correctly if we define it as CDATA instead of as text.

Cip: Bonus: one pipeline, not two.

Børre: I think text is better than CDATA.

Trond: I am missing some explicit pros and cons here:

1 - Is it true that content-as-text gives two pipelines, but content-as-CDATA gives one?

2 - Is ccat as fast as cat?

3 - Are there issues other than 1 and 2 that are relevant for the choice of text vs. CDATA?
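One point that bears on these questions: to an XML parser, escaped text and CDATA carry the same content, so the choice only matters for tools that read the raw file. The following Python sketch illustrates this with the standard library. The element name `analysis` is an assumption, and the sample string is shortened from the example above.

```python
import xml.etree.ElementTree as ET

# The same analysis line, stored once with entities and once as CDATA.
escaped = '<analysis>"&lt;Sámi_Radio&gt;" "Sámi_Radio" MWE N &lt;sme&gt; Prop</analysis>'
cdata = '<analysis><![CDATA["<Sámi_Radio>" "Sámi_Radio" MWE N <sme> Prop]]></analysis>'

# The parser resolves the entities and unwraps the CDATA section.
text_from_escaped = ET.fromstring(escaped).text
text_from_cdata = ET.fromstring(cdata).text

print(text_from_escaped)
print(text_from_escaped == text_from_cdata)
```

Any XML-aware step in a pipeline therefore sees identical text in both cases; the difference is visible only to plain-text tools that do not parse the XML.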

albbas commented 10 years ago

Comment 9077

Date: 2014-02-03 20:33:22 +0100 From: Ciprian Gerstenberger <>

Yes, I agree. I already have a script that converts the data as needed, so one small extra step in my pipeline is no problem. I run a proper test anyhow, so this is quite OK with me. Børre, you can just close this bug.

(In reply to comment #4)

I think we'll just keep it the way it is, the debugging is not hampered by the fact that the content of the analysis element contains text instead of cdata.

albbas commented 10 years ago

Comment 9078

Date: 2014-02-03 20:36:25 +0100 From: Børre Gaup <>

And if you run ccat on the analysed files, it spits out clean text versions of them.
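As a rough illustration of what such an extraction amounts to, here is a Python sketch that pulls the text of a named element out of an XML document. It is not ccat's actual implementation; the function name `element_text`, the document layout, and the element names in `SAMPLE` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical analysed file: metadata plus one element with escaped analysis text.
SAMPLE = """<document>
  <header><title>hypothetical metadata</title></header>
  <body>
    <analysis>"&lt;Sámi_Radio&gt;"
\t"Sámi_Radio" MWE N &lt;sme&gt; Prop Sem/Org Sg Gen</analysis>
  </body>
</document>"""


def element_text(xml_source, tag):
    """Return the text of every element named `tag`, with entities resolved."""
    root = ET.fromstring(xml_source)
    return [elem.text or "" for elem in root.iter(tag)]


for chunk in element_text(SAMPLE, "analysis"):
    print(chunk)
```

Because the parser resolves the entities, the printed text contains literal angle brackets regardless of how the file stores them.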

giellalt / bugzilla-dummy

Analysis output in XML-files with meta-infos (Bugzilla Bug 1802) #163
