giellalt / bugzilla-dummy

text_cat throws away xml markup within paragraphs (Bugzilla Bug 821) #97

Closed albbas closed 14 years ago

albbas commented 14 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 821

Date: 2010-01-26T13:24:19+01:00 From: Sjur Nørstebø Moshagen <> To: Ciprian Gerstenberger <> CC: borre.gaup, saara.huhmarniemi, trond.trosterud

Last updated: 2010-02-12T08:17:03+01:00

albbas commented 14 years ago

Comment 3259

Date: 2010-01-26 13:24:19 +0100 From: Sjur Nørstebø Moshagen <>

The error markup conversion from eror§error to <error correct="error">eror</error> is done in a post-processing step in the Perl script. This step contains a bug: paragraphs that contain child elements, such as when you get a quote in the middle of a sentence, are skipped when the error markup is converted.

That is, all spelling (and other) errors in such paragraphs are lost during the conversion.

Example:

The following text contains four errors of different types, which are correctly converted to XML with the proper tags:

Skåvlån hæhttuji juohkka akta sierra skåvllåbiktasijt adnet. Eskilin le tjáhppis båvså, tjáhppis jali bieddjis skirtto ja alek slipsa£(slippsa). Næjtsojn li sæmmi skåvllåbiktasa, valla sij máhtti aj vuolpov£vuolpojn gárvvunit. Skåvlån le riek garra njuolgadusá ma dasi guosská£(guosski). Ietján li njuolgadusá buorre, javllá Eskil. Skotlándan§Skottlándan e åvdepnamájt ane.

By adding quotes around a part of this text, a span element is introduced to capture the quote, which then turns the paragraph into "mixed content" (both pure text and XML elements). The result is as follows:

<p>

Skåvlån hæhttuji juohkka akta sierra skåvllåbiktasijt adnet. Eskilin le tjáhppis båvså, tjáhppis jali bieddjis skirtto ja alek slipsa. Næjtsojn li sæmmi skåvllåbiktasa, valla sij máhtti aj vuolpov gárvvunit. Skåvlån le riek “garra njuolgadusá” ma dasi guosská. Ietján li njuolgadusá buorre, javllá Eskil. Skotlándan e åvdepnamájt ane.

The quote is captured, but not the error markup.

albbas commented 14 years ago

Comment 3261

Date: 2010-01-26 16:43:12 +0100 From: Sjur Nørstebø Moshagen <>

Changed the subject of this bug after further investigation.

It turns out that the markup is there in one of the temporary files, but it is thrown out in the end if there are quotes within the paragraph. The problem is in the text_cat script, which does language guessing on paragraphs, and optionally on quotes.

When doing quotes (at least), it will take the text content of a paragraph and process it, and in effect remove all xml elements that were there. Whoops - the markup is gone!

The exact problem code seems to be line 198 in text_cat:

my $text = $para->text; # BUG! Removes error markup added

The XML::Twig documentation says:

"text

Return a string consisting of all the PCDATA and CDATA in an element, without any tags. The text is not XML-escaped: base entities such as & and < are not escaped."

What we need instead is an approach in which we traverse all text nodes AND element children (and their text nodes) of a paragraph, doing the wanted processing for each text node in turn.
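As a rough illustration of the difference, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is not the text_cat code and not XML::Twig: `flatten` plays the role of the `text` call above, and `map_text_nodes` is a hypothetical helper that visits each text node in turn while leaving the elements in place. The sample paragraph is made up from the eror/error example earlier in this report.

```python
import xml.etree.ElementTree as ET


def flatten(para):
    """Analogue of the problematic call: all text content, no tags."""
    return "".join(para.itertext())


def map_text_nodes(elem, fn):
    """Apply fn to every text node under elem, keeping all elements intact."""
    if elem.text:
        elem.text = fn(elem.text)
    for child in elem:
        map_text_nodes(child, fn)
        # In ElementTree, text following a child element is stored as its tail.
        if child.tail:
            child.tail = fn(child.tail)


# Hypothetical sample paragraph with one error element.
para = ET.fromstring('<p>One <error correct="error">eror</error> here.</p>')

flat = flatten(para)              # the markup is lost in this string
map_text_nodes(para, str.upper)   # str.upper stands in for the real processing
result = ET.tostring(para, encoding="unicode")

print(flat)
print(result)
```

The flattened string no longer carries the error element, while the traversal rewrites the text and keeps the `correct` attribute where it was.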

The following corpus document can be used to trigger the bug:

$CORPUS_HOME/prooftest/orig/smj/facta/Dan-le-danna-infonuorra.correct.doc

In this particular document, there are several quotes in English intermixed with the smj main text. The quotes are properly marked and detected by text_cat, and as such it is quite useful (we can automatically skip all the English fragments), but we lose all spelling error markup at the same time. Since this is a gold standard document with a lot of spelling errors, it is necessary to fix the conversion process.

albbas commented 14 years ago

Comment 3262

Date: 2010-01-26 17:07:14 +0100 From: Sjur Nørstebø Moshagen <>

To reproduce:

$ cd $CORPUS_HOME/prooftest/orig/smj/facta/
$ convert2xml.pl --test --nolog Dan-le-danna-infonuorra.correct.doc

$ cd $CORPUS_HOME/tmp/
$ diff Dan-le-danna-infonuorra.correct.doc.tmp1 Dan-le-danna-infonuorra.correct.doc.tmp0 | l

In the diff, search for 'Leith', and you should find the following interesting before/after snapshot:

< - Leith Academy:an li oahppe 12 jage gitta 18 jage gaskan, ja nav li aj ållo "smávmáná" váttsáldagájn, subtsastallá Eskil.

> - Leith Academy:an li oahppe 12 jage gitta 18 jage gaskan, ja nav li aj ållo "smávmáná" váttsáldagájn, subtsastallá Eskil.

(it is the second diff in the diff output).

In the upper text, there is an error tag, and quotes. In the lower text, the error markup is gone, and the quote is converted to a span. Since the quote is in the same language, no new language is introduced. But just the fact that it looked for something is enough to throw out the other markup.

In the output the spelling error now poses as a legitimate spelling, which of course is a problem in gold standard testing.

albbas commented 14 years ago

Comment 3264

Date: 2010-01-26 17:32:23 +0100 From: Sjur Nørstebø Moshagen <>

The first diff is probably more telling (although bigger and harder to comprehend):

< Eskil danen matematihkkaåhpådiddjev Miss Watson:in gåhttju. Åhpådiddje li oalle tjiehpe, ållagasj matematihkka- ja sebrudakfáhkaåhpådiddje. Suv skåvlån klássajn ælla guhtik klássaladna nav gåktu Vuonan. Danna li åhpådiddjijn guhtik ladnja, ja oahppe hæhttuji sirddet klássalanjáj milta, ja fágaj milta. Eskilin li njiella fága. "Psysical education" mij le lásjmudallam, "modern studies" mij le sebrudakfáhka, ja ieŋŋilsgiella ja matematihkka. Suv skåvlån li nuorap ja vuorrasap oahppe.

> Eskil danen matematihkkaåhpådiddjev Miss Watson:in gåhttju. Åhpådiddje li oalle tjiehpe, ållagasj matematihkka- ja sebrudakfáhkaåhpådiddje. Suv skåvlån klássajn ælla guhtik klássaladna nav gåktu Vuonan. Danna li åhpådiddjijn guhtik ladnja, ja oahppe hæhttuji sirddet klássalanjáj milta, ja fágaj milta. Eskilin li njiella fága. "Psysical education" mij le lásjmudallam, "modern studies" mij le sebrudakfáhka, ja ieŋŋilsgiella ja matematihkka. Suv skåvlån li nuorap ja vuorrasap oahppe.

7 spelling errors vs 2 English quotes. That is quite a number of spelling errors lost from the gold standard.

albbas commented 14 years ago

Comment 3268

Date: 2010-02-11 14:21:25 +0100 From: Saara Huhmarniemi <>

The text string was treated as if it did not contain any tags, as suspected. Now the error tags are taken into account and only the text between the tags is inspected. The text inside error tags is not given span-marking at all. Some more testing is perhaps required. The new version is in svn (so not yet installed in the official place).
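The idea of inspecting only the text between the error tags can be sketched as follows. This is a Python illustration using xml.etree.ElementTree, not the Perl change that was committed to svn. The helper name `text_outside` and the sample paragraph are assumptions made for the example; the element names `error` and `span` are taken from this thread.

```python
import xml.etree.ElementTree as ET


def text_outside(elem, skip_tag="error"):
    """Yield the text nodes under elem, skipping anything inside skip_tag elements."""
    if elem.tag == skip_tag:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from text_outside(child, skip_tag)
        # The tail follows the child's closing tag, so it is outside the child.
        if child.tail:
            yield child.tail


# Hypothetical sample: one error element and one already-marked quote.
para = ET.fromstring(
    '<p>One <error correct="error">eror</error> here, '
    '<span>and a quote</span>.</p>'
)

chunks = list(text_outside(para))
print(chunks)
```

Only the yielded chunks would be handed to the language guesser; the misspelled word inside the error element is never inspected, and the element itself is left untouched.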

albbas commented 14 years ago

Comment 3270

Date: 2010-02-11 22:20:44 +0100 From: Sjur Nørstebø Moshagen <>

Thanks for the update, Saara.

I tried to run it locally, but there are too many hardcoded paths and dependencies for that to be practical. Is there an easy way of testing the updated code short of installing it?

albbas commented 14 years ago

Comment 3271

Date: 2010-02-11 23:00:47 +0100 From: Sjur Nørstebø Moshagen <>

I finally got it to run after some more path changes :)

I got an error message from the text_cat module, saying:

Argument "\x{20}\x{20}..." isn't numeric in division (/) at /Users/sjur/langtech/main/gt/script/text_cat line 577.
Argument "\x{20}\x{20}..." isn't numeric in division (/) at /Users/sjur/langtech/main/gt/script/text_cat line 577.
Argument "\x{20}\x{20}..." isn't numeric in division (/) at /Users/sjur/langtech/main/gt/script/text_cat line 577.
Argument "\x{20}\x{20}..." isn't numeric in division (/) at /Users/sjur/langtech/main/gt/script/text_cat line 577.

Line 577 in that script is:

my $increment = $maxw/$maxw_lang;

I call the conversion as follows:

$ cd gt/
$ convert2xml.pl --nolog --corpdir=. Dan-le-danna-infonuorra.correct.doc

(all XSL and script files are local files from the same directory or the script directory).

albbas commented 14 years ago

Comment 3272

Date: 2010-02-12 05:17:18 +0100 From: Saara Huhmarniemi <>

I am not sure about the error messages, but they shouldn't be related to the bug. Test text_cat locally:

Convert the file as usual in $CORPUS_HOME. First copy the file with the .tmp0 (or .tmp1) ending, since text_cat overwrites the file.

cp $CORPUS_HOME/tmp/Dan-le-danna-infonuorra.correct.doc.tmp0 .
gt/script/text_cat -q -x -d $CORPUS_HOME/bin/LM Dan-le-danna-infonuorra.correct.doc.tmp0

Then diff with original.

albbas commented 14 years ago

Comment 3273

Date: 2010-02-12 08:17:03 +0100 From: Sjur Nørstebø Moshagen <>

I tried to follow your instructions for running the script locally, but the .tmp0 and .tmp1 files were deleted before I could copy them.

Anyway, by comparing the local version I made (cf Comment 6) with the one made by the regular script, I got to check the result of the updated script, and it looks good, despite the error message. The only difference from the old output (besides the fixed correct markup conversion) is that earlier there was a newline after each opening tag; now there isn't.

I consider this bug fixed.