giellalt / bugzilla-dummy


sum-cg.pl does not tolerate empty lines (Bugzilla Bug 982) #320

Closed albbas closed 11 years ago

albbas commented 13 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 982

Date: 2011-04-21T11:17:21+02:00 From: Trond Trosterud <> To: Ciprian Gerstenberger <> CC: ciprian.gerstenberger, sjur.n.moshagen, trond.trosterud

Last updated: 2012-10-30T23:06:50+01:00

albbas commented 13 years ago

Comment 3852

Date: 2011-04-21 11:17:21 +0200 From: Trond Trosterud <>

sum-cg.pl is a really nice Perl script that Saara wrote to measure homonymy in CG output. It works perfectly, except that it does not tolerate empty lines between cohorts. So:

This is bad for sum-cg.pl:

""
    "sojjehtidh" V TV Inf @-FMAINV
"<.>"
    "." CLB

"<Dïejvesedåehkie>"
    "dïejvese#dåehkie" N Sg Nom
""
    "lea" V Ind Prs Sg3 @+FAUXV

whereas this, without the empty line after CLB, is good:

""
    "sojjehtidh" V TV Inf @-FMAINV
"<.>"
    "." CLB
"<Dïejvesedåehkie>"
    "dïejvese#dåehkie" N Sg Nom
""
    "lea" V Ind Prs Sg3 @+FAUXV

With well-behaved input it gives sensible results; with input containing empty lines it prints "Line not recognized: ," over and over.

It seems the only time there are empty lines in CG input is just after CLB. For now, I cope with this by running grep -v '^$' first. But I always forget, and only then do the grep -v trick. Anyone wanting to improve their Perl skills may thus give this a go: change sum-cg.pl so that it tolerates empty lines.
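The grep -v workaround can be wrapped in a small shell function so that it is harder to forget. This is only a sketch: the function name strip_blank_lines is hypothetical, the pattern uses [[:space:]] so that lines holding only spaces or tabs are dropped as well (slightly more than the plain '^$'), and the wordform "<lea>" in the sample input is assumed, since the wordform lines are not preserved in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of sum-cg.pl): drop empty and
# whitespace-only lines from CG output read on standard input.
strip_blank_lines() {
    grep -v '^[[:space:]]*$'
}

# Small demonstration: two cohorts with an empty line after the CLB
# reading. The empty line is removed, everything else passes through.
printf '"<.>"\n\t"." CLB\n\n"<lea>"\n\t"lea" V Ind Prs Sg3 @+FAUXV\n' \
    | strip_blank_lines
```

The filtered output can be redirected to a file and that file given to sum-cg.pl in the usual way.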

albbas commented 13 years ago

Comment 4407

Date: 2011-06-06 14:45:31 +0200 From: Sjur Nørstebø Moshagen <>

This isn't really my area, so I'm passing it on (not that I'm sure it is Ciprian's bug either).

albbas commented 13 years ago

Comment 4408

Date: 2011-06-06 15:23:18 +0200 From: Ciprian Gerstenberger <>

What was Trond's favourite proverb in German? Ah yes, pearls ("Perle") for the swine... Ergo: Perl for me.

Actually it should read: "Perlen vor die Säue werfen" (to cast pearls before swine). "Do not give what is holy to the dogs, and do not cast your pearls before swine, lest they trample them under their feet and turn and tear you to pieces." So it says in the New Testament, Matthew, chapter 7, verse 6 (the Sermon on the Mount).

(In reply to comment #1)

This isn't really my area, so I'm passing it on (not that I'm sure it is Ciprian's bug either).

albbas commented 11 years ago

Comment 7203

Date: 2012-10-29 11:31:27 +0100 From: Trond Trosterud <>

It would be useful to get this one fixed now. Ritva needs it.

albbas commented 11 years ago

Comment 7220

Date: 2012-10-29 21:57:12 +0100 From: Ciprian Gerstenberger <>

I just tried to reproduce the bug with the two different inputs (see above), but I can't: the script does not output any error even when the input is "bad".

The relevant line in the code is:

    next if ($line =~ /^\s*$/);

Please give me some more hints on how you have used the script.

(In reply to comment #3)

It would be useful to get this one fixed now. Ritva needs it.

albbas commented 11 years ago

Comment 7222

Date: 2012-10-29 23:08:13 +0100 From: Trond Trosterud <>

I might have been a bit too implicit in my first report. Here goes:

To reproduce: log in on the xserve and run

cd ../hoavda/Public/corp/analyzed/
sum-cg.pl --words analysed/2012-10-26/sme-laws.ccat.txt.dis > testresult1

... and watch lines like "Line not recognized: ," being printed for the next several minutes. It might have given good results in the end, I forget, but it simply takes too long to wait.

Or, alternatively, the dirty grep -v fix:

corp$cat analysed/2012-10-26/sme-laws.ccat.txt.dis | grep -v '^$' > testfile
corp$sum-cg.pl --words testfile > testresult2
Processing file: testfile
corp$head testresult2
191 "" N ABBR Gen @>Num N ABBR Nom @>Num "nr" N ABBR Gen @>Num "nr" N ABBR Nom @>Num
92 "<stáhta>" N Sg Gen @>N Org Plc N Sg Gen @>N "stáhta" N Sg Gen @>N "stáhta" Org Plc N Sg Gen @>N

albbas commented 11 years ago

Comment 7223

Date: 2012-10-29 23:43:01 +0100 From: Ciprian Gerstenberger <>

Now we're talking business (apart from the path correction: the analyzed directory appears both in the cd command and in the sum-cg.pl command). Thanks.

(In reply to comment #5)

I might have been a bit too implicit in my first report. Here goes:

To reproduce: log in on the xserve and run

cd ../hoavda/Public/corp/analyzed/
sum-cg.pl --words analysed/2012-10-26/sme-laws.ccat.txt.dis > testresult1

albbas commented 11 years ago

Comment 7228

Date: 2012-10-30 11:32:15 +0100 From: Ciprian Gerstenberger <>

Debugged, corrected, tested:

On xserve:

cd ../hoavda/Public/corp/analysed/
sum-cg.pl --words 2012-10-26/sme-laws.ccat.txt.dis

The last lines of output are:

1 "" N Pl Acc @<OBJ Hum N NomAg Pl Acc @<OBJ "oahcci" N Pl Acc @<OBJ "ohcci" Hum N NomAg Pl Acc @<OBJ
1 "" N Prop Sur Sg Gen @>N N Prop Sur Sg Nom @SUBJ> "Mikkelsen" N Prop Sur Sg Gen @>N "Mikkelsen" N Prop Sur Sg Nom @SUBJ>
1 "" Time N Sg Acc @OBJ> Time N Sg Acc @OBJ> "rehketdoallo#jahki" Time N Sg Acc @OBJ> "rehket#doallojahki" Time N Sg Acc @OBJ>

I will close this bug!

albbas commented 11 years ago

Comment 7238

Date: 2012-10-30 21:44:42 +0100 From: Trond Trosterud <>

Let me give this a second thought. Here is output:

cgdev$sum-cg.pl --words ../analysed/2012-10-26/sme-news.ccat.txt.dis > hom_news
Processing file: ../analysed/2012-10-26/sme-news.ccat.txt.dis
Line not recognized: , jahk

Line not recognized: , jahk

Line not recognized: , term

Line not recognized: , term

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

Line not recognized: , Jon+

cgdev$

The output is ok, though:

cgdev$head hom_news
2764 "" N Prop Fem Sg Attr @>N N Prop Fem Sg Attr @>N "Sara" N Prop Fem Sg Attr @>N "Sarak" N Prop Fem Sg Attr @>N
2691 "<Sámediggi>" Org Build N Sg Nom @SUBJ> N Prop Org Sg Nom @SUBJ> "sámediggi" Org Build N Sg Nom @SUBJ> "Sámediggi" N Prop Org Sg Nom @SUBJ>
2470 "" Pron Indef Pl Acc @<OBJ Pron Indef Sg Acc @<OBJ
cgdev$

So, if you can see why, then fix it; if not, I can probably live with the Jon+, since this time I at least get an output.

albbas commented 11 years ago

Comment 7239

Date: 2012-10-30 22:34:11 +0100 From: Ciprian Gerstenberger <>

Trond, this is NOT a bug, this is how the script works. If the input line doesn't match a certain pattern (line 178 in the script), it is not recognized as having a base form and an analysis:

        if($line =~ /(\".?\")(\s+.)$/) {
        $base = $1;
        my $analysis = $2;

To quote my boss: This is not a bug but a feature! And I think this should be so. I will try to output the line number of the input file so that one can trace it and check the quality of the input.

(In reply to comment #8)

Let me give this a second thought. Here is output:

So, if you can see why, then fix it; if not, I can probably live with the Jon+, since this time I at least get an output.

albbas commented 11 years ago

Comment 7240

Date: 2012-10-30 22:50:40 +0100 From: Trond Trosterud <>

You are right. I didn't have the imagination for this, but it really is so:

cgdev$cat ../analysed/2012-10-29/sme-news.ccat.txt.dis|grep 'Jon+'|wc -l
36

We'll close it again.
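The count above relies on grep treating + as a literal character, which is the case in POSIX basic regular expressions. A slightly more explicit sketch uses grep -F (fixed string) together with -c (count matching lines), so neither cat nor wc is needed. The function name count_lines_with and the sample lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the lines on standard input that contain
# a literal string.
#   -F  treat the pattern as a fixed string, so "+" is never an operator
#   -c  print the number of matching lines instead of the lines
#   -e  protect patterns that start with "-"
count_lines_with() {
    grep -cF -e "$1"
}

# Demonstration on made-up input: two of the three lines contain "Jon+".
printf 'Jon+ one\nJonas two\nJon+ three\n' | count_lines_with 'Jon+'   # prints 2
```

For a file, the input would simply be redirected, for example count_lines_with 'Jon+' < some-file.dis (the file name is a placeholder).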

albbas commented 11 years ago

Comment 7241

Date: 2012-10-30 23:06:50 +0100 From: Ciprian Gerstenberger <>

I changed the script so that one gets the line number in the input:

Processing file: sme-news.ccat.txt.dis
Line 9334528 not recognized: , jahk

Line 9334529 not recognized: , jahk

Line 13587512 not recognized: , term

Line 13587513 not recognized: , term

Line 19803604 not recognized: , Jon+

Line 19803605 not recognized: , Jon+

By the way, I think somebody put some comments in the dis-file; otherwise I can't explain the format of those lines.
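The idea of reporting the input line number can also be tried outside the Perl script, for a quick check of a dis-file. The awk filter below is an illustration under stated assumptions, not the logic of sum-cg.pl: it treats a line as recognized if it is blank, starts with a wordform ("<), or consists of leading whitespace, a quoted base form, and at least one tag. The function name report_unrecognized and the sample input are hypothetical; the message wording copies the output shown above.

```shell
#!/bin/sh
# Hypothetical checker (an assumption, not the sum-cg.pl implementation):
# print the line number of every input line that does not look like
# CG output.
report_unrecognized() {
    awk '
        /^[ \t]*$/ { next }                     # blank line
        /^"</ { next }                          # wordform line: "<...>"
        /^[ \t]+"[^"]*"[ \t]+[^ \t]/ { next }   # reading: "base" plus tags
        { printf "Line %d not recognized: %s\n", NR, $0 }
    '
}

# Demonstration: line 4 is a stray fragment, the rest is well-formed.
printf '"<.>"\n\t"." CLB\n\n, Jon+\n\t"lea" V Ind Prs Sg3 @+FAUXV\n' \
    | report_unrecognized
# prints: Line 4 not recognized: , Jon+
```

Because awk counts every input line in NR, including the blank ones it skips, the reported number can be used directly to jump to the offending line in an editor.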

(In reply to comment #10)

You are right. I didn't have the imagination for this, but it really is so:

cgdev$cat ../analysed/2012-10-29/sme-news.ccat.txt.dis|grep 'Jon+'|wc -l
36

We'll close it again.