Closed albbas closed 11 years ago
Date: 2011-04-21 11:17:21 +0200
From: Trond Trosterud
sum-cg.pl is a really nice Perl script that Saara wrote to measure homonymy in CG output. It works perfectly, with one exception: it does not tolerate empty lines between cohorts. So:
This is bad for sum-cg.pl:
"
"<Dïejvesedåehkie>"
"dïejvese#dåehkie" N Sg Nom
"
whereas this, without the empty line after CLB, is good:
"
With well-behaved input it gives sensible results; with input containing empty lines it prints Line not recognized: , Line not recognized: , Line not recognized: , ... and so on.
It seems the only time there are empty lines in CG input is just after CLB. To cope with this I run grep -v '^$' first. But I always forget, and only then remember the grep -v trick. Anyone wanting to improve their Perl skills may thus give this a go: change sum-cg.pl so that it tolerates empty lines.
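The requested change amounts to ignoring blank lines while cohorts are read. A minimal sketch in Python (not the Perl of sum-cg.pl itself, whose reading loop is not shown in this report): the function name read_cohorts is hypothetical, the second cohort in the sample is made up for illustration, and the sketch assumes a cohort is a "<wordform>" line followed by its reading lines.

```python
def read_cohorts(lines):
    """Group CG lines into (wordform, readings) pairs, ignoring blank lines."""
    cohorts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Like grep -v '^$', except that whitespace-only lines are dropped too.
            continue
        if line.startswith('"<') and line.endswith('>"'):
            # A new cohort starts with the quoted, angle-bracketed word form.
            cohorts.append((line, []))
        elif cohorts:
            # Any other non-blank line is taken as a reading of the current cohort.
            cohorts[-1][1].append(line)
    return cohorts


# First cohort is from the report; the CLB cohort is an invented example.
sample = [
    '"<Dïejvesedåehkie>"',
    '\t"dïejvese#dåehkie" N Sg Nom',
    '',
    '"<.>"',
    '\t"." CLB',
    '',
]
cohorts = read_cohorts(sample)
readings_per_cohort = [len(readings) for _, readings in cohorts]
```

With blank lines skipped at this stage, the empty line after CLB never reaches the pattern matching, so no manual grep -v step would be needed.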
Date: 2011-06-06 14:45:31 +0200
From: Sjur Nørstebø Moshagen
This isn't really my area, so I'm passing it on (not that I'm sure it is Ciprian's bug either).
Date: 2011-06-06 15:23:18 +0200
From: Ciprian Gerstenberger
What was Trond's favourite German proverb again? Ah yes, pearls before swine... Ergo: Perl for me.
(In reply to comment #1)
Date: 2012-10-29 11:31:27 +0100
From: Trond Trosterud
It would be useful to get this one fixed now. Ritva needs it.
Date: 2012-10-29 21:57:12 +0100
From: Ciprian Gerstenberger
I just tried to reproduce the bug with the two different inputs (see above), but I can't: the script does not output any error when the input is "bad".
The relevant line in the code is:
Please give me some more hints on how you have used the script.
(In reply to comment #3)
Date: 2012-10-29 23:08:13 +0100
From: Trond Trosterud
I might have been a bit too implicit in my first report. Here goes:
To reproduce: log in on the xserve and run
cd ../hoavda/Public/corp/analyzed/
sum-cg.pl --words analysed/2012-10-26/sme-laws.ccat.txt.dis > testresult1
... and watch lines like Line not recognized: , Line not recognized: , Line not recognized: , being printed for the next several minutes. It might have given good results in the end, I forget, but it simply takes too long to wait.
Or, alternatively, the dirty grep -v fix:
corp$cat analysed/2012-10-26/sme-laws.ccat.txt.dis | grep -v '^$' > testfile
corp$sum-cg.pl --words testfile > testresult2
Processing file: testfile
corp$head testresult2
191
"
Date: 2012-10-29 23:43:01 +0100
From: Ciprian Gerstenberger
Now we're talking (apart from the path correction: the analyzed directory appears in both the cd command and the sum-cg.pl command). Thanks.
(In reply to comment #5)
Date: 2012-10-30 11:32:15 +0100
From: Ciprian Gerstenberger
Debugged, corrected, tested on xserve:
cd ../hoavda/Public/corp/analysed/
sum-cg.pl --words 2012-10-26/sme-laws.ccat.txt.dis
the last lines of output are:
1
"
I will close this bug!
Date: 2012-10-30 21:44:42 +0100
From: Trond Trosterud
Let me give this a second thought. Here is the output:
cgdev$sum-cg.pl --words ../analysed/2012-10-26/sme-news.ccat.txt.dis > hom_news
Processing file: ../analysed/2012-10-26/sme-news.ccat.txt.dis
Line not recognized: , jahk
Line not recognized: , jahk
Line not recognized: , term
Line not recognized: , term
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
Line not recognized: , Jon+
cgdev$
The output is ok, though:
cgdev$head hom_news
2764
"
So, if you can see why, then fix it; if not, I can probably live with the Jon+, since this time I at least get an output.
Date: 2012-10-30 22:34:11 +0100
From: Ciprian Gerstenberger
Trond, this is NOT a bug; this is how the script works. If an input line does not match a certain pattern (line 178 in the script), it is not recognized as having a base form and an analysis:
if($line =~ /(\".*?\")(\s+.*)$/) {
$base = $1;
my $analysis = $2;
To quote my boss: This is not a bug but a feature! And I think this should be so. I will try to output the line number of the input file so that one can trace it and check the quality of the input.
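To illustrate why lines such as Jon+ or an empty line fall through, here is a small Python sketch of the same kind of test. It assumes the pattern is meant to be a quoted base form followed by whitespace and the analysis, i.e. (".*?")(\s+.*)$; the name split_reading is hypothetical and this is not the script's own code.

```python
import re

# Assumed Python counterpart of the Perl pattern: a quoted base form,
# then whitespace and the rest of the line as the analysis.
READING = re.compile(r'(".*?")(\s+.*)$')


def split_reading(line):
    """Return (base, analysis) for a reading line, or None if the line does not match."""
    match = READING.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_reading('\t"dïejvese#dåehkie" N Sg Nom'))  # matches: base form plus tags
print(split_reading('Jon+'))                            # no quoted base form
print(split_reading(''))                                # empty line
```

Under this assumption a line without a quoted base form can never match, which is consistent with the messages reported for Jon+, term and jahk.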
(In reply to comment #8)
Date: 2012-10-30 22:50:40 +0100
From: Trond Trosterud
You are right. I did not have the imagination for this, but it really is so:
cgdev$cat ../analysed/2012-10-29/sme-news.ccat.txt.dis|grep 'Jon+'|wc -l
36
We'll close it again.
Date: 2012-10-30 23:06:50 +0100
From: Ciprian Gerstenberger
I changed the script so that one gets the line number in the input:
Processing file: sme-news.ccat.txt.dis
Line 9334528 not recognized: , jahk
Line 9334529 not recognized: , jahk
Line 13587512 not recognized: , term
Line 13587513 not recognized: , term
Line 19803604 not recognized: , Jon+
Line 19803605 not recognized: , Jon+
By the way, I think somebody put some comments in the dis-file; otherwise I cannot explain the format of those lines.
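The line-number reporting can be sketched as follows, again in Python rather than the script's Perl. Everything here is illustrative: report_unrecognized and looks_like_cg are hypothetical names, the sample lines are invented, and the message format is simplified compared with the script's actual output.

```python
from collections import Counter


def report_unrecognized(lines, is_recognized):
    """Yield (line_number, text) for every line the predicate rejects (1-based)."""
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not is_recognized(text):
            yield number, text


def looks_like_cg(text):
    # Hypothetical stand-in for the script's real checks: a CG line
    # starts with a double quote, optionally after a tab.
    return text.startswith('"') or text.startswith('\t"')


# Invented sample input, for illustration only.
sample = ['"<Jon>"\n', '\t"Jon" N Prop Sg Nom\n', 'Jon+\n', 'Jon+\n', 'term\n']
problems = list(report_unrecognized(sample, looks_like_cg))
for number, text in problems:
    print(f"Line {number} not recognized: {text}")

# Tally the offending strings, much as grep ... | wc -l does for one string at a time.
counts = Counter(text for _, text in problems)
```

Counting the offending strings in one pass gives the same kind of check as the grep 'Jon+' | wc -l command earlier in the thread, but for all strings at once.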
(In reply to comment #10)
This issue was created automatically with bugzilla2github
Bugzilla Bug 982
Date: 2011-04-21T11:17:21+02:00
From: Trond Trosterud
To: Ciprian Gerstenberger
CC: ciprian.gerstenberger, sjur.n.moshagen, trond.trosterud
Last updated: 2012-10-30T23:06:50+01:00