giellalt / bugzilla-dummy


Malformed utf-8 somewhere stops corpus analysis (Bugzilla Bug 946) #257

Closed albbas closed 13 years ago

albbas commented 13 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 946

Date: 2011-02-19T09:33:03+01:00
From: Trond Trosterud <>
To: Børre Gaup <>
CC: ciprian.gerstenberger, sjur.n.moshagen, tomi.k.pieski, trond.trosterud

Last updated: 2011-05-02T12:42:49+02:00

albbas commented 13 years ago

Comment 3747

Date: 2011-02-19 09:33:03 +0100 From: Trond Trosterud <>

~/freecorpus$ccat -l sme -r converted/sme/ |preprocess --abbr=~/gtsvn/gt/sme/bin/abbr.txt |usme|lookup2cg|vislcg3 -g ~/gtsvn/gt/sme/bin/sme-dis.bin > ~/gtsvn/gt/sme/dev/analyse/free.1491.dis
VISL CG-3 Disambiguator version 0.9.7.6599
Codepage: default UTF-8, input UTF-8, output UTF-8, grammar UTF-8
Info: Binary grammar detected.
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
Parsing grammar took 0.268409 seconds.
Grammar has 17 sections, 0 templates, 3527 rules, 3930 sets, 951 c-tags, 8719 s-tags.
34 rules cannot be skipped by index.
Malformed UTF-8 character (fatal) at /Users/trond/gtsvn/gt/script/preprocess line 162, <> line 212692.
~/freecorpus$

albbas commented 13 years ago

Comment 3748

Date: 2011-02-19 09:37:35 +0100 From: Trond Trosterud <>

The freecorpus contains about 7.7 million words; the analysis halted at some 6.2 million.

~/freecorpus$ccat -l sme -r converted/sme/ | wc -w
7733158

~/freecorpus$cat ~/gtsvn/gt/sme/dev/analyse/free.1491.dis|grep '^\"'|wc -l
6246523

albbas commented 13 years ago

Comment 3861

Date: 2011-04-22 16:51:45 +0200 From: Trond Trosterud <>

Obsolete report.

albbas commented 13 years ago

Comment 3931

Date: 2011-04-26 09:11:21 +0200 From: Sjur Nørstebø Moshagen <>

It isn't obsolete. It is either fixed or not. Reopened until a test is provided documenting it is fixed.

Don't close a bug before discussing it with the assignee or other stakeholders, or providing tests that document that it has been fixed. Being old doesn't mean being irrelevant or obsolete.

albbas commented 13 years ago

Comment 3990

Date: 2011-04-30 09:44:10 +0200 From: Trond Trosterud <>

This bug has been marked as a duplicate of bug #969

albbas commented 13 years ago

Comment 3993

Date: 2011-04-30 09:57:08 +0200 From: Trond Trosterud <>

This bug has been marked as a duplicate of bug #878

albbas commented 13 years ago

Comment 4008

Date: 2011-05-02 12:42:49 +0200 From: Børre Gaup <>

This is fixed. convert2xml.pl has a check that guards against invalid UTF-8.
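
The check in convert2xml.pl itself is not shown in this thread. As an illustration only, a guard of that kind could look roughly like the Python sketch below. The function name `find_invalid_utf8_lines` is hypothetical and does not come from the gtsvn tree. It reads a byte stream line by line and reports each line that does not decode as UTF-8, which is also one way to track down an offending input line such as the one `preprocess` stopped on.

```python
from typing import BinaryIO, Iterator, Tuple


def find_invalid_utf8_lines(stream: BinaryIO) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, byte_offset) for every line that is not valid UTF-8.

    Hypothetical helper, not the actual convert2xml.pl check.
    Line numbers are 1-based; byte_offset is the position within the line
    of the first byte that could not be decoded.
    """
    # Splitting raw bytes on b"\n" is safe: the byte 0x0A never occurs
    # inside a UTF-8 multi-byte sequence.
    for line_number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")  # strict decoding raises on malformed input
        except UnicodeDecodeError as err:
            yield line_number, err.start
```

Passing it `sys.stdin.buffer` would let it sit in a pipeline ahead of the analysis tools, so that bad input is reported, with its line number, before it reaches them.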