jimregan / mlode

Automatically exported from code.google.com/p/mlode

xml:lang attributes contain whitespaces #9

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1.
wget http://www.glottolog.org/downloadarea/languoids.rdf.zip 
unzip languoids.rdf.zip
rapper -i rdfxml languoids.rdf -o ntriples > languoids.nt 
rapper -i ntriples -c languoids.nt 2> parseerror.log 

2. Inspecting the "junk" lines with vim, I find "@south africa" or "@central 
african republic". They are converted from the xml:lang attribute of 
skos:altLabel.

You can check it yourself with

cat languoids.nt | grep @south

What is the expected output? What do you see instead?
xml:lang attribute tags should conform to BCP 47 
(http://tools.ietf.org/rfc/bcp/bcp47.txt) 

Whitespace is not permitted in a language tag.

That's why languoids.nt has 31 fewer triples than languoids.rdf.
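
To list every affected triple rather than grepping for one country name at a time, something like the following might help. It is only a rough sketch: the function name `find_bad_lang_tags` is made up for illustration, and the pattern only detects whitespace inside the tag of a language-tagged literal at the end of an N-Triples line. It is not a full BCP 47 validator.

```shell
# Hypothetical helper (not part of rapper or Glottolog): print the lines of an
# N-Triples file whose language tag contains whitespace.
# The pattern looks for '"@', a run of non-space characters, whitespace, and
# then more tag text before the terminating '.' of the triple.
find_bad_lang_tags() {
    grep -E '"@[^"[:space:]]+[[:space:]]+[^".[:space:]][^"]*[[:space:]]*\.[[:space:]]*$' "$1"
}

# Usage: find_bad_lang_tags languoids.nt
```

A well-formed literal such as "Ndebele"@en is not reported, because the only thing following the tag is the closing dot.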

Original issue reported on code.google.com by der.brue...@googlemail.com on 16 Jul 2012 at 3:03

GoogleCodeExporter commented 9 years ago

Original comment by Johannes...@googlemail.com on 17 Jul 2012 at 1:05

GoogleCodeExporter commented 9 years ago
Set to priority low, as it only concerns 33 triples.

Original comment by kur...@googlemail.com on 19 Jul 2012 at 5:57

GoogleCodeExporter commented 9 years ago
The error stems from erroneously importing country names (such as South Africa) 
as xml:lang codes upstream. It surfaces only when a country name contains 
whitespace, but country names without whitespace are not compliant with BCP 47 either.
The fix checks whether the length of the code is either 2 or 3, and discards the 
xml:lang attribute otherwise. 
This works on individual hyperlect pages, 
e.g. http://glottolog.org/resource/hyperlect/id/43265.rdf (previously had 
xml:lang="West Papua, Indonesia"). It is not in the dump yet, as some more 
substantial db changes will be made this week before a new dump is produced.
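
For readers following along, the length check described above could be sketched roughly as below. This is an illustration only, not the actual Glottolog code: the function name `keep_lang_code` and the sample values are assumptions. Note that a pure length check does not confirm that a 2- or 3-character value is a registered language subtag.

```shell
# Hypothetical helper: print the code and succeed if it is 2 or 3 characters
# long; print nothing and fail otherwise, so the caller can drop xml:lang.
keep_lang_code() {
    code=$1
    case ${#code} in
        2|3) printf '%s\n' "$code" ;;
        *)   return 1 ;;
    esac
}

# Example: decide per value whether an xml:lang attribute would be emitted.
for code in en deu "South Africa" "West Papua, Indonesia"; do
    if lang=$(keep_lang_code "$code"); then
        echo "keep xml:lang=\"$lang\""
    else
        echo "drop xml:lang for: $code"
    fi
done
```

Returning a non-zero status for rejected values keeps the decision in the caller, which can then omit the attribute entirely instead of writing an empty one.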

Original comment by sebastia...@googlemail.com on 23 Jul 2012 at 12:25

GoogleCodeExporter commented 9 years ago

Original comment by sebastia...@googlemail.com on 15 Aug 2012 at 8:06

GoogleCodeExporter commented 9 years ago
please verify

Original comment by sebastia...@googlemail.com on 23 Aug 2012 at 1:55

GoogleCodeExporter commented 9 years ago
There are still white spaces in the language tag; just try:

wget http://www.glottolog.org/downloadarea/languoids.rdf.zip 
unzip languoids.rdf.zip
rapper -i rdfxml languoids.rdf -o ntriples > languoids.nt 
rapper -i ntriples -c languoids.nt 2> parseerror.log 
cat languoids.nt | grep @south

Result:
<http://glottolog.livingsources.org/resource/languoid/id/ndeb1236> 
<http://www.w3.org/2004/02/skos/core#altLabel> " Ndebele "@south africa .
<http://glottolog.livingsources.org/resource/languoid/id/ndeb1241> 
<http://www.w3.org/2004/02/skos/core#altLabel> " Ndebele "@south africa .
<http://glottolog.livingsources.org/resource/languoid/id/ndeb1240> 
<http://www.w3.org/2004/02/skos/core#altLabel> " Ndebele "@south africa .

Original comment by mohamedd...@gmail.com on 23 Aug 2012 at 2:26

GoogleCodeExporter commented 9 years ago
You really need to check who this bug is assigned to....
Fix needed by owner means sebastian should have it, but he is not even in cc....

Original comment by kur...@googlemail.com on 25 Aug 2012 at 7:45

GoogleCodeExporter commented 9 years ago
Please use the new n3 dump available at
http://www.glottolog.org/downloadarea/languoids.n3.tgz
The XML dump is outdated and obsolete.

Original comment by sebastia...@googlemail.com on 27 Aug 2012 at 9:14

GoogleCodeExporter commented 9 years ago

Original comment by sebastia...@googlemail.com on 27 Aug 2012 at 9:14

GoogleCodeExporter commented 9 years ago

Original comment by sebastia...@googlemail.com on 27 Aug 2012 at 11:13