mkroetzsch / wda

Several scripts to analyse Wikidata dumps

Syntax error in the extracted dump wda-export-data.py #1

Closed hadyelsahar closed 11 years ago

hadyelsahar commented 11 years ago

I use this command to extract the links dump: python wda-export-data.py -e turtle-links

Then I use this command to convert the links dump from Turtle to N-Triples format: rapper -i turtle turtle-20130811-links.ttl > turtle-20130811-links.nt

But the extracted dump appears to have a syntax error involving a '<' bracket, specifically at line 13250337:

rapper: Error - URI file:///root/hady_wikidata_extraction/wda/results/turtle-20130811-links.ttl:13250337 - syntax error at '<'


mkroetzsch commented 11 years ago

Thanks. Unfortunately, my machine does not have enough memory to run the rapper command on the whole file (it uses some 4.5 G and then dies; I have 8 G, but other applications also want to run ;-). I recently ran rapper on a few million initial lines without problems, but I will try again tomorrow (specifically for the links file). It could be that we have different files, since the generated dump depends on the options used and also on the time of creation (the online dumps change over time). Can you maybe extract the offending line from your file?
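For example, something along these lines should print the reported line together with its neighbours (a rough sketch, assuming standard sed and the file path from your error message):

sed -n '13250330,13250345p' /root/hady_wikidata_extraction/wda/results/turtle-20130811-links.ttl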

hadyelsahar commented 11 years ago

Thanks, Markus, for looking into this.

rapper now runs until it has extracted 7.5M lines in N-Triples format; then it crashes with a syntax error at exactly line 13250337, as shown in the error message: rapper: Error - URI file:///root/hady_wikidata_extraction/wda/results/turtle-20130811-links.ttl:13250337 - syntax error at '<'

hadyelsahar commented 11 years ago

This is the line that triggered the error, together with the surrounding lines:

<http://es.wikipedia.org/wiki/Freedom_%28canci%C3%B3n_de_Wham%21%29>
    a   so:Article ;
    so:about    w:Q1453506 ;
    so:inLanguage   "es" .

w:Q1453550
    a   wo:Item .

<http://de.wikipedia.org/wiki/FreedroidRPG>
    a   so:Article ;
    so:about    w:Q1453550 ;
    so:inLanguage   "de" . 

Line 13250337, the one that triggers the problem, is this one: <http://de.wikipedia.org/wiki/FreedroidRPG>

mkroetzsch commented 11 years ago

I don't see any problem with this line. Am I missing something? I also tried parsing the links on my machine and I did not encounter any problems with these lines (my dump is 20130808 and not 20130811 but all the lines you mention are contained in it as well). I don't see any error in this RDF so far.

My rapper is based on Raptor 2.0.6. Which version do you use?

hadyelsahar commented 11 years ago

Yeah, these lines themselves don't show any syntax error; I'd attribute it to (maybe) some sort of unbalanced brackets or similar in the preceding lines.

I'm using Raptor 2.0.6 as well on the extracted dump. I've tried it using the same command, rapper -i turtle filein > fileout, and it always fails with errors. Are you trying to parse the whole file, or just part of it? I'm using the "-links" version of the dump, so it wouldn't be the same line if you are using the full dump.

jcsahnwaldt commented 11 years ago

Maybe it's a bug in rapper. I guess you should try to narrow down the line that is actually causing the error. For example, use awk or some other bash tools to split the file: include the first few lines that define the prefixes, skip the following 13000000 lines, include the rest. A command like this should work:

gzip -d <turtle-20130808-statements.ttl.gz | awk '{if (NR<34 || NR>13000000) print}' | gzip -c >turtle-20130808-statements-part.ttl.gz

Of course, you'll first have to find a line that starts with a subject. 13000000 is probably in the middle of a block of triples for a subject (one downside of not using a one-triple-per-line format). I'm also not 100% sure if there may be statements that somehow depend on earlier statements.
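For instance, something like this should report the first block boundary (an empty line) after line 13000000, so you know where to cut (a rough sketch, assuming blocks are separated by blank lines as in the excerpt above):

gzip -d <turtle-20130808-statements.ttl.gz | awk 'NR>13000000 && NF==0 {print NR; exit}'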

jimkont commented 11 years ago

Usually a Turtle error reported by rapper actually lies in the statement right after the reported one. So if the error is reported at the line <http://de.wikipedia.org/wiki/FreedroidRPG>, you should check the statement that starts next.

Can you post these lines too?

hadyelsahar commented 11 years ago

@jcsahnwaldt: It seems to be working, but raptor then showed an "out of dynamic memory" exception. I'll open a thread on the mailing list about that issue; once we solve it, we can come back here and try it on the dump again.

@jimkont: I checked the following lines, but they don't contain any errors either.

jcsahnwaldt commented 11 years ago

Really looks like a rapper bug to me. It probably can't handle large files and doesn't always report the correct error. I guess the alleged syntax error is actually caused by a lack of memory that is not detected correctly.

These bug reports may be relevant:

http://bugs.librdf.org/mantis/view.php?id=512 TriG parser crash: out of dynamic memory in turtle_lexer__scan_bytes()

http://bugs.librdf.org/mantis/view.php?id=525 NTriplesParser aborts while opening large file

mkroetzsch commented 11 years ago

I also think that this might be the case, since I also get similar memory errors (but not the same on all files) when trying to load the whole dump on my machine. I will close this report now. Thanks all for investigating.

hadyelsahar commented 11 years ago

I've tried another Turtle-to-N-Triples parser, Serd (http://drobilla.net/software/serd/), and I was able to extract 100M language-link triples from the Turtle file.
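Something along these lines should do the conversion (a sketch; I'm not certain of the exact serdi flags for every version, so check the serdi man page, and the output file name here is just an example):

serdi -i turtle -o ntriples turtle-20130811-links.ttl > turtle-20130811-links.nt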

jcsahnwaldt commented 11 years ago

Cool! One question: 100 million links is a lot. I counted 30 million in the dump file Daniel prepared in June. Are you sure that's correct?

hadyelsahar commented 11 years ago

Yes, you are right. The Scala code for language-link extraction extracts 32M language links; I didn't use the right terms. I meant that 100M N-Triples were extracted from the wda links Turtle dump.
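A quick sanity check on the counts (assuming one triple per line in the N-Triples output, and that the so: prefix expands to http://schema.org/, which is a guess on my part):

wc -l turtle-20130811-links.nt
grep -c 'schema.org/about' turtle-20130811-links.nt

The first number counts all triples; the second counts roughly one triple per sitelink.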

mkroetzsch commented 11 years ago

Thanks for the pointer to Serd -- great tool. It still complains about the statements file, since it does not like the character 0x7 ("bell"), which is used on https://www.wikidata.org/wiki/Q815674; I'm not sure whether this is allowed in Turtle or not (the grammar of the candidate recommendation seems to allow it). I have contacted the author for advice.
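For anyone who wants to check for this locally, something like the following should list the first lines containing the bell character (a sketch, assuming the gzipped statements export and a bash shell for the $'\a' quoting):

gzip -d <turtle-20130808-statements.ttl.gz | LC_ALL=C grep -n $'\a' | head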