searope / jwpl

Automatically exported from code.google.com/p/jwpl

Some redirects are not regarded in DataMachine transformation process #1

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Working with JWPL 0.45b I encountered the following problem: 

I created my own Wikipedia dump using the German Wikipedia backup dump files of
November 11, 2009 from download.wikimedia.org. I followed the steps explained
on the JWPL documentation page and ran the transformation process using the new
DataMachine (Version 2), which was kindly provided to me by Mr. Zesch.

The creation of the SQL dump file was successful. I could retrieve pages and
process text, just as I could with the German Wikipedia SQL dump of 6 Feb 2007
provided on the JWPL homepage.

However, when comparing some results from the new Wikipedia dump with those
from the 2007 dump, I saw that certain redirects, but not all, were missing,
although they exist in the online version of Wikipedia. I first assumed a
database error, but the redirects were also absent from the output text files,
namely "page_redirects.txt". Further investigation in the online Wikipedia
showed that the error was systematic:

Whenever a redirect page contained a redirect link of the exact format "REDIRECT
[[...]]" (i.e., the redirect keyword in capital letters followed by a space),
the redirect appeared in the database.
Whenever the format was slightly different, the redirect was missing.

Examples:
Missing space: REDIRECT[[...]]
Not in capital letters: Redirect [[...]]
German key word: WEITERLEITUNG [[...]]
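
The three failing variants above suggest that a more tolerant redirect check would need to ignore case, allow missing whitespace, and accept the German keyword. The following is only a minimal sketch of such a check, not the DataMachine's actual parsing code: the class `RedirectMatcher` and the method `redirectTarget` are hypothetical names, and the optional leading `#` and optional `:` are assumptions based on MediaWiki's usual redirect syntax.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of JWPL. */
final class RedirectMatcher {

    // Assumed pattern: optional '#', the keyword REDIRECT or WEITERLEITUNG
    // in any letter case, optional ':' and whitespace, then "[[target".
    // The captured target ends at the first ']' or '|'.
    private static final Pattern REDIRECT = Pattern.compile(
            "^\\s*#?\\s*(?:REDIRECT|WEITERLEITUNG)\\s*:?\\s*\\[\\[([^\\]|]+)",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    private RedirectMatcher() {
    }

    /** Returns the redirect target if the page text starts with a redirect. */
    static Optional<String> redirectTarget(String wikiText) {
        if (wikiText == null) {
            return Optional.empty();
        }
        Matcher m = REDIRECT.matcher(wikiText);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "REDIRECT [[Berlin]]",       // exact format that already works
            "REDIRECT[[Berlin]]",        // missing space
            "Redirect [[Berlin]]",       // not in capital letters
            "WEITERLEITUNG [[Berlin]]",  // German keyword
            "Berlin is a city."          // ordinary page text
        };
        for (String s : samples) {
            System.out.println(s + " -> "
                    + redirectTarget(s).orElse("(no redirect)"));
        }
    }
}
```

Further localized keywords could be added to the alternation in the pattern; whether the DataMachine should read them from the dump's site information instead is left open here.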

I ran the DataMachine again, but the problem remained. Interestingly, the
problem does not occur in the Wikipedia SQL dump of 6 Feb 2007.

Kind regards
Stephan Strohmaier

Original issue reported on code.google.com by torsten....@gmail.com on 21 Sep 2010 at 4:05

GoogleCodeExporter commented 9 years ago

Original comment by torsten....@gmail.com on 21 Sep 2010 at 4:15

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 16 Feb 2012 at 1:24