Convert Wikipedia pages to HTML/XML

GoogleCodeExporter commented 8 years ago

Copied from:
https://groups.google.com/forum/?fromgroups#!topic/bliki/eBsfyHZ4xVY

I don't think I can use the dumps, because I want to process the article of 
the day, which is not available as a dump (for example 
"http://de.wikipedia.org:80/w/api.php?action=query&titles=Wikipedia:Hauptseite/Artikel_des_Tages/Montag&prop=revisions&rvprop=content&format=xml"). 
I would also like to avoid the overhead of using the filesystem and the Derby 
database for downloading and caching. I just need to parse and convert the 
response of the URL above, passed in as a string or stream.

Original issue reported on code.google.com by axelclk@gmail.com on 26 May 2012 at 2:29
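
The request above amounts to pulling the raw wiki text out of the API response before handing it to a converter. A minimal sketch of that first step, using only the JDK's XML parser; `RevisionTextExtractor` and `extractWikiText` are hypothetical names and not part of bliki:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of bliki): extracts raw wiki text from an api.php XML response. */
final class RevisionTextExtractor {

    /** Returns the text of the first <rev> element, or null if the response contains none. */
    static String extractWikiText(String apiResponseXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The API response carries no DTD, so doctype declarations can safely be refused.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(apiResponseXml)));
            NodeList revs = doc.getElementsByTagName("rev");
            return revs.getLength() == 0 ? null : revs.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><api><query><pages><page title=\"T\">"
                + "<revisions><rev xml:space=\"preserve\">'''Bold''' text</rev></revisions>"
                + "</page></pages></query></api>";
        System.out.println(extractWikiText(xml));
    }
}
```

Because the helper takes a plain string, the response can come from an HTTP call in production and from a fixture file in tests.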

GoogleCodeExporter commented 8 years ago

Committed r5313.

With this change the example call:

testWikipediaENAPI("Wikipedia:Hauptseite/Artikel_des_Tages/Montag", 
"http://de.wikipedia.org/w/api.php", Locale.GERMAN);

creates at least the HTML file and downloads the referenced image file.

To avoid caching the template files, you can copy (don't derive) your own wiki 
model from the APIWikiModel, override the getRawWikiContent() method, and 
eliminate the usage of the Derby database.
If a template name is requested in your getRawWikiContent() method, don't 
use fWikiDB.selectTopic() but your own static files. 
That way you should be able to eliminate the dependency on the Derby database.

Original comment by axelclk@gmail.com on 26 May 2012 at 2:42
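
The idea of serving templates from static content instead of the Derby cache can be sketched independently of bliki. `StaticTemplateSource` below is a hypothetical class; a copied wiki model would call something like `lookup()` from its own getRawWikiContent() override, whose exact signature depends on the bliki version. The title normalization assumes the usual Wikipedia configuration (underscores equal spaces, first letter case-insensitive).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory replacement for a database-backed template cache. */
final class StaticTemplateSource {

    private final Map<String, String> templates = new HashMap<>();

    /** Registers the raw wiki text of a template under its name. */
    void register(String templateName, String rawWikiText) {
        templates.put(normalize(templateName), rawWikiText);
    }

    /** Returns the raw wiki text for a template, or null if it was never registered. */
    String lookup(String templateName) {
        return templates.get(normalize(templateName));
    }

    // Assumption: titles are compared the way Wikipedia does by default,
    // i.e. '_' and ' ' are equivalent and the first letter is upper-cased.
    private static String normalize(String name) {
        String n = name.trim().replace('_', ' ');
        if (n.isEmpty()) {
            return n;
        }
        return Character.toUpperCase(n.charAt(0)) + n.substring(1);
    }
}
```

Returning null for unknown templates leaves it to the caller to decide whether a missing template is an error or simply renders as nothing.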

GoogleCodeExporter commented 8 years ago

Wow, thank you for the fast support. :-)

I have successfully eliminated the downloading of images and the usage of the 
Derby database by deriving my own wiki model as you suggested.

But the User object is still required so that the model itself can load the 
templates and the article via the URL. Do you have an idea how this can be 
eliminated? At the moment I parse the raw wiki text myself and let bliki only 
convert the article text. But my self-implemented parsing logic for raw wiki 
text is very complicated and cannot handle all situations, so I would like to 
use bliki for parsing the raw wiki text as well. For the current parsing and 
rendering I have a lot of unit tests which check the final rendering with 
various contents. These tests cannot be switched to the new bliki integration, 
because I can only call bliki with a URL and cannot inject the pre-loaded raw 
wiki text as a string. The unit tests should of course not load the text via 
the internet. That would not work anyway, because the content behind the URL of 
the article of the day changes weekly. ;-) So I need either a way to 
initialize the wiki model with pre-loaded raw wiki text as a string or 
InputStream, or a way to mock the remote call for loading the article. I 
have not found one yet. I tried to replace "List<Page> thePages = 
myUser.queryContent(thePageTitles)" within "getRawWikiContent" with 
"XMLPagesParser theParser = new XMLPagesParser(theRawWikiTextAsString); 
theParser.parse(); List<Page> thePages = theParser.getPagesList();", but could 
not get it working because this method is also used to load templates.

Thank you in advance

Regards,

Sven S.

Original comment by sven.strohschein@googlemail.com on 26 May 2012 at 10:11

Attachments:

APIWikiModelLite.java

GoogleCodeExporter commented 8 years ago

I committed r5377.

With these new methods:
DocumentCreator#renderToFile(String rawWikiText, String title, ITextConverter 
converter, String filename) throws IOException;

HTMLCreatorExample#testWikipediaText(String rawWikiText, String title, Locale 
locale);

you can render a wiki text snippet directly into a file.

This is a quick and dirty solution.
You should copy DocumentCreator to your own class and delete/refactor the 
things you don't need.

If possible please contribute back your finished solution, so that other users 
can also use your Creator and WikiModel classes.

Original comment by axelclk@gmail.com on 28 May 2012 at 9:48

GoogleCodeExporter commented 8 years ago

Hi,

it is almost done, and I will post it or provide a patch when it is ready. One 
thing regarding the article image is strange. The example at the bottom 
contains the image reference "Datei:Nyatapole2.jpg", but when I convert it 
to HTML with bliki, the result is "Datei:116px-Nyatapole2.jpg". The image size 
is prepended to the filename, which isn't correct. The actual image can be 
found via 
"http://de.wikipedia.org/w/api.php?action=query&titles=Datei:Nyatapole2.jpg&prop=imageinfo&iiprop=url&format=xml", 
but not with the bliki-modified image name: 
"http://de.wikipedia.org/w/api.php?action=query&titles=Datei:116px-Nyatapole2.jpg&prop=imageinfo&iiprop=url&format=xml".

Do you have an idea why this is happening and how it can be avoided?

Example

<?xml version="1.0"?><api><query><normalized><n 
from="Wikipedia:Hauptseite/Artikel_des_Tages/Donnerstag" 
to="Wikipedia:Hauptseite/Artikel des 
Tages/Donnerstag"/></normalized><pages><page pageid="964888" ns="4" 
title="Wikipedia:Hauptseite/Artikel des Tages/Donnerstag"><revisions><rev 
xml:space="preserve">{{Shortcut|WP:ADTDO}}{{Wikipedia:Hauptseite/Artikel des 
Tages/Bearbeitungshinweise}}
<onlyinclude> {{AdT-Vorschlag
| DATUM = 28.07.2011
| LEMMA = Bhaktapur
| BILD = Datei:Nyatapole2.jpg 
| BILDBESCHREIBUNG = Nyata-Tempel, 1708 erbaut, dreißig Meter hoch und der 
hinduistischen Gottheit Lakshmi geweiht 
| BILDGROESSE = 116px 
| BILDUMRANDUNG = 
| TEASERTEXT = '''[[Bhaktapur]]''' (nepali भक्तपुर ‚Stadt der Frommen‘) 
oder ''Khwopa'' (newari ख्वप ''Khvapa'') ist neben Kathmandu und Lalitpur mit 
über 78.000 Einwohnern die dritte und kleinste der Königsstädte des 
Kathmandutals in Nepal. Bhaktapur liegt am Fluss Hanumante und wie Kathmandu an 
einer alten Handelsroute nach Tibet, was für den Reichtum der Stadt 
verantwortlich war. Das Bild der Stadt wird bestimmt von der Landwirtschaft, 
der Töpferkunst und besonders von einer lebendigen traditionellen 
Musikerszene. Wegen seiner über 150 Musik- und 100 Kulturgruppen wird 
Bhaktapur als Hauptstadt der darstellenden Künste Nepals bezeichnet. Die 
Einwohner von Bhaktapur gehören ethnisch zu den Newar und zeichnen sich durch 
einen hohen Anteil von 60 Prozent an Bauern der Jyapu-Kaste aus. Die Bewohner 
sind zu fast 90 Prozent Hindus und zu zehn Prozent Buddhisten. Vom 14. 
Jahrhundert bis zur zweiten Hälfte des 18. Jahrhunderts war Bhaktapur 
Hauptstadt des Malla-Reiches. Aus dieser Zeit stammen viele der 172 
Tempelanlagen, der 32 künstlichen Teiche und der mit Holzreliefs verzierten 
Wohnhäuser. Zwar verursachte ein großes Erdbeben 1934 viele Schäden an den 
Gebäuden, doch konnten diese wieder so instand gesetzt werden, dass Bhaktapurs 
architektonisches Erbe bereits seit 1979 auf der UNESCO-Liste des 
Weltkulturerbes steht.
}} </onlyinclude>
[[Kategorie:Wikipedia:Hauptseite/Artikel des 
Tages|Donnerstag]]</rev></revisions></page></pages></query></api>

Original comment by sven.strohschein@googlemail.com on 31 May 2012 at 6:52

GoogleCodeExporter commented 8 years ago

I'm appending the width with the "iiurlwidth" parameter like this:
http://de.wikipedia.org/w/api.php?action=query&titles=Datei:Nyatapole2.jpg&prop=imageinfo&iiprop=url&format=xml&iiurlwidth=116

See the example I've committed: r5528

See the info.bliki.wiki.impl.APIWikiModel#appendInternalImageLink() method for 
details:
http://code.google.com/p/gwtwiki/source/browse/trunk/info.bliki.wiki/bliki-pdf/src/main/java/info/bliki/wiki/impl/APIWikiModel.java

Original comment by axelclk@gmail.com on 7 Jun 2012 at 7:02
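
For readers who want to reproduce the request shown above, here is a minimal sketch of building the imageinfo URL with the width passed separately via "iiurlwidth". `ImageInfoUrl` is a hypothetical helper and not taken from APIWikiModel; only the query parameters come from the URL in this comment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds an api.php imageinfo query with a requested thumbnail width. */
final class ImageInfoUrl {

    static String build(String apiUrl, String imageTitle, int widthPx) {
        try {
            // The width travels as its own parameter; the title stays the original file name.
            return apiUrl + "?action=query"
                    + "&titles=" + URLEncoder.encode(imageTitle, "UTF-8")
                    + "&prop=imageinfo&iiprop=url&format=xml"
                    + "&iiurlwidth=" + widthPx;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://de.wikipedia.org/w/api.php", "Datei:Nyatapole2.jpg", 116));
    }
}
```

Note that URLEncoder turns the colon in the title into "%3A", so the generated URL differs textually from the hand-written one while addressing the same page.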

GoogleCodeExporter commented 8 years ago

Hm, I tried to override appendInternalImageLink, but the parameters passed to 
it already contain the "extended" image filename. Therefore 
appendInternalImageLink cannot be what adds the size prefix.

hrefImageLink = "Datei:116px-Nyatapole2.jpg"
srcImageLink = "116px-Nyatapole2.jpg"

Original comment by sven.strohschein@googlemail.com on 7 Jun 2012 at 9:58
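
One possible workaround for the values above is to strip the width prefix again before querying the API. This is only a sketch under the assumption that the prefix always has the form "<digits>px-"; `ThumbnailNames` is a hypothetical class and not the fix from the later patches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: recovers the original image name from a width-prefixed one. */
final class ThumbnailNames {

    // Optional namespace ("Datei:"), then the "<digits>px-" prefix, then the real file name.
    private static final Pattern WIDTH_PREFIX = Pattern.compile("^((?:[^:]+:)?)\\d+px-(.+)$");

    /** Removes a leading "<digits>px-" from the file name, keeping any namespace prefix. */
    static String stripWidthPrefix(String imageName) {
        Matcher m = WIDTH_PREFIX.matcher(imageName);
        return m.matches() ? m.group(1) + m.group(2) : imageName;
    }

    public static void main(String[] args) {
        System.out.println(stripWidthPrefix("Datei:116px-Nyatapole2.jpg"));
        System.out.println(stripWidthPrefix("116px-Nyatapole2.jpg"));
    }
}
```

Anchoring the pattern at the start means hyphens elsewhere in the file name are left alone. The remaining weakness is a file whose real name begins with something like "100px-", which this sketch would shorten incorrectly.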

GoogleCodeExporter commented 8 years ago

Hi,

I have created a new "in-memory" APIWikiModel along with an example and another 
modification to the DocumentCreator. Everything is contained within the 
attached patch (SVN). Is it possible to apply and commit this patch?

Regards,

Sven S.

Original comment by sven.strohschein@googlemail.com on 20 Aug 2012 at 7:35

Attachments:

in-memory-support.patch

GoogleCodeExporter commented 8 years ago

I added your patch with this commit: r6831.

Original comment by axelclk@gmail.com on 21 Aug 2012 at 9:32

GoogleCodeExporter commented 8 years ago

Hi,

thanks for adding the patch.

I detected two new problems which I have fixed with another patch. Could you 
please also add this patch?

1. Problem: When the image file has an SVG extension, the WikiModel changes the 
extension from ".svg" to ".svg.png". This behavior isn't desired in the 
in-memory model, because it breaks the image URL. I added a quick fix similar 
to the one for the image-size prefix I described above. This should be improved 
in the future, for example by making this behavior overridable in the WikiModel.

2. Problem: An article image wasn't detected, because the prefix/namespace 
check for images does not always work. 
INamespace#getImage() returned "Datei" (German locale for "File") and 
INamespace#getImage() returned "Image", but the article contained "File" (not 
localized). So all three prefixes should be checked, because some article 
requests return "Datei" and others return "File".

Original comment by sven.strohschein@googlemail.com on 1 Sep 2012 at 11:21

Attachments:

in-memory-support-2.patch
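
The two problems described above can be illustrated with a small sketch. `ImageLinks` and both method names are hypothetical and do not reproduce the attached patch; the prefix list is passed in by the caller, for example the two values reported by INamespace plus the unlocalized "File".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helpers for the two image-link problems described in this comment. */
final class ImageLinks {

    /** True if the link starts with one of the given namespace prefixes (case-insensitive). */
    static boolean hasImagePrefix(String link, List<String> prefixes) {
        int colon = link.indexOf(':');
        if (colon <= 0) {
            return false; // no namespace part at all
        }
        String namespace = link.substring(0, colon).trim();
        for (String prefix : prefixes) {
            if (prefix.equalsIgnoreCase(namespace)) {
                return true;
            }
        }
        return false;
    }

    /** Turns "name.svg.png" back into "name.svg"; other names are returned unchanged. */
    static String undoSvgPngSuffix(String fileName) {
        if (fileName.toLowerCase(Locale.ROOT).endsWith(".svg.png")) {
            return fileName.substring(0, fileName.length() - ".png".length());
        }
        return fileName;
    }

    public static void main(String[] args) {
        List<String> prefixes = Arrays.asList("Datei", "Image", "File");
        System.out.println(hasImagePrefix("File:Nyatapole2.jpg", prefixes));
        System.out.println(undoSvgPngSuffix("Logo.svg.png"));
    }
}
```

Keeping the prefix list as a parameter avoids hard-coding one locale, so the same check works for wikis in other languages.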

GoogleCodeExporter commented 8 years ago

Added your patch with commit r6896.

Original comment by axelclk@gmail.com on 2 Sep 2012 at 8:55

GoogleCodeExporter commented 8 years ago

Hi,

I have improved the code one final time. The solution is now more stable (an 
error occurred when the original image name contained a "-" sign), the code is 
now clean (the ToDo has also been resolved), and performance should be 
better.

Could you please integrate the patch into 3.0.20? I hope it can also be deployed 
to Sonatype soon. :-)

Thanks.

Original comment by sven.strohschein@googlemail.com on 17 Oct 2013 at 9:27

Attachments:

in-memory-support-3.patch

GoogleCodeExporter commented 8 years ago

I think the issue can also be marked as fixed once 
in-memory-support-3.patch is applied.

Original comment by sven.strohschein@googlemail.com on 17 Oct 2013 at 9:29

GoogleCodeExporter commented 8 years ago

Committed r9124 and r9125

Original comment by axelclk@gmail.com on 20 Oct 2013 at 5:14

stm2 / gwtwiki

Convert Wikipedia pages to HTML/XML #96