gnosygnu / xowa

xowa offline wiki application

Using API/SQLite to access txt/html #71

Closed: oakenshield closed this issue 8 years ago

oakenshield commented 8 years ago

Hello, thank you for developing this really nice framework! I would like to use the db to query the html/text files of a Wiki. I couldn't find JavaDoc or a description of the API. However, I managed to query the xowa-sqlite db files, but the actual texts seem to be compressed. What library are you using? Alternatively: How could I use the API to search for articles or iterate over all articles and fetch the text/html content? Btw: What are you using to process the html from the XML dumps? I've been using Sweble for some time.

Best wishes,

Rüdiger

gnosygnu commented 8 years ago

Hi! Thanks for the compliment as well as the interest!

> I would like to use the db to query the html/text files of a Wiki. I couldn't find JavaDoc or a description of the API.

Sorry! There really is no API right now (and JavaDoc is non-existent).

> However, I managed to query the xowa-sqlite db files, but the actual texts seem to be compressed. What library are you using?

The short answer is "gzip" (built-in Java library support) with some handmade compression.

The longer answer follows:
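To make the gzip part concrete, here's a minimal sketch of pulling one compressed page out of a .xowa SQLite file and inflating it with the standard java.util.zip classes. The table and column names below are placeholders for illustration (check the actual schema of your db file), and you'll need a SQLite JDBC driver such as sqlite-jdbc on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.zip.GZIPInputStream;

public class XowaBlobReader {
    public static void main(String[] args) throws Exception {
        // Placeholder path, table, and column names; adjust to the actual .xowa file and schema.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/path/to/wiki.xowa");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT body FROM page_html WHERE page_id = ?")) {
            ps.setInt(1, 1234); // hypothetical page id
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    byte[] compressed = rs.getBytes(1);
                    System.out.println(gunzip(compressed));
                }
            }
        }
    }

    // Inflate a gzip-compressed blob back into a UTF-8 string.
    static String gunzip(byte[] compressed) throws Exception {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```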

> Alternatively: How could I use the API to search for articles or iterate over all articles and fetch the text/html content?

There really isn't an API now. If you're adventurous, you can do what I do and....

However, this process takes a long time: about 75 hours on an above-average machine (3.5 GHz processor, 16 GB RAM, and an SSD). Keep in mind that for 5 million articles, that's still a throughput of about 20 articles per second.
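(For reference: 5,000,000 articles / (75 hours × 3,600 seconds per hour) ≈ 18.5 articles per second.)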

I'm rewriting the dumper now to handle multiple threads, and I'm hoping to cut this down to 15 hours on a 4+4 core machine (+4 for hyperthreading). When I'm done, I'll put in some sort of interface to output the generated HTML to something other than SQLite. I've been busy with it for the past 2 weeks, but it probably needs another week or two.
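To give a rough idea of what such an interface could look like, here's a hypothetical sketch (these type names don't exist in XOWA; a SQLite-backed sink would just be another implementation next to this file-per-page one):

```java
// Hypothetical sketch only; none of these types exist in XOWA today.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Receives each page's generated HTML; SQLite would be just one implementation. */
interface HtmlDumpSink {
    void write(int pageId, String title, String html) throws IOException;
    void close() throws IOException;
}

/** Example alternative sink: one .html file per page in a target directory. */
class FileSystemSink implements HtmlDumpSink {
    private final Path dir;

    FileSystemSink(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    @Override
    public void write(int pageId, String title, String html) throws IOException {
        // Name files by page id to avoid characters that are illegal in file names.
        Files.write(dir.resolve(pageId + ".html"), html.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() {
        // Nothing to flush for plain files.
    }
}

public class HtmlDumpSinkSketch {
    public static void main(String[] args) throws IOException {
        HtmlDumpSink sink = new FileSystemSink(Path.of("html-out"));
        sink.write(1, "Sample page", "<html><body>Hello</body></html>");
        sink.close();
    }
}
```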

If you're interested in the multiple-thread build, let me know, and I'll work with you on getting something set up sooner.

For what it's worth, I know someone else dumped English Wikipedia by sending HTTP requests to the XOWA server. If I remember correctly, they created 20 or so XOWA processes and then farmed out page ranges to each worker. Obviously, this requires a lot of setup as well as a super machine (128+ GB RAM), but I just thought I'd mention it here.
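A rough sketch of that farming-out pattern, in case it's useful: split the page list into chunks and have each worker request pages from its own XOWA instance over HTTP. The port and URL pattern below are assumptions (copy whatever your local XOWA server shows in the address bar), and the real setup would point each worker at a separate process:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HttpDumpFarm {
    // Assumed URL pattern for a locally running XOWA HTTP server; adjust to the real scheme.
    private static final String URL_FMT = "http://localhost:%d/en.wikipedia.org/wiki/%s";

    public static void main(String[] args) {
        List<String> titles = List.of("Earth", "Moon", "Sun"); // the real page list comes from elsewhere
        int workers = 4; // the setup described above used 20 or so XOWA processes
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final int workerId = w;
            pool.submit(() -> {
                // Each worker takes every Nth title; in the real setup each would also
                // target its own XOWA process (e.g. a different port per worker).
                for (int i = workerId; i < titles.size(); i += workers) {
                    String title = titles.get(i);
                    try {
                        String html = fetch(String.format(URL_FMT, 8080, title));
                        System.out.printf("worker %d fetched %s (%d chars)%n",
                                workerId, title, html.length());
                    } catch (Exception e) {
                        System.err.println("failed: " + title + " -> " + e);
                    }
                }
            });
        }
        pool.shutdown();
    }

    static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```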

> Btw: What are you using to process the html from the XML dumps? I've been using Sweble for some time.

XOWA uses its own custom parser. It's based on the MediaWiki PHP code, but there are a lot of divergences (particularly to handle a non-server environment).

I honestly hadn't heard of Sweble until now. I took a quick look at the code, and it is promising. Here's the stuff I think it does well:

However, for something like English Wikipedia, you'll also need:

Anyway, hope this information is useful. Let me know if you have other questions. Thanks!

oakenshield commented 8 years ago

Hello, thank you! Using GZIPInputStream over a ByteArrayInputStream worked perfectly. I started to write some regexps to undo the compression of the HTML tags, but that can't really be the right way :-). Maybe I am better off using the normally-compressed version. I'd love to discuss this general topic in more detail. Could you contact me? Contact: https://hucompute.org/team/rudiger-gleim/

gnosygnu commented 8 years ago

Sure! I'll send you a quick email now.