gnosygnu / xowa

xowa offline wiki application

Using API/SQLite to access txt/html #71

Closed: oakenshield closed this issue 8 years ago

oakenshield commented 8 years ago

Hello, thank you for developing this really nice framework! I would like to use the db to query the html/text files of a Wiki. I couldn't find JavaDoc or a description of the API. However, I managed to query the xowa-sqlite db files, but the actual texts seem to be compressed. What library are you using? Alternatively: How could I use the API to search for articles or iterate over all articles and fetch the text/html content? Btw: What are you using to process the html from the XML dumps? I've been using Sweble for some time.

Best wishes,

Rüdiger

gnosygnu commented 8 years ago

Hi! Thanks for the compliment as well as the interest!

> I would like to use the db to query the html/text files of a Wiki. I couldn't find JavaDoc or a description of the API.

Sorry! There really is no API right now (and JavaDoc is non-existent).

> However, I managed to query the xowa-sqlite db files, but the actual texts seem to be compressed. What library are you using?

The short answer is "gzip" (built-in Java library support) with some handmade compression.

The longer answer follows:
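To make the gzip part concrete, here's a minimal sketch of pulling one compressed page out of a .xowa SQLite file and inflating it with the standard java.util.zip classes. The table and column names below are placeholders for illustration (check the actual schema of your db file), and you'll need a SQLite JDBC driver such as sqlite-jdbc on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.zip.GZIPInputStream;

public class XowaBlobReader {
    public static void main(String[] args) throws Exception {
        // Placeholder path, table, and column names; adjust to the actual .xowa file and schema.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/path/to/wiki.xowa");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT body FROM page_html WHERE page_id = ?")) {
            ps.setInt(1, 1234); // hypothetical page id
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    byte[] compressed = rs.getBytes(1);
                    System.out.println(gunzip(compressed));
                }
            }
        }
    }

    // Inflate a gzip-compressed blob back into a UTF-8 string.
    static String gunzip(byte[] compressed) throws Exception {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```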

> Alternatively: How could I use the API to search for articles or iterate over all articles and fetch the text/html content?

There really isn't an API now. If you're adventurous, you can do what I do and....

However, this process takes a long time: about 75 hours on an above-average machine (3.5 GHz processor, 16 GB RAM, and an SSD). Keep in mind that for 5 million articles, that's still a throughput of about 20 articles per second.
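(For reference: 5,000,000 articles / (75 hours × 3,600 seconds per hour) ≈ 18.5 articles per second.)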

I'm rewriting the dumper now to handle multiple threads, and I'm hoping to cut this down to 15 hours on a 4+4 core machine (+4 for hyperthreading). When I'm done, I'll put in some sort of interface to output the generated HTML to something other than SQLite. I've been busy with it for the past 2 weeks, but it probably needs another week or two.
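To give a rough idea of what such an interface could look like, here's a hypothetical sketch (these type names don't exist in XOWA; a SQLite-backed sink would just be another implementation next to this file-per-page one):

```java
// Hypothetical sketch only; none of these types exist in XOWA today.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Receives each page's generated HTML; SQLite would be just one implementation. */
interface HtmlDumpSink {
    void write(int pageId, String title, String html) throws IOException;
    void close() throws IOException;
}

/** Example alternative sink: one .html file per page in a target directory. */
class FileSystemSink implements HtmlDumpSink {
    private final Path dir;

    FileSystemSink(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    @Override
    public void write(int pageId, String title, String html) throws IOException {
        // Name files by page id to avoid characters that are illegal in file names.
        Files.write(dir.resolve(pageId + ".html"), html.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() {
        // Nothing to flush for plain files.
    }
}

public class HtmlDumpSinkSketch {
    public static void main(String[] args) throws IOException {
        HtmlDumpSink sink = new FileSystemSink(Path.of("html-out"));
        sink.write(1, "Sample page", "<html><body>Hello</body></html>");
        sink.close();
    }
}
```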

If you're interested in the multiple-thread build, let me know, and I'll work with you on getting something set up sooner.

For what it's worth, I know someone else dumped English Wikipedia by sending HTTP requests to the XOWA server. If I remember correctly, they created 20 or so XOWA processes and then farmed out page ranges to each worker. Obviously, this requires a lot of setup as well as a super machine (128+ GB RAM), but I just thought I'd mention it here.
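A rough sketch of that farming-out pattern, in case it's useful: split the page list into chunks and have each worker request pages from its own XOWA instance over HTTP. The port and URL pattern below are assumptions (copy whatever your local XOWA server shows in the address bar), and the real setup would point each worker at a separate process:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HttpDumpFarm {
    // Assumed URL pattern for a locally running XOWA HTTP server; adjust to the real scheme.
    private static final String URL_FMT = "http://localhost:%d/en.wikipedia.org/wiki/%s";

    public static void main(String[] args) {
        List<String> titles = List.of("Earth", "Moon", "Sun"); // the real page list comes from elsewhere
        int workers = 4; // the setup described above used 20 or so XOWA processes
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final int workerId = w;
            pool.submit(() -> {
                // Each worker takes every Nth title; in the real setup each would also
                // target its own XOWA process (e.g. a different port per worker).
                for (int i = workerId; i < titles.size(); i += workers) {
                    String title = titles.get(i);
                    try {
                        String html = fetch(String.format(URL_FMT, 8080, title));
                        System.out.printf("worker %d fetched %s (%d chars)%n",
                                workerId, title, html.length());
                    } catch (Exception e) {
                        System.err.println("failed: " + title + " -> " + e);
                    }
                }
            });
        }
        pool.shutdown();
    }

    static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```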

> Btw: What are you using to process the html from the XML dumps? I've been using Sweble for some time.

XOWA uses its own custom parser. It's based on the MediaWiki PHP code, but there are a lot of divergences (particularly to handle a non-server environment).

I honestly hadn't heard of Sweble until now. I took a quick look at the code, and it is promising. Here's the stuff I think it does well:

However, for something like English Wikipedia, you'll also need:

Anyway, hope this information is useful. Let me know if you have other questions. Thanks!

oakenshield commented 8 years ago

Hello, thank you! Using GZIPInputStream over a ByteArrayInputStream worked perfectly. I started to write some regexps to undo the compression of the HTML tags, but that can't really be the right way :-). Maybe I am better off using the normally-compressed version. I'd love to discuss this general topic in more detail. Could you contact me? Contact: https://hucompute.org/team/rudiger-gleim/

gnosygnu commented 8 years ago

Sure! I'll send you a quick email now.