andrewreed opened this issue 3 years ago (status: Open)
Hey, thanks for the interest in XOWA as well as the detailed issue.
Regarding your questions:
I see that you offer a .xowa file with pre-rendered HTML pages, and I see that I could simply iterate through the SQLite file and unzip each page to determine the size of each page's HTML. Ideally, though, I'd like to work from the wikitext version so that I can recompute a page's size when an article is updated.
If you need the size of a page, this is stored in en.wikipedia.org-core.xowa. Open it up with SQLite and run something like SELECT page_id, page_title, page_len FROM page LIMIT 10; the page_len column gives the wikitext length of the article:
sqlite> SELECT page_id, page_title, page_len FROM page LIMIT 10;
page_id     page_title                                page_len
----------  ----------------------------------------  ----------
10          AccessibleComputing                       94
12          Anarchism                                 84783
13          AfghanistanHistory                        90
14          AfghanistanGeography                      92
15          AfghanistanPeople                         95
18          AfghanistanCommunications                 97
19          AfghanistanTransportations                113
20          AfghanistanMilitary                       88
21          AfghanistanTransnationalIssues            101
23          AssistiveTechnology                       88
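In Python, the same lookup is roughly the following sketch (the database path is just a placeholder for wherever your local copy lives):

import sqlite3

# Path is a placeholder; point it at your local en.wikipedia.org-core.xowa.
conn = sqlite3.connect("en.wikipedia.org-core.xowa")

# page_len is the stored wikitext length of each article, in bytes.
for page_id, page_title, page_len in conn.execute(
        "SELECT page_id, page_title, page_len FROM page LIMIT 10"):
    print(page_id, page_title, page_len)

conn.close()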
My question is: How would I take a page_id and render the associated text_data into HTML? Is there a function in XOWA's source code that you could refer me to?
So XOWA provides 2 types of files: -text.xowa databases and -html.xowa databases. You can download the latest from https://archive.org/details/Xowa_enwiki_2020-08
If you want to work with the wikitext and convert that to HTML, then you'd have to run the XOWA parser. There's a command-line option to run it, but it's a bit slow. There's also an API way to do so, and you'd be working with a class called Xop_mediawiki_mgr.

Ideally, though, you should just take the HTML database and run gunzip on the blobs within. This will be much faster.
For more info, you can also look at a similar issue: https://github.com/gnosygnu/xowa/issues/739#issuecomment-639468117
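As a rough Python sketch of that approach: note that the table and column names used below (html, html_page_id, html_body) are placeholders for illustration only, since the exact schema isn't spelled out here; check the real layout first with ".schema".

import gzip
import sqlite3

# Placeholder path; point it at one of the downloaded -html.xowa databases.
# NOTE: the table/column names below ("html", "html_page_id", "html_body")
# are guesses for illustration; inspect the actual schema first, e.g.:
#   sqlite3 your-file.xowa ".schema"
conn = sqlite3.connect("en.wikipedia.org-html.xowa")

total = 0
for page_id, blob in conn.execute("SELECT html_page_id, html_body FROM html"):
    html = gzip.decompress(blob)   # the blobs are gzip-compressed page HTML
    total += len(html)             # uncompressed size of this page's HTML

print("total uncompressed HTML bytes:", total)
conn.close()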
Given a fil_name from "...file-core.xowa", how would I locate the associated bin_data in "...ns.000-db.001.xowa"? I'm having trouble figuring out how an entry in "...file-core.xowa" maps to an image in "...ns.000-db.001.xowa".
So there are two main tables in file-core.xowa: fsdb_fil (originals) and fsdb_thm (thumbs). Both of these tables have a column called _bin_db_id, wherein:
- -1 means that the image isn't available (the row exists for informational purposes)
- any other value maps to dbb_uid in fsdb_dbb

fsdb_dbb also has a file name. So you can open that file and cross-reference the bin_owner_id to pull the bin_data.
Note that there is also bin_owner_tid, wherein:
- 1 indicates the owner_id is an original
- 2 indicates the owner_id is a thumb
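Putting that together, a sketch of the lookup in Python might look like the following; the fsdb_bin table name and the fil_bin_db_id / dbb_file_name column names are guesses based on the naming pattern above, so verify them against the real schema first.

import sqlite3

# Sketch of the lookup described above. Names marked GUESS follow the
# fil_* / thm_* / dbb_* / bin_* pattern in this thread but should be
# checked against the actual schema before relying on them.
core = sqlite3.connect("en.wikipedia.org-file-core.xowa")  # placeholder path

name = "Example.jpg"  # hypothetical file name

# 1. Find the file row and which bin database holds its bytes.
fil_id, bin_db_id = core.execute(
    "SELECT fil_id, fil_bin_db_id FROM fsdb_fil WHERE fil_name = ?",  # GUESS: fil_bin_db_id
    (name,)).fetchone()

if bin_db_id == -1:
    raise SystemExit("image not available; the row is informational only")

# 2. fsdb_dbb maps that id (dbb_uid) to the .xowa file containing the blob.
(dbb_file,) = core.execute(
    "SELECT dbb_file_name FROM fsdb_dbb WHERE dbb_uid = ?",  # GUESS: dbb_file_name
    (bin_db_id,)).fetchone()

# 3. In that bin database, cross-reference bin_owner_id (and bin_owner_tid,
#    where 1 = original and 2 = thumb) to pull bin_data.
bins = sqlite3.connect(dbb_file)
(bin_data,) = bins.execute(
    "SELECT bin_data FROM fsdb_bin "  # GUESS: fsdb_bin table name
    "WHERE bin_owner_id = ? AND bin_owner_tid = 1",
    (fil_id,)).fetchone()

print(name, len(bin_data), "bytes")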
Thanks! This definitely helps.
Pages

After reviewing your response, as well as issue #739, it appears that my best bet for determining page sizes will be to use wget to retrieve the pages from the HTTP server. The reason is that I need the size of the entire HTML response, whereas the zipped version of each page appears to be just the body of each article. Still, retrieval via wget is much faster when XOWA is using the pre-rendered HTML.
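For example, something along these lines (a sketch only; the localhost:8080 base URL and URL layout below are assumptions about the local setup, not something stated in this thread):

import sqlite3
import requests

# Placeholder values: adjust the core database path and the server base URL
# to your own setup (the localhost:8080 layout here is an assumption).
BASE = "http://localhost:8080/en.wikipedia.org/wiki/"

core = sqlite3.connect("en.wikipedia.org-core.xowa")
sizes = {}
for (title,) in core.execute("SELECT page_title FROM page LIMIT 100"):
    resp = requests.get(BASE + title)
    sizes[title] = len(resp.content)  # bytes in the full HTML response

print(sum(sizes.values()), "bytes across", len(sizes), "pages")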
However, when using wget, the XOWA HTTP server froze with a "Java heap space" error after 50565 articles had been retrieved.
Questions:
Images

Images look pretty straightforward. Based on your response, it appears that I can first search for an image name in the fsdb_fil table, get that fil_id, and then add 1 to the fil_id to find the matching thumbnail version in the fsdb_thm table. Once I have the correct thumb version in fsdb_thm, the thm_size equates to the file size, so I do not need to retrieve the actual image.
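Roughly along these lines (just a sketch; the fil_id + 1 mapping and the thm_id column name are assumptions on my part, not documented behavior):

import sqlite3

# Sketch only: the "thumb row = fil_id + 1" step and the thm_id column name
# are assumptions based on the reasoning above, not documented behavior.
core = sqlite3.connect("en.wikipedia.org-file-core.xowa")  # placeholder path

name = "Example.jpg"  # hypothetical image name

(fil_id,) = core.execute(
    "SELECT fil_id FROM fsdb_fil WHERE fil_name = ?", (name,)).fetchone()

(thm_size,) = core.execute(
    "SELECT thm_size FROM fsdb_thm WHERE thm_id = ?",  # GUESS: thm_id
    (fil_id + 1,)).fetchone()

print(name, "thumbnail size:", thm_size, "bytes")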
I've recently started using XOWA in my research, and there are a few things that I'd like to be able to do.
Overall, I'd like to determine the total size (in bytes) of a page, i.e. (1) the size of the HTML for a page and (2) the size of each referenced image.
More specifically, I was wondering if you could provide me with some advice/pointers for the following:
Pages

I see that you offer a .xowa file with pre-rendered HTML pages, and I see that I could simply iterate through the SQLite file and unzip each page to determine the size of each page's HTML. Ideally, though, I'd like to work from the wikitext version so that I can recompute a page's size when an article is updated. My question is: How would I take a page_id and render the associated text_data into HTML? Is there a function in XOWA's source code that you could refer me to?

Images

Given a fil_name from "...file-core.xowa", how would I locate the associated bin_data in "...ns.000-db.001.xowa"? I'm having trouble figuring out how an entry in "...file-core.xowa" maps to an image in "...ns.000-db.001.xowa".

Right now, I could accomplish all of the above with a Python script that extracts article names from "core.xowa" and then uses the Requests library to retrieve every article and image from XOWA's HTTP server. However, this will be painfully slow, even with the server and scraper running on the same machine. Thus, I'm hoping to be able to directly leverage the SQLite files and a few of XOWA's functions for these tasks.