
Calculate Page and Image Sizes #822

Open andrewreed opened 3 years ago

andrewreed commented 3 years ago

I've recently started using XOWA in my research, and there are a few things I'd like to be able to do.

Overall, I'd like to determine the total size (in bytes) of a page: (1) the size of the page's HTML and (2) the size of each referenced image.


More specifically, I was wondering if you could provide me with some advice/pointers for the following:

  1. Pages: I see that you offer a .xowa file with pre-rendered HTML pages, and I see that I could simply iterate through the SQLite file and unzip each page to determine the size of each page's HTML. Ideally, though, I'd like to work from the wikitext version so that I can recompute a page's size when an article is updated. My question is: How would I take a page_id and render the associated text_data into HTML? Is there a function in XOWA's source code that you could refer me to?

  2. Images: Given a fil_name from "...file-core.xowa", how would I locate the associated bin_data in "...ns.000-db.001.xowa"? I'm having trouble figuring out how an entry in "...file-core.xowa" maps to an image in "...ns.000-db.001.xowa".


Right now, I could accomplish all of the above with a Python script that extracts article names from "core.xowa" and then uses the Requests library to retrieve every article and image from XOWA's HTTP server. However, this will be painfully slow, even with the server and scraper running on the same machine. Thus, I'm hoping to be able to directly leverage the SQLite files and a few of XOWA's functions for these tasks.
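
For illustration, a minimal sketch of that scraping approach might look like this (the database path, port, and /en.wikipedia.org/wiki/ URL pattern are assumptions about a default local XOWA HTTP server; adjust to your install):

```python
import sqlite3
import urllib.parse

import requests

# Assumed locations: the core db path, port 8080, and the
# /en.wikipedia.org/wiki/<title> URL pattern are guesses about a
# default local XOWA HTTP server setup.
CORE_DB = "en.wikipedia.org-core.xowa"
BASE_URL = "http://localhost:8080/en.wikipedia.org/wiki/"

conn = sqlite3.connect(CORE_DB)
page_sizes = {}
for page_id, title in conn.execute(
    "SELECT page_id, page_title FROM page LIMIT 100"
):
    resp = requests.get(BASE_URL + urllib.parse.quote(title))
    page_sizes[page_id] = len(resp.content)  # full HTML response, in bytes
conn.close()
```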

gnosygnu commented 3 years ago

Hey, thanks for the interest in XOWA as well as the detailed issue.

Regarding your questions:

I see that you offer a .xowa file with pre-rendered HTML pages, and I see that I could simply iterate through the SQLite file and unzip each page to determine the size of each page's HTML. Ideally, though, I'd like to work from the wikitext version so that I can recompute a page's size when an article is updated.

If you need the size of a page, this is stored in en.wikipedia.org-core.xowa. Open it up with SQLite and run something like SELECT page_id, page_title, page_len FROM page LIMIT 10;. The page_len column gives the wikitext length of the article in bytes:

sqlite> SELECT page_id, page_title, page_len FROM page LIMIT 10;
page_id     page_title                                page_len
----------  ----------------------------------------  ----------
10          AccessibleComputing                       94
12          Anarchism                                 84783
13          AfghanistanHistory                        90
14          AfghanistanGeography                      92
15          AfghanistanPeople                         95
18          AfghanistanCommunications                 97
19          AfghanistanTransportations                113
20          AfghanistanMilitary                       88
21          AfghanistanTransnationalIssues            101
23          AssistiveTechnology                       88
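
For bulk sizing, the same query can be aggregated from a short script. The columns are as shown above; the page_namespace = 0 filter (to count articles only) is an assumption carried over from MediaWiki's standard page table:

```python
import sqlite3

conn = sqlite3.connect("en.wikipedia.org-core.xowa")
# page_len is the wikitext length shown above; the page_namespace = 0
# filter (articles only) is an assumption from MediaWiki's page table.
total_bytes, n_pages = conn.execute(
    "SELECT SUM(page_len), COUNT(*) FROM page WHERE page_namespace = 0"
).fetchone()
print(f"{n_pages} articles, {total_bytes} bytes of wikitext")
conn.close()
```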

My question is: How would I take a page_id and render the associated text_data into HTML? Is there a function in XOWA's source code that you could refer me to?

So XOWA provides two types of database files: wikitext databases and pre-rendered HTML databases.

You can download the latest from https://archive.org/details/Xowa_enwiki_2020-08

If you want to work with the wikitext and convert it to HTML, then you'd have to run the XOWA parser. There's a command-line option to run it, but it's a bit slow. There's also an API route, in which case you'd be working with a class called Xop_mediawiki_mgr.

Ideally, though, you should just take the HTML database and gunzip the blobs within. This will be much faster.
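
As a rough sketch of that approach (the thread doesn't spell out the HTML database's schema, so the file, table, and column names below are placeholders; inspect the real schema first, e.g. with .schema in the sqlite3 shell, and substitute the actual names):

```python
import gzip
import sqlite3

# Placeholder file/table/column names -- replace after checking the schema.
conn = sqlite3.connect("en.wikipedia.org-html.xowa")
html_sizes = {}
for page_id, blob in conn.execute("SELECT page_id, page_html_zip FROM page_html"):
    # The blobs are gzip-compressed, per the comment above.
    html_sizes[page_id] = len(gzip.decompress(blob))
conn.close()
```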

For more info, you can also look at a similar issue: https://github.com/gnosygnu/xowa/issues/739#issuecomment-639468117

Given a fil_name from "...file-core.xowa", how would I locate the associated bin_data in "...ns.000-db.001.xowa"? I'm having trouble figuring out how an entry in "...file-core.xowa" maps to an image in "...ns.000-db.001.xowa".

So there are two main tables in file-core.xowa: fsdb_fil (the original files) and fsdb_thm (the thumbnail versions).

Both of these tables have a column called _bin_db_id.

The fsdb_dbb table maps each _bin_db_id to a database file name. So you can open that file and cross-reference the bin_owner_id to pull the bin_data.

Note that there is also a bin_owner_tid column, which indicates whether the bin_owner_id refers to a row in fsdb_fil or in fsdb_thm.
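
Putting that flow together, a rough sketch follows. Several names are assumptions, as noted in the comments (the prefixed fil_bin_db_id, the fsdb_dbb column names, and the fsdb_bin table); verify them against the actual schema before relying on this:

```python
import sqlite3

FILE_CORE = "en.wikipedia.org-file-core.xowa"

core = sqlite3.connect(FILE_CORE)

# 1. Look up the file row and its bin-db pointer. fil_name/fil_id come from
#    the thread; the exact prefixed name fil_bin_db_id is an assumption.
fil_id, bin_db_id = core.execute(
    "SELECT fil_id, fil_bin_db_id FROM fsdb_fil WHERE fil_name = ?",
    ("Example.jpg",),  # hypothetical image name
).fetchone()

# 2. Resolve the bin-db id to a database file name via fsdb_dbb
#    (the db_id / db_url column names are assumptions).
(bin_db_file,) = core.execute(
    "SELECT db_url FROM fsdb_dbb WHERE db_id = ?", (bin_db_id,)
).fetchone()
core.close()

# 3. Open that database and pull bin_data by owner id (the fsdb_bin table
#    name is an assumption; bin_owner_id / bin_data come from the thread).
bin_db = sqlite3.connect(bin_db_file)
(bin_data,) = bin_db.execute(
    "SELECT bin_data FROM fsdb_bin WHERE bin_owner_id = ?", (fil_id,)
).fetchone()
bin_db.close()

print(len(bin_data), "bytes")
```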

andrewreed commented 3 years ago

Thanks! This definitely helps.

Pages: After reviewing your response, as well as issue #739, it appears that my best bet for determining page sizes will be to use wget to retrieve the pages from the HTTP server. The reason is that I need the size of the entire HTML response, whereas the zipped version of each page appears to be just the body of each article. Still, retrieval via wget is much faster when XOWA is using the pre-rendered HTML.

However, after 50,565 articles had been retrieved via wget, the XOWA HTTP server froze with a "Java heap space" error.

Questions:

  1. Do you know if increasing the heap space will help, or will XOWA's memory usage continue to grow?
  2. Should I frequently restart XOWA after several thousand retrievals?
  3. Also, is there a subdirectory that I should delete with each restart?

Images: Images look pretty straightforward. Based on your response, it appears that I can first search for an image name in the fsdb_fil table, get that fil_id, and then add 1 to the fil_id to find the matching thumbnail version in the fsdb_thm table. Once I have the correct thumb version in fsdb_thm, the thm_size equates to the file size, so I do not need to retrieve the actual image.
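
For illustration, that lookup might be scripted like this. fsdb_fil, fsdb_thm, fil_name, fil_id, and thm_size come from the discussion above, as does the fil_id + 1 mapping; the thm_id column name is an assumption worth verifying:

```python
import sqlite3

conn = sqlite3.connect("en.wikipedia.org-file-core.xowa")
# Find the original file's id by name (hypothetical image name below).
(fil_id,) = conn.execute(
    "SELECT fil_id FROM fsdb_fil WHERE fil_name = ?", ("Example.jpg",)
).fetchone()
# Per the mapping described above, the matching thumbnail row is fil_id + 1;
# thm_size gives the file size, so the image itself is never fetched.
(thm_size,) = conn.execute(
    "SELECT thm_size FROM fsdb_thm WHERE thm_id = ?", (fil_id + 1,)
).fetchone()
print("thumbnail size:", thm_size, "bytes")
conn.close()
```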