tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia, etc.): wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Performance improvement ideas #37

Closed: xxyzz closed this issue 1 year ago

xxyzz commented 1 year ago

Proposed SQLite schema for storing dump pages:

CREATE TABLE pages (
    title TEXT,
    namespace_id INTEGER,
    redirect_to TEXT,
    need_pre_expand INTEGER,
    body TEXT,
    model TEXT,
    PRIMARY KEY (title, namespace_id)
);

Any suggestions and comments are welcome.

kristian-clausal commented 1 year ago

Because bzcat is accessed through a pipe, we can add a command-line parameter to change which command is used for decompression. lbzcat is not installed by default, at least on Mint (and, I assume, on other Ubuntu/Debian-based distros). It's not a Python module, so I'm not sure you could even install it in a venv with pip like we usually do.

Additionally, if users want to use something other than bzip2 or lbzip2, they could do so more easily with a command-line parameter than by hacking in something like a shell alias for bzcat.
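
A minimal sketch of that fallback idea, assuming a helper that streams the dump through an external decompressor; the function name and the decompress_command parameter are illustrative, not the actual wikitextprocessor API:

# Sketch: prefer the parallel lbzcat when it is on PATH, otherwise
# fall back to plain bzcat; an explicit command overrides both.
import shutil
import subprocess

def open_dump_stream(dump_path, decompress_command=None):
    """Return a file-like object yielding the decompressed XML bytes."""
    if decompress_command is None:
        decompress_command = "lbzcat" if shutil.which("lbzcat") else "bzcat"
    proc = subprocess.Popen(
        [decompress_command, dump_path],
        stdout=subprocess.PIPE,
        bufsize=256 * 1024,
    )
    return proc.stdout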

===

I hadn't even realized Tatu wrote his own parser in dumpparser.py. That's kind of amazing, but using a library built on C code and purpose-built for exactly this (iterparse seems perfect, from what I can glean of the structure of the dump's XML) sounds like it should speed things up.

I don't think people spend that much time parsing the dump file except to generate a cache, though. It takes some time, but not that much time.
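
As an illustration (not the actual dumpparser.py replacement), streaming page elements with lxml's iterparse might look roughly like this; the export namespace URI varies between dump versions and is an assumption here:

# Sketch: stream <page> elements out of a MediaWiki XML dump with
# lxml.etree.iterparse, clearing each element after use to keep
# memory flat.
from lxml import etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # adjust to the dump version

def iter_pages(xml_stream):
    """Yield (title, namespace_id, text) tuples from the dump stream."""
    for _event, page in etree.iterparse(xml_stream, tag=NS + "page"):
        title = page.findtext(NS + "title")
        ns_id = int(page.findtext(NS + "ns"))
        text = page.findtext(NS + "revision/" + NS + "text") or ""
        yield title, ns_id, text
        page.clear()  # free the already-processed element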

===

As for the database stuff, I don't know enough about databases yet to understand why this is better. It would be helpful to get an explanation of the specific ways it would be better than what we do now: the cache file is extremely simple, and the index to it is also simple (just "article" -> "location in cache file", I think), so what does a more complicated database bring to the table?

xxyzz commented 1 year ago

I'll use lbzcat if it's available and otherwise fall back to bzcat. Here is the commit that uses lxml: https://github.com/xxyzz/wikitextprocessor/commit/22e6d79800b5d1f8c2b0dfa2a5550e7a33153371

I think using SQLite is actually simpler and more efficient than managing a cache file and keeping all the data in memory. The current code tracks the file offset and writes buffered data to that file, which is not simple. The database won't load all the data into memory; it only loads the data it needs from disk. The in-memory cache data also gets copied to subprocesses, which wastes lots of RAM. But I haven't profiled the memory usage yet because I don't have enough RAM to test.
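
To make the on-demand point concrete, a single page lookup against the schema proposed above would be one indexed query; this is a sketch with an illustrative helper name, not the branch's actual code:

# Sketch: read one page body from disk on demand, keyed on the
# (title, namespace_id) primary key, instead of holding the whole
# dump in memory.
import sqlite3

def get_page_body(db_path, title, namespace_id=0):
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT body FROM pages WHERE title = ? AND namespace_id = ?",
            (title, namespace_id),
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()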

garfieldnate commented 1 year ago

I would also find a SQLite DB easier to work with, compared to the --pages-dir feature, which is an improvement over the single cache file but still unwieldy. Having hundreds of thousands of files is difficult when you need to look through the data at all; my terminal will freeze for long periods of time if I ever run ls, which also happens if I hit tab for completions (which is automatic for me). Deleting this directory also takes a long time, and searching takes some work, while both of these would be super simple with a SQLite DB.

I've never tried to open the huge directories in Finder because I'm afraid of freezing my computer. If the data were in SQLite I would open it with DB Browser, and I would search and filter and explore with absolutely no performance issues. It would be much easier to work with.

kristian-clausal commented 1 year ago

The directory in --pages-dir contains text files with un-parsed wikitext ripped straight out of the XML dump file; it's not really the output of wikitextprocessor or wiktextract, it's the (modified, cached, saved) input. Searching through it can be useful for dev work, but wikitextprocessor is not meant to create databases of wikitext. That sounds like a different side-project.

I use wikt-cache on beefy machines here at work, and --override on specific files I want to modify. I also spend a lot of time looking at the history of articles on wiktionary to see what changes could have broken something, which ends up as my main way to look at the wikitext sources. Admittedly, I don't do a lot of grepping in the pages directory, and only navigate it using tab-completion.

kristian-clausal commented 1 year ago

Tatu has given the OK to try migrating wikt-cache and wikt-cache.pickle to some kind of database solution. pages-dir is only a debugging tool for developers and shouldn't be touched; it's just a handy collection of files to access when parsing individual pages. The functions related to wikt-cache should all be in one place, and it shouldn't be too difficult to replace them.

Keep in mind that if there is no wikt-cache, it is recreated every time wiktwords is run; I assume this would also be the expected behavior with a database: create a temporary database for each run when no previously created database is passed as a parameter.
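
A rough sketch of that behavior, assuming a helper whose name and schema handling are illustrative rather than the eventual implementation: when no database path is given, a temporary SQLite file takes the place of the old wikt-cache for the duration of the run.

# Sketch: create a throwaway SQLite database when the caller does not
# supply an existing one, mirroring the recreate-the-cache-each-run
# behavior described above.
import sqlite3
import tempfile

def open_page_db(db_path=None):
    if db_path is None:
        tmp = tempfile.NamedTemporaryFile(
            prefix="wikitextprocessor_", suffix=".db", delete=False
        )
        db_path = tmp.name
        tmp.close()
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "title TEXT, namespace_id INTEGER, redirect_to TEXT, "
        "need_pre_expand INTEGER, body TEXT, model TEXT, "
        "PRIMARY KEY (title, namespace_id))"
    )
    return conn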

xxyzz commented 1 year ago

But it would be unnecessary to save pages to files, because all page texts will be stored in the database; the texts in pages-dir are currently just copied from the huge cache file.

kristian-clausal commented 1 year ago

The saved pages are just a side thing, apparently, and are not used except as the source from which you can get individual pages for the page parameter. That is, everything else always comes from the wikt-cache (or override=). The pages directory is for when you need to get an "address" (pages/Words/ad/address.txt) to a specific article for command-line purposes, and for grepping when debugging. Everything else is just wikt-cache.

garfieldnate commented 1 year ago

I would also see the features as redundant. If I had a SQLite DB with all the pages, I would search in there rather than grepping millions of files.

> wikitextprocessor is not meant to create databases of wikitext. That sounds like a different side-project.

I'm not looking to create a SQLite DB for other uses. While trying to add features and tests to Wiktextract itself I needed to search for articles with specific features to try, and re-ran the tool on specific pages repeatedly during development. The DB would make development easier.

> The directory in --pages-dir contains text files with un-parsed wikitext stuff ripped straight out of the XML dump file; it's not really the output of wikitextprocessor or wiktextract, it's the (modified, cached, saved) input.

If the data is different from what's stored in the cache file, then I would just add another table to the DB and put it there. The DB can take care of both things. It's designed to make it easy to store, retrieve and search huge amounts of structured data, so it's an ideal application.

kristian-clausal commented 1 year ago

You already have approval to do whatever you think is necessary to replace wikt-cache with a database. Just leave --pages-dir as is for others who do use it.

xxyzz commented 1 year ago

I have migrated the code to use SQLite in the sqlite branch: https://github.com/xxyzz/wikitextprocessor/tree/sqlite. It was created from the lua5.1 branch.

All tests pass. I'll change the wiktextract code next, then test the wiktwords command.

kristian-clausal commented 1 year ago

If you commit the wiktwords changes to your repo I can also do testing.

xxyzz commented 1 year ago

I pushed the changes to https://github.com/xxyzz/wiktextract/tree/sqlite

But wiktextract/tests/test_long.py gets a strange "no such table" SQL error; I'm not sure what went wrong.

xxyzz commented 1 year ago

The SQL error is fixed. I'm running the wiktwords command now; it's still in the first extraction phase.

wiktwords --all --all-languages --out data.json --db-path db --override wiktextract/overrides enwiktionary-20230501-pages-articles.xml.bz2 |& tee log

xxyzz commented 1 year ago

The SQLite insert speed is kind of slow. When I tested yesterday, it inserted about 2000 rows (pages) per second, and it took around 50 minutes to finish the first phase.

I have fixed some errors in the first-phase code. The memory usage of the second phase still increases over time, but more slowly than before.

xxyzz commented 1 year ago

SQLAlchemy's upsert method is very slow; maybe I didn't use it correctly. The insert speed improved a lot after this change: https://github.com/xxyzz/wikitextprocessor/commit/05b16b1f31247fb85d83712233a55d94408999fb
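
For context, the usual way to make SQLite bulk inserts fast is to batch many rows into a single transaction with executemany instead of issuing one upsert statement per page; the sketch below shows that general technique and is not the exact change in the linked commit:

# Sketch: batched INSERT OR REPLACE, committing each batch as one
# transaction; per-row upserts with autocommit are typically far slower.
import sqlite3

def insert_pages(conn, rows, batch_size=1000):
    """rows: iterable of (title, namespace_id, redirect_to, need_pre_expand, body, model) tuples."""
    sql = "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?)"
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            with conn:  # commits the whole batch at once
                conn.executemany(sql, batch)
            batch.clear()
    if batch:
        with conn:
            conn.executemany(sql, batch)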

xxyzz commented 1 year ago

I think I have fixed most of the errors, and both sqlite branches are ready for review.

kristian-clausal commented 1 year ago

Did you delete the pull request? There's only the lua5.1 branch visible as a pull request currently.

xxyzz commented 1 year ago

I didn't create pull requests, because both branches (xxyzz/wikitextprocessor/tree/sqlite, xxyzz/wiktextract/tree/sqlite) are created from the branches in my other two open pull requests (the lua5.1 and pyproject branches). And those lua5.1 pull requests are blocked by my Lupa pull request (or a new Lupa release).

You can see the changes at these links:

And compare them with the lua5.1 branches:

I have rebased both the lua5.1 and pyproject branches onto the master/main branches.

xxyzz commented 1 year ago

I have created two pull requests: https://github.com/tatuylonen/wikitextprocessor/pull/41 and https://github.com/tatuylonen/wiktextract/pull/248

jmviz commented 1 year ago

Great work. Is there a rough idea of the performance differences?

xxyzz commented 1 year ago

I didn't wait for the command to finish running; here is a comparison of the estimated remaining time:

master branch: ... 35776/8447283 pages (0.4%) processed, 38:08:30 remaining
sqlite branch: ... 32670/7469140 pages (0.4%) processed, 29:39:46 remaining

kristian-clausal commented 1 year ago

That is a huge improvement! Kaikki seems to be regenerating OK (either because of the override on xnt-decl or because I fixed the issue with git...), and it took around 5 hours to finish the wiktextract run, so if all goes well it could drop down to 4 or 3½ hours (most of the time is taken by the site generation itself: taking the data and doing some post-processing plus HTML generation).

I am tempted to merge the sqlite branch and fix the issues with xnt-decl poisoning the globals (even though they should be reset) after that...

kristian-clausal commented 1 year ago

I will also report that processing individual pages used to have a small pause at the beginning (99% certain it was because of file-operation stuff) that is basically gone now.

kristian-clausal commented 1 year ago

It seems like everything from the original post has been completed, so I'll close this issue. Great work!