Because `bzcat` is accessed through a pipe, we can create a command-line parameter to change what command to use when unzipping. `lbzcat` is not installed by default, at least on Mint (and I assume other Ubuntu/Debian-based distros). It's not a Python module, so I'm not sure whether you could even install it in a venv with pip like we usually do. Additionally, if a user wants to use something other than `bzip2` or `lbzip2`, a command-line parameter would be easier than having to hack something like an `alias bzcat=`.
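To make that concrete, here is a minimal sketch of what such an option could look like; the `--unzip-command` name and the argparse wiring are hypothetical, not the actual dumpparser interface:

```python
# Hypothetical sketch: let the user choose the decompression command that the
# dump file is piped through; prefer lbzcat when it is installed.
import argparse
import shutil
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument("dump_path")
parser.add_argument(
    "--unzip-command",
    default="lbzcat" if shutil.which("lbzcat") else "bzcat",
    help="external command used to decompress the .bz2 dump to stdout",
)
args = parser.parse_args()

# The chosen command writes the decompressed XML to a pipe; the XML parser
# reads from proc.stdout instead of opening the file itself.
proc = subprocess.Popen([args.unzip_command, args.dump_path],
                        stdout=subprocess.PIPE)
# ... feed proc.stdout to the XML parser here, then:
proc.wait()
```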
===
I hadn't even realized Tatu made his own parser in dumpparser.py. That's kind of amazing, but using a library built on C code and purpose-built for exactly this (`iterparse` seems perfect for it, from what I can glean of the structure of the dump's XML) sounds like it should speed things up.
I don't think people spend that much time parsing the dump file except to generate a cache, though. It takes some time, but not that much time.
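For reference, the usual `iterparse` pattern for a multi-gigabyte MediaWiki dump looks roughly like this (a sketch, not the code from the actual commit; the namespace handling and the element clearing are what keep memory bounded):

```python
from lxml import etree

def iter_pages(xml_stream):
    """Yield (title, text) pairs from a MediaWiki XML dump stream."""
    for _event, elem in etree.iterparse(xml_stream, events=("end",)):
        qname = etree.QName(elem.tag)
        if qname.localname != "page":
            continue
        ns = "{" + qname.namespace + "}" if qname.namespace else ""
        title = elem.findtext(ns + "title")
        text = elem.findtext(f"{ns}revision/{ns}text")
        yield title, text
        # Free already-parsed elements so memory use stays roughly constant.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
```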
=====
As for the database stuff, I don't know enough about databases yet to understand why this is better. It would be helpful to get an explanation about the specific ways it would be better than what we do now: the cache file is extremely simple and the index to it is also simple (just "article" -> "location in cache file", I think), so what does a complicated database bring to the table?
I'll use `lbzcat` if it's available and otherwise fall back to `bzcat`. And here is the commit that uses lxml: https://github.com/xxyzz/wikitextprocessor/commit/22e6d79800b5d1f8c2b0dfa2a5550e7a33153371
I think using SQLite is actually simpler and more efficient than managing a cache file and keeping all the data in memory. The current code tracks the file offset and writes the memory buffer to that file, which is not simple. A database won't load all the data into memory; it only loads the data it needs from disk. The in-memory data also gets copied to subprocesses, which would waste lots of RAM, but I haven't profiled the memory usage yet because I don't have enough RAM to test.
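To illustrate the "simpler" part: the read path becomes a single indexed query instead of seek-and-read bookkeeping. A minimal sketch with a hypothetical `pages` table (not the actual schema in the branch):

```python
import sqlite3

conn = sqlite3.connect("wikt.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (title TEXT PRIMARY KEY, body TEXT)"
)

def read_by_title(title: str) -> str | None:
    # No hand-maintained offset index: the primary-key index finds the row,
    # and only that row is read from disk.
    row = conn.execute(
        "SELECT body FROM pages WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None
```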
I would also find a SQLite DB easier to work with, compared to the `--pages-dir` feature, which is an improvement over the single cache file but still unwieldy. Having hundreds of thousands of files is difficult when you need to look through the data at all; my terminal will freeze for long periods of time if I ever `ls`, which also happens if I hit tab to get completions (which is automatic for me). Deleting this directory also takes a long time, and searching takes some work, while both of these would be super simple in a SQLite DB.
I've never tried to open the huge directories in Finder because I'm afraid of freezing my computer. If the data were in SQLite I would open it with DB Browser, and I would search and filter and explore with absolutely no performance issues. It would be much easier to work with.
The directory in --pages-dir contains text files with un-parsed wikitext stuff ripped straight out of the XML dump file; it's not really the output of wikitextprocessor or wiktextract, it's the (modified, cached, saved) input. Searching through it can be useful for dev stuff, but wikitextprocessor is not meant to create databases of wikitext. That sounds like a different side-project.
I use wikt-cache on beefy machines here at work, and --override on specific files I want to modify. I also spend a lot of time looking at the history of articles on wiktionary to see what changes could have broken something, which ends up as my main way to look at the wikitext sources. Admittedly, I don't do a lot of grepping in the pages directory, and only navigate it using tab-completion.
Tatu gives the ok to try to migrate wikt-cache and wikt-cache.pickle to some kind of database solution. pages-dir is only a debug tool for developers and shouldn't be touched; that's just a handy collection of files to access when parsing individual pages. The functions related to wikt-cache should all be in one place and it shouldn't be too difficult to replace them.
Keep in mind that if there is no wikt-cache, it is recreated every time wiktwords is run; I assume this is the expected behavior even with a database, i.e. a temporary database is created for each run unless a previously created database is given as a parameter.
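Keeping that behavior would be straightforward: when no existing database is passed in, create a throwaway one for the run. A sketch under that assumption (the `db_path` parameter is hypothetical):

```python
import sqlite3
import tempfile
from pathlib import Path

def open_db(db_path: str | None = None) -> sqlite3.Connection:
    if db_path is None:
        # No previously created database was given, so build a temporary one
        # for this run, just as the cache file is recreated today.
        db_path = str(Path(tempfile.mkdtemp()) / "wikt.db")
    return sqlite3.connect(db_path)
```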
But it would be unnecessary to save pages to files, because all page texts will be stored in the database and the texts in `pages-dir` are currently copied from the huge cache file.
The saved pages are just a side thing, apparently, and are not used except as the source from which you can get individual pages for the page parameter. That is, everything else is always from the wikt-cache (or override=). The pages directory is for when you need to get an "address" (pages/Words/ad/address.txt) to a specific article (EDIT: for command-line purposes) and for grepping for debugging purposes. Everything else is just wikt-cache.
I would also see the features as redundant. If I had a SQLite DB with all the pages, I would search in there rather than grepping millions of files.
> wikitextprocessor is not meant to create databases of wikitext. That sounds like a different side-project.
I'm not looking to create a SQLite DB for other uses. While trying to add features and tests to Wiktextract itself I needed to search for articles with specific features to try, and re-ran the tool on specific pages repeatedly during development. The DB would make development easier.
> The directory in --pages-dir contains text files with un-parsed wikitext stuff ripped straight out of the XML dump file; it's not really the output of wikitextprocessor or wiktextract, it's the (modified, cached, saved) input.
If the data is different from what's stored in the cache file, then I would just add another table to the DB and put it there. The DB can take care of both things. It's designed to make it easy to store, retrieve and search huge amounts of structured data, so it's an ideal application.
You already have approval to do whatever you think is necessary to replace wikt-cache with a database. Just leave `--pages-dir` as is for others who do use it.
I have migrated the code to use SQLite in the sqlite branch: https://github.com/xxyzz/wikitextprocessor/tree/sqlite; it's created from the lua5.1 branch. All tests have passed. I'll change the wiktextract code and then test the `wiktwords` command.
If you commit the wiktwords changes to your repo I can also do testing.
I pushed the changes to https://github.com/xxyzz/wiktextract/tree/sqlite, but `wiktextract/tests/test_long.py` gets a strange "no such table" SQL error; I'm not sure what went wrong.
The SQL error is fixed. I'm running the `wiktwords` command; it's still in the first extract phase.
wiktwords --all --all-languages --out data.json --db-path db --override wiktextract/overrides enwiktionary-20230501-pages-articles.xml.bz2 |& tee log
The SQLite insert speed is kind of slow. I tested yesterday: it inserted about 2000 rows (pages) per second, and it took around 50 minutes to finish the first phase.
I have fixed some errors in the first-phase code. The memory usage of the second phase is still increasing over time, but more slowly than before.
SQLAlchemy's upsert method is very slow; maybe I didn't use it correctly. The insert speed improved a lot after this change: https://github.com/xxyzz/wikitextprocessor/commit/05b16b1f31247fb85d83712233a55d94408999fb
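For anyone hitting the same slowdown: per-row upserts that each commit their own transaction are usually the bottleneck, and batching rows into one transaction is the general fix. A sketch of the idea in plain sqlite3 (not the code in the linked commit):

```python
import sqlite3

def insert_pages(conn: sqlite3.Connection, pages: list[tuple[str, str]]) -> None:
    # One transaction for the whole batch instead of a commit per row;
    # INSERT OR REPLACE plays the role of the upsert here.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO pages (title, body) VALUES (?, ?)",
            pages,
        )
```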
I think I have fixed most errors and both sqlite branches are ready to be reviewed.
Did you delete the pull request? There's only the lua5.1 branch visible as a pull request currently.
I didn't create pull requests, because both branches (xxyzz/wikitextprocessor/tree/sqlite and xxyzz/wiktextract/tree/sqlite) are created from the branches in my other two open pull requests (the lua5.1 and pyproject branches). And those lua5.1 pull requests are blocked by my Lupa pull request (or a new Lupa release).
You could see the changes from these links:
And compare with lua5.1 branches:
I have rebased both the lua5.1 and pyproject branches onto the master/main branch.
I have created two pull requests: https://github.com/tatuylonen/wikitextprocessor/pull/41 and https://github.com/tatuylonen/wiktextract/pull/248
Great work. Is there a rough idea of the performance differences?
I didn't wait for the command to finish running, but here is a comparison of the estimated remaining time:
master branch: ... 35776/8447283 pages (0.4%) processed, 38:08:30 remaining
sqlite branch: ... 32670/7469140 pages (0.4%) processed, 29:39:46 remaining
That is a huge improvement! Kaikki seems to be regenerating OK (either because of the override on xnt-decl or because I fixed the issue with git...), and it took around 5 hours to finish the wiktextract run, so if all goes well it could drop down to 4 or 3½ hours (most of the time is taken by the site generation itself: taking the data and doing some post-processing + HTML generation).
I am tempted to merge the sqlite branch and fix the issues with xnt-decl poisoning the globals (even though they should be reset) after that...
I will also report that processing individual pages used to have a small pause at the beginning (99% certainly because of file-operation stuff) that is basically gone now.
Seems like everything is completed from the original post, so I'll close this issue. Great work!
Replace `bzcat` with the multi-threaded `lbzcat` (https://github.com/kjn/lbzip2) here: https://github.com/tatuylonen/wikitextprocessor/blob/e0b808d89ff2feb059cc37ebdbb0fafbf215c4e8/wikitextprocessor/dumpparser.py#L184-L189

Use lxml (`iterparse()`) to parse the dump file in the `process_input()` function linked above, and we could ignore some namespaces there.

Store pages in a database (`Wtp.add_page()` and `Wtp.read_by_title()`) and enable multiprocessing in the `process_dump()` function (phase 1). Table schema:

Welcome any suggestions and comments.