tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

--categories-file is currently broken #45

Closed kristian-clausal closed 1 year ago

kristian-clausal commented 1 year ago

wiktwords --all-languages --all --db-path wikt-db --pages-dir pages --categories-file categories-test.json dumps/enwiktionary-20230420-pages-articles.xml.bz2

Testing out creating a database file and pages directory, resulting in:

Emitting thesaurus main entry for तडित्/Sanskrit/noun (not in main)
Emitting thesaurus main entry for linguist/English/noun (not in main)
Emitting thesaurus main entry for combining form/English/noun (not in main)
2023-05-17 09:13:29,441 INFO: Reprocessing wiktionary complete
Extracting category tree
Traceback (most recent call last):
  File "/home/kristian/.local/bin/wiktwords", line 8, in <module>
    sys.exit(main())
  File "/home/kristian/Repos/wiktextract/wiktextract/wiktwords.py", line 360, in main
    tree = extract_categories(ctx, config)
  File "/home/kristian/Repos/wiktextract/wiktextract/categories.py", line 75, in extract_categories
    ctx.add_page(f"{module_ns_local_name}:wiktextract cat tree",
  File "/home/kristian/Repos/wikitextprocessor/wikitextprocessor/core.py", line 532, in add_page
    self.db_conn.execute("""INSERT INTO pages (title, namespace_id, body,
sqlite3.ProgrammingError: Cannot operate on a closed database.

Looking at what wiktwords is actually doing there, the cause is the --categories-file parameter, which was left over when I pulled this command from my shell history with Ctrl-R. ctx.add_page() needs to be checked to see whether it is being called on a closed database, as it is here. The full run on the kaikki regen machine seems to be running fine, though, so hopefully kaikki will regenerate well.
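For reference, the failure mode in the traceback can be reproduced with the standard library alone. This is a minimal sketch, not the wikitextprocessor code: the two-column `pages` table and the `is_open` helper are made up for illustration, and only `sqlite3` itself is assumed.

```python
import sqlite3


def is_open(conn: sqlite3.Connection) -> bool:
    """Hypothetical helper: probe the connection with a trivial query."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:
        # sqlite3 raises ProgrammingError for any operation on a closed connection
        return False


# Stand-in for the real database; the schema here is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, body TEXT)")
conn.close()

# Any statement issued after close() fails the same way as in the traceback.
try:
    conn.execute(
        "INSERT INTO pages (title, body) VALUES (?, ?)",
        ("Module:wiktextract cat tree", ""),
    )
    error_message = None
except sqlite3.ProgrammingError as exc:
    error_message = str(exc)

print(error_message)
print(is_open(conn))
```

A probe like `is_open` is one possible way to guard a call such as add_page, though making sure the connection is not closed before the category step runs is probably the simpler fix.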

kristian-clausal commented 1 year ago

Our kaikki regen script runs wiktwords with --categories-file, so the process will crash and kaikki will not regenerate.

xxyzz commented 1 year ago

https://github.com/xxyzz/wiktextract/commit/e7602502a5bab97412bef7f2dfffef5ee2e21d94 fixes this error. I'm also fixing --modules-file and --templates-file options. I'll create a pull request later.

kristian-clausal commented 1 year ago

Thanks, I was worried that it would be a bigger job, like checking for open db connections in add_page or similar.