I'll try to take a look at this next week. Please also give more context around `test_simplepage(self)` (is this from a test file?), and whatever output you get. Do the .... mean you actually get output around the templates, for example?
Thanks @kristian-clausal !
> Please also give more context around `test_simplepage(self)` (is this from a test file?)
I just tried to run my code as a test, but it doesn't matter. Here is the actual code:
```python
from functools import partial
from pathlib import Path
from typing import Any

from wikitextprocessor import Wtp, Page
from wikitextprocessor.dumpparser import process_dump


def page_handler(page: Page, wtp: Wtp | None = None) -> Any:
    wtp.start_page(page.title)
    node = wtp.parse(page.body, pre_expand=True)
    value = wtp.node_to_wikitext(node)
    print(value)


if __name__ == '__main__':
    wtp = Wtp(db_path=Path('../db/mydb'))
    process_dump(
        wtp,
        "./Wikipedia-20230825082710.xml.bz2",
        {0, 10, 110, 828},  # namespace ids
        save_pages_path=Path('../debug')
    )
    print("Num pages:", wtp.saved_page_nums())
    for _ in map(
        partial(page_handler, wtp=wtp), wtp.get_all_pages([0])
    ):
        pass
```
> Do the .... mean you actually get output around the templates, for example?
Yes, I do get output. It is just a bit long, so I decided to simplify it. The main problem is that all the templates come out as `<strong class="error">Template:val</strong>`. Here is the full output:
Thanks, I was able to reproduce the errors, but as I said, I'll have to take a look at it next week when I'm back at work.
A preliminary look is that... Ok, this is newer stuff: the process fails to load the modules because there have been recent additions to the function parameters concerning namespace ids, etc.
The parser needs to know the namespace ids, which are provided by a data/[lang_code]/namespace.json file.
You're parsing Wikipedia, which means the namespace ids are possibly different from the ones on en.wiktionary.org, although the Modules id was the same.
So at least there is a need for a namespace.json file in an appropriate data folder. However, those are hard-coded, so I don't think we have en.wikipedia.org specific data folders, because we've only been working with Wiktionary stuff, which is in hindsight an oversight.
@xxyzz is this about correct? Do you think using the Wiktionary namespaces.json is good enough for this?
And NOW, I'm really off to the weekend!
Thanks! Have a great weekend!
The template and module pages can't be found because English Wikipedia also uses an upper-case first letter in page titles but lower case in wikitext; this could be fixed similarly to the Chinese Wiktionary.
Then I get a `[string "Module:val"]:775: bad argument #3 to 'format' (string expected, got userdata)` error.
The Lua error I posted above occurs because some functions added here: https://github.com/tatuylonen/wikitextprocessor/blob/e5296c16f2d715e62121f23cb5057374da48cda3/wikitextprocessor/luaexec.py#L708 can't be passed to `string.format` (https://en.wikipedia.org/wiki/Module:Val#L-775).
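For illustration, a minimal lupa sketch of that failure mode (assuming lupa's default runtime; the exact argument number in the error message depends on the call site):

```python
# Sketch: a Python callable crosses into Lua as userdata, which
# string.format's %s rejects (matching the Module:val error above).
from lupa import LuaRuntime

lua = LuaRuntime()
fmt = lua.eval("function(v) return string.format('%s', v) end")

print(fmt("ok"))  # works: prints 'ok'
try:
    fmt(lambda: None)  # the Python function arrives in Lua as userdata
except Exception as e:
    print(e)  # ... bad argument #2 to 'format' (string expected, got userdata)
```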
Thanks for agreeing to look into this @kristian-clausal -- FYI there will be a lot of people looking out for this because there's an ongoing Kaggle competition at the moment that needs clean wikipedia data!
"Module:Arguments" checks type(frame.getParent)
is "function"
at here: https://en.wikipedia.org/wiki/Module:Arguments#L-97 but because getParent
is Python function, its type is userdata
. An easy fix would be overwrite the "Module:Arguments" page in the sqlite db file and change the line 97 to:
if type(frame.args) == 'table' then
https://github.com/tatuylonen/wikitextprocessor/pull/89/commits/4bc34c4db234098c9542161d27ebb37afb5295da and https://github.com/tatuylonen/wikitextprocessor/pull/89/commits/e3614fd13215ab9cebf8b2dda6f5c50419165ec5 are also needed.
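For anyone trying the overwrite route, a minimal sketch of what that might look like (this assumes `Wtp.add_page(title, namespace_id, body)` replaces the stored copy of a page; the patched file path is hypothetical):

```python
# Sketch: replace the stored "Module:Arguments" with a locally patched copy.
from pathlib import Path
from wikitextprocessor import Wtp

wtp = Wtp(db_path=Path("../db/mydb"))
# Hypothetical local copy of the module with line 97 edited as above.
patched = Path("Arguments.patched.lua").read_text()
wtp.add_page("Module:Arguments", 828, patched)  # 828 is the Module: namespace id
# Depending on the version, an explicit commit on the db connection may be needed.
```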
> Thanks for agreeing to look into this @kristian-clausal -- FYI there will be a lot of people looking out for this because there's an ongoing Kaggle competition at the moment that needs clean wikipedia data!
Here are two pitfalls with wikitextprocessor you might want to keep in mind:
In conclusion, if you don't need to do the kind of stuff we need for wiktextract (wikitext -> parse tree with template nodes that can be processed piecemeal), you might just be better off simply downloading a Wikipedia dump file with pre-expanded HTML pages. IIRC, there are even pages with metadata about the expansion of templates. If you're doing just expansions of templates (which would ultimately create a clean page without templates that the parser should be able to parse without problems, as long as problem #2 doesn't affect it), then it's almost the same as just downloading the HTML dump, ultimately. I don't know if there's a dump with template-less wikitext source anywhere.
@xxyzz how is it going with the stuff in the PR? If things might conflict, I could assign this to you to poke at. You know more about the namespace id stuff, too.
It's not about the namespace ids, because Wiktionary and Wikipedia use the same namespace ids for templates and modules, and if there are any differences, the namespace JSON file can be updated by running the get_namespaces.py file.
The two commits I made are for handling the case of page titles, but they won't fix the Lua error I posted above: https://github.com/tatuylonen/wikitextprocessor/issues/90#issuecomment-1693151176. To fix this error, I think there are two methods:

1. Overwrite the page `Module:Arguments` as suggested above. I have tested that the wikitext `{{val|879.6|0.8|u=[[second|s]]}}` can then be expanded, and we could consider the issue resolved.
2. Wrap `getParent` inside a Lua function, or create a Lua function that returns a Python object, but I don't know how this could be implemented.

It's possible to fix the error with a change to `make_frame()` in the file "luaexec.py" like this:
- frame["getParent"] = debugGetParent
+ frame["parent_frame"] = pframe
+ frame["getParent"] = lua.eval(
+ "function(frame) return frame.parent_frame end"
+ )
but I don't know whether it'll affect other Lua code by adding a new key-value pair to the frame table, and it also removes the debug function you added before.
I was just looking at trying to wrap the 'debug' function inside a Lua function. The "debug" names are just because I wanted to replace the lambdas that were previously there with named functions (so the only reason they're named functions is for debugging). It might be possible to pass debugGetParent to a Lua function that itself returns a Lua function that calls debugGetParent... 😮💨
~Those test functions can't be passed to Lua functions because the Lua function has to have the same parameters as the Python function.~ (I forgot the `*args` parameter in the Python code.) And those test (debug) functions just add some `print`s, which could be written in Lua. Do we still need those `print`s?
Yes, those prints are still needed because we haven't fixed the bug yet.
It might be possible to do something like:
```python
def debugGetParent(frame: "_LuaTable", *args) -> "_LuaTable":
    if args:
        print(
            f"LAMBDA GETPARENT DEBUG (title: {title}): {repr(args)}"
            f", process: {multiprocessing.current_process().name}"
        )
    if TYPE_CHECKING:
        assert isinstance(pframe, _LuaTable)
    return pframe

lua_getp_generator = lua.eval("""
    function(py_func)
        wrapper_func = function(x)
            return py_func(x)
        end
        return wrapper_func
    end
""")
wrappedDebugGetParent = lua_getp_generator(debugGetParent)
...
frame["getParent"] = wrappedDebugGetParent
frame["getTitle"] = debugGetTitle
...
```
Although I haven't yet figured out how to test this with the Wikipedia stuff. It passes the normal tests.
Why not just return the Python function `py_func`? That Lua code only accepts a single parameter.
Because if we return the `py_func`, it will have the type `userdata`; that was the original problem, wasn't it? EDIT: To specify, Lua's `type()` will unwrap the `py_func` and return `userdata`. If we wrap it in a Lua function, it will return `function`... hopefully. I think so, based on testing it in interactive mode.
EDIT: I don't really know why these functions are Python functions. I'll ask Tatu for the reason. It might just be simpler to turn them into Lua functions.
EDIT2: IIRC, Lua functions just ignore it if you feed them too many parameters. The `*args` is there so that we can check whether the function is getting too many arguments, and then print the debug line.
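A minimal lupa sketch of the `type()` behavior under discussion (assuming lupa's default runtime; names are illustrative):

```python
# Sketch: Lua sees a bare Python callable as userdata, but a Lua
# closure that forwards to it reports as a plain function.
from lupa import LuaRuntime

lua = LuaRuntime()
lua_type = lua.eval("function(v) return type(v) end")

def py_func(x):
    return x

print(lua_type(py_func))  # 'userdata'

# Wrap the Python callable in a Lua closure that forwards its arguments.
wrap = lua.eval("function(f) return function(...) return f(...) end end")
print(lua_type(wrap(py_func)))  # 'function'
```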
Returning `py_func` directly doesn't work; `type` will still return `userdata`. And because Lua ignores extra parameters, the `print` in the Python function becomes useless...
The lua wrapper function above + xxyzz/node seems to print the templates:
```
== Bibliography ==
*{{cite journal
 |author=Ерозолимский, Б.Г.
 |year=1975
 |title=Бета-распад нейтрона
 |trans-title=Neutron beta decay
 |journal=Успехи физических наук
 |volume=116 |issue=1 |pages=145–164
 |url=http://ufn.ru/ru/articles/1975/5/e/
}}
[[Category:Neutron]]
[[Category:Radioactivity]]
[[Category:Physical phenomena]]
```
But running `Wtp.parse()` with `expand_all=True` is still broken. EDIT: Actually, most of the simple templates seem to be expanded:

```
The hard-to-observe {{math| {{SubatomicParticle|W boson-}} }} quickly decays into an [[electron]]
```

->

```
The hard-to-observe <span class="texhtml+"> <span style="white-space%3Anowrap%3B"><span style="display%3Ainline-block%3Bfont-size%3A80%25%3Bline-height%3A1.0em%3Bmargin-bottom%3A-0.3em%3Btext-align%3Aright%3Bvertical-align%3A0.8em%3B"><sup style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B"><br><sub style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B"></sub></sup></span>W<span style="display%3Ainline-block%3Bfont-size%3A80%25%3Bline-height%3A1.0em%3Bmargin-bottom%3A-0.3em%3Btext-align%3Aleft%3Bvertical-align%3A0.8em%3B"><sup style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B">−</sup><br><sub style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B"></sub></span></span> </span> quickly decays into an [[electron]]
```
Ignore the above, it works because of xxyzz's branch. MessageBox is still broken, which was the problem.
Some Lua modules are broken because the Scribunto code in the `lua` folder is too old and doesn't have the `strict.lua` file. Because it's not a git submodule, we don't know its version (git commit). I added a Scribunto git submodule on the "strict" branch: https://github.com/xxyzz/wikitextprocessor/tree/strict
This still has Lua errors, but I don't have time to fix them today... and I haven't tested whether pip install works. The server install script also needs to be updated if a git submodule is used (just add two git clone options).
I went back to try the code I posted above again, and it does actually remove the errors with `userdata` leaking through. It was just that MessageBox failed because it couldn't be loaded, which I'm trying to figure out. Haven't tried the `strict` branch yet because the tests seem to be failing.
Yeah, I'm using your Lua wrapper code to get around the Lua type check error. Updating the Scribunto code is for fixing import errors in the many Lua modules that import the `strict` module at the start, and my branch currently doesn't work...
And I'm not sure if you noticed that the Lua wrapper function ignores extra parameters, so the `print` lines in the debug Python functions will never be used.
I keep on forgetting to make it whatever Lua's version of `*args` is: `...` and `args = {...}` etc., except I can't remember which version Lua 5.1 supports, and then I sit down and forget again. In this spirit, I will leave it for tomorrow!
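For the record, Lua 5.1 supports both vararg forms; a minimal sketch (the wrapped function name is hypothetical):

```lua
-- Lua 5.1 varargs: '...' forwards arguments, {...} collects them.
local function wrapper(...)
    local args = {...}           -- collect all arguments into a table
    local n = select("#", ...)   -- argument count, including trailing nils
    if n > 1 then
        print("extra arguments:", n - 1)
    end
    return wrapped_py_func(...)  -- hypothetical wrapped Python callable
end
```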
I think I have fixed the Lua error caused by the "strict.lua" file, but there are many errors in "Module:Citation/CS1/Configuration", for example:

- `uncategorized_namespaces_t` at line 32, should be `local`
- `editor_markup_patterns` at line 2350
- `maint_cats` at line 2360

And after fixing the errors in `Module:Citation/CS1/Configuration` by overwriting this page, I get two new errors:

```
[string "Module:Citation/CS1/Date_validation"]:51: attempt to compare number with nil
[string "Module:Citation/CS1/Utilities"]:82: attempt to index field 'message' (a nil value)
```
> I keep on forgetting to make it whatever Lua's version of `*args` is. `...` and `args = {...}` etc.
I found this https://www.lua.org/pil/5.2.html document yesterday, maybe it'll be helpful.
@xxyzz @kristian-clausal
Thanks folks, this is really awesome.
I looked for the template-expanded dumps that @kristian-clausal mentioned, but I couldn't find any. Do you have any links?
If they don't exist and aren't likely to, I'm thinking of writing some Lua-only code to do expansion of the pages-articles dumps.
My reasoning for Lua-only is my worry about lupa, given some of the comments above by @kristian-clausal about potential memory issues. Also, the integration bug that @xxyzz observed seems like something that could pop up in other places when using lupa.
I doubt the memory issues discussed above are in LuaJIT, as it's fairly simple and very broadly used. Lua is actually a nice and tidy framework, and can be quite performant. Its lack of third-party support is actually a feature in some ways rather than a bug, as security is much easier to validate, which is probably why Wikipedia chose Lua in the first place.
Thoughts? Any tips / suggestions about gotchas I should consider before starting this would help. In particular, I don't want to reinvent something someone else has done or is working on.
It seems that the HTML dumps have been discontinued... For over a decade. They still have a silly big link on the dumps page to a directory with stuff from 2008. In that case, yeah, doing the expanding is necessary.
> I found this https://www.lua.org/pil/5.2.html document yesterday, maybe it'll be helpful.
I noticed you folks are using lua51 in the code. I changed it, as I couldn't find lupa.lua51 easily.
edit: ok, lulz, that's not Lua 5.2 the link is referring to; however, the comment is still relevant!
OK nvm, I see that Wikipedia is using 5.1.
> It seems that the HTML dumps have been discontinued... For over a decade. They still have a silly big link on the dumps page to a directory with stuff from 2008. In that case, yeah, doing the expanding is necessary.
New HTML dumps are here: https://dumps.wikimedia.org/other/enterprise_html/runs/ but currently missing many pages.
> The integration bug also
The Lua errors I posted before are not related to lupa or Lua. It's just that our code doesn't implement some features of MediaWiki.
@xxyzz
From above, this looks like a lupa integration issue:
"Module:Arguments" checks type(frame.getParent) is "function" at here: https://en.wikipedia.org/wiki/Module:Arguments#L-97 but because getParent is Python function, its type is userdata. An easy fix would be overwrite the "Module:Arguments" page in the sqlite db file and change the line 97 to:
if type(frame.args) == 'table' then
I looked at the enterprise stuff only briefly; not very open source. :p Maybe I'm stepping on a profit center for Wikipedia, but given this is all user-sourced content, I think that's fair.
If it would be easier to figure out how to get wikitextprocessor to support pages-articles as a first-class target, I am happy to help with that. But you folks would have to agree to do first-class support. In particular, performance issues would have to be addressed, as they've been a huge stumbling block whenever we do anything with potentially TB-sized files (usually just pages-articles, which is 60 GB though).
Multiprocessing support would be an absolute requirement.
I'd much prefer to leverage existing efforts, but the problem I see right now is that nobody is taking the problem seriously. I suspect the Lua is the primary stumbling block.
Another effort is here: https://github.com/spencermountain/wtf_wikipedia/blob/643add955f6cf5ed278ef123e7fd31b951842ce6/src/template/custom/text-only/functions.js#L386
Spencer has (sorta) rewritten the Lua modules in JavaScript. :p
Also, one final question, have you folks been able to solve the original problem for this thread?
There is a dropbox file with the xml.bz2 export:
I tried the suggestions @xxyzz made above, but it didn't seem to work. If it works for you, can you paste the output we should expect to see for the first few paragraphs with the val template being used?
Some "Citation/CS1" modules are still broken(see my post: https://github.com/tatuylonen/wikitextprocessor/issues/90#issuecomment-1696730596) because the code doesn't implement mw.message
and mw.language:formatDate
. Other templates in the page "Free neutron decay" should be expended without error.
I don't even know what "first class support" means. In any case, wiktextract (which is what wikitextprocessor was created for) runs using multiprocessing, separating out pages into chunks and processing them separately; data from other pages is accessed through the new SQLite database (thanks to xxyzz).
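A minimal sketch of that pattern (not the wiktextract code itself; the worker body is a placeholder):

```python
# Sketch: pages are split into chunks and handled by worker processes;
# shared page data lives in an SQLite database instead of being passed around.
from multiprocessing import Pool

def handle_page(title: str) -> str:
    # A real worker would call wtp.start_page(title), parse, and expand,
    # reading templates/modules from the shared SQLite db.
    return f"processed {title}"

if __name__ == "__main__":
    titles = ["Free neutron decay", "Neutron"]  # hypothetical work list
    with Pool() as pool:
        for result in pool.imap_unordered(handle_page, titles, chunksize=16):
            print(result)
```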
Hmm, not sure what you mean by expanded.
Here's what I see:
```
When embedded in an [[atomic nucleus]], [[neutrons]] are (usually) stable particles. Outside the [[atomic nucleus|nucleus]], free [[neutron]]s are unstable and have a [[mean lifetime]] of {{val|879.6|0.8|u=[[second|s]]}} (about {{val|14|u=minutes}}, {{val|39.6|u=seconds}}).<ref name="PDG-2020-n-life" /> Therefore, the [[half-life]] for this process (which differs from the mean lifetime by a factor of {{math|[[Natural logarithm|ln]](2) ≈ 0.693}}) is {{val|611|1|u=s}} (about {{val|10|u=minutes}}, {{val|11|u=seconds}}).<ref name="Beringer-etal-2012-PDG-010001" /><ref name="PDG-2007-baryons-LPL" /> (An article<ref name="Gonzalez-2021" /> published in October 2021, arrives at {{val|877.75|0.50|0.44|u=s}} for the mean lifetime).
```
> I don't even know what "first class support" means. In any case, wiktextract (which is what wikitextprocessor was created for) runs using multiprocessing, separating out pages into chunks and processing them separately; data from other pages is accessed through the new SQLite database (thanks to xxyzz).
Sounds great. I just wanted to make the offer as you folks have done a lot of really great work here. I just think a solution will have to make it a primary priority to do a good job for wikipedia.
We are currently working mainly (only) with Wiktionary stuff because that's where the project started; ideally, wikitextprocessor should of course work with Wikipedia, too, but as we have seen the problem is that even though Scribunto etc. can be the same underlying engine in different Wikimedia projects, the way different language editions of Wiktionary can differ is enough to make it a PITA to handle everything. I'm not sure it's possible to create a completely universal, self-contained wikitextprocessor that works with any kind of wikitext from any kind of official Wikimedia project without going mad.
"Expansion" and "expanded" just means that templates have been processed: {{template|stuff}}
-> <template>stuff</template>
.
Agreed, I didn't think it made sense either, but wanted to confirm.
That's what I thought about expanded. I just wanted to confirm that you guys are seeing the val template fully expanded and that I was doing something wrong.
If you can post example working code I can run for the exported xml.bz2 file, that would be extremely helpful!
I'm using the code the OP posted:
```python
import os
os.chdir("/media/tb/linux/wikitextprocessor")

#!pip install wikitextprocessor
from functools import partial
from pathlib import Path
from typing import Any

from wikitextprocessor import Wtp
from wikitextprocessor.dumpparser import process_dump


def page_handler(page, wtp):
    wtp.start_page(page.title)
    node = wtp.parse(page.body, pre_expand=True)
    value = wtp.node_to_wikitext(node)
    print(value)


if __name__ == '__main__':
    wtp = Wtp()
    process_dump(
        wtp,
        "./wiki.xml.bz2",
        {0, 10, 110, 828},  # namespace ids
        save_pages_path=Path('../debug')
    )
    print("Num pages:", wtp.saved_page_nums())
    for _ in map(
        partial(page_handler, wtp=wtp), wtp.get_all_pages([0])
    ):
        pass
```
This has the diffs @xxyzz mentioned applied, plus the hack on Module:Arguments:

> `if type(frame.args) == 'table' then`
>
> https://github.com/tatuylonen/wikitextprocessor/commit/4bc34c4db234098c9542161d27ebb37afb5295da and https://github.com/tatuylonen/wikitextprocessor/commit/e3614fd13215ab9cebf8b2dda6f5c50419165ec5 are also needed.
Your script looks about right; with a similar script I get
```
When embedded in an [[atomic nucleus]], [[neutrons]] are (usually) stable particles. Outside the [[atomic nucleus|nucleus]], free [[neutron]]s are unstable and have a [[mean lifetime]] of <span class="nowrap"><span data-sort-value="7002879600000000000%E2%99%A0">879.6<span style="margin-left%3A0.3em%3Bmargin-right%3A0.15em%3B">±</span>0.8 [[second|s]]</span> (about <span class="nowrap"><span data-sort-value="7002840000000000000%E2%99%A0">14 min</span>, <span class="nowrap"><span data-sort-value="7001396000000000000%E2%99%A0">39.6 s</span>).<ref name="PDG-2020-n-life"> Therefore, the [[half-life]] for this process (which differs from the mean lifetime by a factor of <span class="texhtml+">[[Natural logarithm|ln]](2) ≈ 0.693</span>) is <span class="nowrap"><span data-sort-value="7002611000000000000%E2%99%A0">611<span style="margin-left%3A0.3em%3Bmargin-right%3A0.15em%3B">±</span>1 s</span> (about <span class="nowrap"><span data-sort-value="7002600000000000000%E2%99%A0">10 min</span>, <span class="nowrap"><span data-sort-value="7001110000000000000%E2%99%A0">11 s</span>).<ref name="Beringer-etal-2012-PDG-010001"><ref name="PDG-2007-baryons-LPL"> (An article<ref name="Gonzalez-2021"> published in October 2021, arrives at <span class="nowrap"><span data-sort-value="7002877750000000000%E2%99%A0">877.75<span style="margin-left%3A0.3em%3B"><span style="display%3Ainline-block%3Bmargin-bottom%3A-0.3em%3Bvertical-align%3A-0.4em%3Bline-height%3A1.2em%3Bfont-size%3A85%25%3Btext-align%3Aright%3B">+0.50<br>−0.44</span></span> s</span> for the mean lifetime).
```
where the `{{val}}`s are expanded. This is the output of the `print()` statement in `page_handler()`! Not the file in `debug/Words/Fr/...whateveritwas.txt`; that's actually a debug text file of the original wikitext source taken from the xml.bz2 dump file for convenience (for copying, modifying, and processing modified pages).
EDIT: This is with the most current Wikitextprocessor commit.
I think you used the wrong method, please try this:

```python
tree = wtp.parse("{{val|879.6|0.8|u=[[second|s]]}}", expand_all=True)  # or wtp.parse(page.body, expand_all=True)
text = wtp.node_to_text(tree)
```

and `save_pages_path` shouldn't be used.
xxyzz is correct; I missed the missing `expand_all=True`.
All Lua errors in page "Free_neutron_decay" are fixed on the main branch now. But there would certainly be more errors on other pages.
fwiw, after looking into this more deeply, the only thing that seems to work with any reasonable level of accuracy, and not require ongoing and continual massive engineering effort, is just importing into MediaWiki and rendering from there. There are pretty good scraping APIs that can be used to pull text from the page once rendered. Performance may be a bottleneck here, of course.
Are you using a complete Wikipedia dump as the source? I'm getting Lua errors, but then again I'm using the partial page dump file.
```
Num pages: 86
WIKIDATA QUERY succeded: item_id=None, result={'head': {'vars': ['itemLabel', 'itemDescription']}, 'results': {'bindings': [{'itemLabel': {'type': 'literal', 'value': 'None'}}]}}
Free neutron decay: ERROR: LUA error in #invoke('citation/CS1', 'citation\n', 'CitationClass=web\n') parent ('Template:Cite web', {'date': '2021-10-13', 'title': 'How Long Does a Neutron Live?', 'url': 'https://www.caltech.edu/about/news/how-long-does-a-neutron-live', 'access-date': '2021-10-14', 'website': 'California Institute of Technology', 'language': 'en'}) at ['Free neutron decay', 'Cite web', '#invoke']
[string "Module:Citation/CS1/Configuration"]:32: assign to undeclared variable 'uncategorized_namespaces_t'
```

etc.
> I'm getting Lua errors
Please see my previous post of errors in "Module:Citation/CS1/Configuration": https://github.com/tatuylonen/wikitextprocessor/issues/90#issuecomment-1696730596
I fixed these errors by overwriting that page in the SQLite db: adding `local` and replacing the undefined variables with `nil`.
I guess Wikipedia doesn't get the `assign to undeclared variable` Lua error because `mw.loadData()` doesn't run inside the same environment as the code that calls it, so the "strict" library doesn't affect "Module:Citation/CS1/Configuration". But I checked the Scribunto code, and it looks like it only copies the environment variables but doesn't remove the metatable added by the "strict" library. I still don't know how Scribunto removes it; I must have missed something.
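For context, the "strict" library works roughly like this (a simplified sketch of the usual strict.lua idea, not Scribunto's exact code), which is why an environment copy that keeps the metatable also keeps the error:

```lua
-- Simplified sketch: strict.lua installs a metatable on the global table
-- so that assigning to an undeclared global raises an error.
local mt = getmetatable(_G) or {}
mt.__newindex = function(t, name, value)
    error("assign to undeclared variable '" .. name .. "'", 2)
end
setmetatable(_G, mt)
```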
Hi, thanks for the project!
I'm trying to extract text from wiki dumps. Page example: https://en.wikipedia.org/wiki/Free_neutron_decay. I downloaded the page and its templates via https://en.wikipedia.org/wiki/Special:Export. The page contains the following template: `{{val|879.6|0.8|u=[[second|s]]}}`, which I'd like to be converted to the text `879.6±0.8 s`.
Code (the latest in the repo):
Output:
If I set `pre_expand` to `False`, then:
Probably it's something simple, but I can't find a solution. Can you please help?
GitHub doesn't allow uploading xml/bz2 files, so I uploaded my xml to Dropbox: link