tatuylonen / wikitextprocessor

Python package for Wikimedia dump processing (Wiktionary, Wikipedia, etc.): wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

How to get text from templates? #90

Closed rusg77 closed 6 months ago

rusg77 commented 1 year ago

Hi, thanks for the project!

I'm trying to extract text from wiki dumps. Example page: https://en.wikipedia.org/wiki/Free_neutron_decay (I downloaded the page and its templates via https://en.wikipedia.org/wiki/Special:Export). The page contains the template {{val|879.6|0.8|u=[[second|s]]}}, which I'd like to be converted to the text 879.6±0.8 s.

Code (using the latest version in the repo):

    def test_simple_page(self):

        def page_handler(page: Page, wtp: Wtp | None = None) -> Any:
            wtp.start_page(page.title)
            node = wtp.parse(page.body, pre_expand=True)
            value = wtp.node_to_wikitext(node)
            print(value)

        wtp = Wtp(db_path=Path('../db/mydb'))
        process_dump(
            wtp,
            "../Wikipedia-20230825082710.xml.bz2",
            {0, 10, 110, 828},  # namespace id
            save_pages_path=Path('../pages')
        )

        print("Num pages:", wtp.saved_page_nums())

        for _ in map(
                partial(page_handler, wtp=wtp), wtp.get_all_pages([0])
        ):
            pass

Output:

Num pages: 86
.... <strong class="error">Template:val</strong>...

If I set pre_expand to False, then:

Num pages: 86
.... {{val|879.6|0.8|u=[[second|s]]}}...

It's probably something simple, but I can't find a solution. Can you please help?

GitHub doesn't allow uploading xml/bz2 files, so I uploaded my XML to Dropbox: link

kristian-clausal commented 1 year ago

I'll try to take a look at this next week. Please also give more context around test_simplepage(self) (is this from a test file?) and whatever output you get. Does the .... mean you actually get output around the templates, for example?

rusg77 commented 1 year ago

Thanks @kristian-clausal !

Please also give more context around test_simplepage(self) (is this from a test file?)

I just tried to run my code as a test, but that doesn't matter. The actual code is:

from functools import partial
from pathlib import Path
from typing import Any

from wikitextprocessor import Wtp, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(page: Page, wtp: Wtp | None = None) -> Any:
    wtp.start_page(page.title)

    node = wtp.parse(page.body, pre_expand=True)
    value = wtp.node_to_wikitext(node)
    print(value)

if __name__ == '__main__':
    wtp = Wtp(db_path=Path('../db/mydb'))
    process_dump(
        wtp,
        "./Wikipedia-20230825082710.xml.bz2",
        {0, 10, 110, 828},  # namespace id
        save_pages_path=Path('../debug')
    )

    print("Num pages:", wtp.saved_page_nums())

    for _ in map(
            partial(page_handler, wtp=wtp), wtp.get_all_pages([0])
    ):
        pass

Do the .... mean you actually get output around the templates, for example, etc.

Yes, I do. The output is just a bit long, so I decided to simplify it. The main problem is that all the templates come out as <strong class="error">Template:val</strong>.

Here is the full output:

``` Num pages: 86 Free neutron decay: DEBUG: TABLE not properly closed at ['Free neutron decay'] parsing Decay process viewed from multiple levels??? started on line 91, detected on line 92 {{Short description|Decay of a neutron when outside a nucleus}} Template:use dmy dates Template:context [[Image:Beta-minus Decay.svg|thumb|300px| A [[schematic]] of the [[atomic nucleus|nucleus of an atom]] indicating {{SubatomicParticle|Beta-}} radiation, the emission of a fast electron from the nucleus (the accompanying antineutrino is omitted). In the Rutherford model for the nucleus, red spheres were protons with positive charge and blue spheres were protons [[Atomic number|tightly bound to an electron with no net charge]]. : The '''inset''' shows beta decay of a free neutron as it is understood today; an electron and antineutrino are created in this process.]] When embedded in an [[atomic nucleus]], [[neutrons]] are (usually) stable particles. Outside the [[atomic nucleus|nucleus]], free [[neutron]]s are unstable and have a [[mean lifetime]] of Template:val (about Template:val, Template:val). Therefore, the [[half-life]] for this process (which differs from the mean lifetime by a factor of Template:math) is Template:val (about Template:val, Template:val). (An article published in October 2021, arrives at Template:val for the mean lifetime). The [[beta decay]] of the neutron described in this article can be notated at four slightly different levels of detail, as shown in four layers of [[Feynman diagrams]] in a [[##multilayered Feynman diagrams anchor|section below]]. : Template:math The hard-to-observe Template:math quickly decays into an [[electron]] and its matching [[electron antineutrino|antineutrino]]. The subatomic reaction shown immediately above depicts the process as it was first understood, in the first half of the 20th century. The [[W and Z bosons|boson]] (Template:math) vanished so quickly that it was not detected until much later. Later, beta decay was understood to occur by the emission of a [[weak boson]] (Template:math), sometimes called a charged [[weak interaction|weak current]]. Beta decay specifically involves the emission of a Template:math boson from one of the [[down quark]]s hidden within the [[neutron]], thereby converting the down quark into an [[up quark]] and consequently the [[neutron]] into a [[proton]]. The following diagram gives a summary sketch of the beta decay process according to the present level of understanding. [[File:Beta Negative Decay.svg|thumb|[[Feynman diagram]] for beta decay of the neutron]] {| style="text-align%3Aleft%3Bvertical-align%3Abottom%3B" |- | style="text-align%3Acenter" | Template:small Template:small (Template:math) | | style="text-align%3Acenter" | Template:small Template:small (Template:math) | | | | |- | Template:big | | Template:big | | | | |- |    Template:math | Template:big   |   Template:math | +    |   Template:math | | | |- | | | | | Template:big | {{SubatomicParticle|Electron|link=yes}} | | +   Template:math |- | | | | | Template:big | | |- | | | | | colspan="3" |Template:small Template:math Template:small |} :Template:small For diagrams at several levels of detail, see [[##multilayered Feynman diagrams anchor|'''§ Decay process''']], below. : == Energy budget == For the free neutron, the [[decay energy]] for this process (based on the [[rest mass]]es of the neutron, proton and electron) is Template:val. That is the difference between the rest mass of the neutron and the sum of the rest masses of the products. 
That difference has to be carried away as [[kinetic energy]]. The maximal energy of the beta decay electron (in the process wherein the neutrino receives a vanishingly small amount of kinetic energy) has been measured at Template:val. The latter number is not well-enough measured to determine the comparatively tiny rest mass of the [[neutrino]] (which must in theory be subtracted from the maximal electron kinetic energy); furthermore, neutrino mass is constrained by many other methods. A small fraction (about 1 in 1,000) of free neutrons decay with the same products, but add an extra particle in the form of an emitted [[gamma ray]]: :Template:math This gamma ray may be thought of as a sort of "internal [[bremsstrahlung]]" that arises as the emitted beta particle (electron) interacts with the [[electric charge|charge]] of the proton in an electromagnetic way. In this process, some of the decay energy is carried away as [[photon energy]]. Gamma rays produced in this way are also a minor feature of beta decays of bound neutrons, that is, those within a nucleus. A very small minority of neutron decays (about four per million) are so-called "two-body (neutron) decays", in which a proton, electron and antineutrino are produced as usual, but the electron fails to gain the 13.6 eV necessary energy to escape the proton (the [[ionization energy]] of [[hydrogen]]), and therefore simply remains bound to it, as a neutral [[hydrogen atom]] (one of the "two bodies"). In this type of free neutron decay, in essence all of the neutron decay energy is carried off by the antineutrino (the other "body"). The transformation of a free proton to a neutron (plus a positron and a neutrino) is energetically impossible, since a free neutron has a greater mass than a free proton. However, see [[proton decay]]. == Decay process viewed from multiple levelsTemplate:anchor == Understanding of the beta decay process developed over several years, with the initial understanding of [[Enrico Fermi]] and colleagues starting at the "superficial" first level in the diagram below. Current understanding of weak processes rest at the fourth level, at the bottom of the chart, where the [[nucleons]] (the [[neutron]] and its successor [[proton]]) are largely ignored, and attention focuses only on the interaction between two quarks and a charged boson, with the decay of the boson almost treated as an afterthought. Because the charged [[weak boson]] (Template:math) vanishes so quickly, it was not actually observed during the first half of the 20th century, so the diagram at level 1 omits it; even at present it is for the most part inferred by its after-effects. Template:clear : {| style="text-align:left;vertical-align:bottom;" |} |- |colspan=9| Template:left |   |- | Template:math | Template:big | Template:math | + | | Template:math | | +   Template:math |   |   Template:small |- |colspan=9|
Template:left | |- | Template:math | Template:big | Template:math | | +   Template:math | | | | |   Template:small |- | | | | | Template:big | Template:math | | +   Template:math | |   Template:small |- |colspan=9|
Template:left    | |- | Template:math | Template:big | Template:math | + | Template:math | | | | |   Template:small |- | | | | | Template:big | {{SubatomicParticle|Electron|link=yes}} | | +   Template:math | |   Template:small |- |colspan=9|
Template:left | |- | Template:math | Template:big | Template:math | + | Template:math | | | | |   Template:small |- | | | | | Template:big | Template:math | | +   Template:math | |   Template:small |}
== Neutron lifetime puzzle == While the neutron lifetime has been studied for decades, there currently exists a lack of [[consilience]] on its exact value, due to different results from two experimental methods ("bottle" versus "beam"Template:efn). The "neutron lifetime anomaly" was discovered after the refinement of experiments with ultracold neutrons. While the [[error margin]] was once overlapping, increasing refinement in technique which should have resolved the issue has failed to demonstrate convergence to a single value. The difference in mean lifetime values obtained as of 2014 was approximately 9 seconds. Further, a prediction of the value based on [[quantum chromodynamics]] as of 2018 is still not sufficiently precise to support one over the other.Template:efn As explained by Wolchover (2018), the beam test would be incorrect if there is a decay mode that does not produce a proton. On 13 October 2021 the lifetime from the bottle method was updated to \tau_n=877.75 s{{Cite web|date=2021-10-13|title=How Long Does a Neutron Live?|url=https://www.caltech.edu/about/news/how-long-does-a-neutron-live|access-date=2021-10-14|website=California Institute of Technology|language=en}} increasing the difference to 10 seconds below the beam method value of \tau_n=887.7 s{{Cite journal|last1=Wilson|first1=Jack T.|last2=Lawrence|first2=David J.|last3=Peplowski|first3=Patrick N.|last4=Eke|first4=Vincent R.|last5=Kegerreis|first5=Jacob A.|date=2021-10-13|title=Measurement of the free neutron lifetime using the neutron spectrometer on NASA's Lunar Prospector mission|url=https://link.aps.org/doi/10.1103/PhysRevC.104.045501|journal=Physical Review C|volume=104|issue=4|pages=045501|doi=10.1103/PhysRevC.104.045501| arxiv=2011.07061|bibcode=2021PhRvC.104d5501W|s2cid=226955795}}{{Cite journal|last=Anonymous|date=2013-11-27|title=Discrepancy in Neutron Lifetime Still Unresolved|url=https://physics.aps.org/articles/v6/s150|journal=Physics|language=en|volume=6|doi=10.1103/Physics.6.s150|bibcode=2013PhyOJ...6S.150.}} and also on the same date a novel third method using data from the past NASA's [[Lunar Prospector|Lunar prospector]] mission reported a value of \tau_n=887 s{{Cite journal|last1=Wilson|first1=Jack T.|last2=Lawrence|first2=David J.|last3=Peplowski|first3=Patrick N.|last4=Eke|first4=Vincent R.|last5=Kegerreis|first5=Jacob A.|date=2021-10-13|title=Measurement of the free neutron lifetime using the neutron spectrometer on NASA's Lunar Prospector mission|url=https://link.aps.org/doi/10.1103/PhysRevC.104.045501|journal=Physical Review C|language=en|volume=104|issue=4|pages=045501|doi=10.1103/PhysRevC.104.045501| arxiv=2011.07061 |bibcode=2021PhRvC.104d5501W|s2cid=226955795|issn=2469-9985}}{{Cite journal|last1=Lawrence|first1=David J.|last2=Wilson|first2=Jack T.|last3=Peplowski|first3=Patrick N.|date=1 February 2021|title=Space-based measurements of neutron lifetime: Approaches to resolving the neutron lifetime anomaly|url=https://linkinghub.elsevier.com/retrieve/pii/S0168900220313164|journal=Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment|language=en|volume=988|pages=164919|doi=10.1016/j.nima.2020.164919|arxiv=2011.06095|bibcode=2021NIMPA.98864919L|s2cid=226307043}} but with great uncertainty. 
Yet another approach similar to the beam method has been explored with the [[J-PARC|Japan Proton Accelerator Research Complex]] (J-PARC) but it is too imprecise at the moment to be of significance on the analysis of the discrepancy.{{Cite journal|last1=Hirota|first1=K|last2=Ichikawa|first2=G|last3=Ieki|first3=S|last4=Ino|first4=T|last5=Iwashita|first5=Y|last6=Kitaguchi|first6=M|last7=Kitahara|first7=R|last8=Koga|first8=J|last9=Mishima|first9=K|last10=Mogi|first10=T|last11=Morikawa|first11=K|date=2020-12-15|title=Neutron lifetime measurement with pulsed cold neutrons|url=https://academic.oup.com/ptep/article/doi/10.1093/ptep/ptaa169/6020274|journal=Progress of Theoretical and Experimental Physics|language=en|volume=2020|issue=12|pages=123C02|doi=10.1093/ptep/ptaa169|arxiv=2007.11293|issn=2050-3911}}{{Cite web|date=2021-07-02|title=KEK tackles neutron-lifetime puzzle|url=https://cerncourier.com/a/kek-tackles-neutron-lifetime-puzzle/|access-date=2021-12-02|website=CERN Courier|language=en-GB}} == See also == * [[Halbach array]]-used in the "bottle" method == Footnotes == Template:notelist == References == Template:reflist == Bibliography == *Template:cite journal [[Category:Neutron]] [[Category:Radioactivity]] [[Category:Physical phenomena]] ```
kristian-clausal commented 1 year ago

Thanks, I was able to reproduce the errors, but as I said, I'll have to take a proper look at it next week when I'm back at work.

A preliminary look suggests... OK, this is newer stuff: the process fails to load the modules because there have been recent additions to the function parameters concerning namespace ids, etc.

The parser needs to know the namespace ids, which are provided by a data/[lang_code]/namespace.json file.

You're parsing Wikipedia, which means the namespace ids are possibly different from the ones on en.wiktionary.org, although the Module namespace id was the same.

So at the very least there is a need for a namespace.json file in an appropriate data folder. However, those are hard-coded, so I don't think we have en.wikipedia.org-specific data folders, because we've only been working with Wiktionary stuff, which is in hindsight an oversight.

@xxyzz is this about correct? Do you think using the Wiktionary namespaces.json is good enough for this?

And NOW, I'm really off to the weekend!

rusg77 commented 1 year ago

Thanks! Have a great weekend!

xxyzz commented 1 year ago

The template and module pages can't be found because English Wikipedia capitalizes the first letter of page titles but uses lower case in wikitext; this could be fixed in the same way as for the Chinese Wiktionary.

Then I get a [string "Module:val"]:775: bad argument #3 to 'format' (string expected, got userdata) error.

xxyzz commented 1 year ago

The Lua error I posted above occurs because some functions added at https://github.com/tatuylonen/wikitextprocessor/blob/e5296c16f2d715e62121f23cb5057374da48cda3/wikitextprocessor/luaexec.py#L708 can't be passed to string.format (https://en.wikipedia.org/wiki/Module:Val#L-775).

jph00 commented 1 year ago

Thanks for agreeing to look into this @kristian-clausal -- FYI there will be a lot of people looking out for this because there's an ongoing Kaggle competition at the moment that needs clean wikipedia data!

xxyzz commented 1 year ago

"Module:Arguments" checks type(frame.getParent) is "function" at here: https://en.wikipedia.org/wiki/Module:Arguments#L-97 but because getParent is Python function, its type is userdata. An easy fix would be overwrite the "Module:Arguments" page in the sqlite db file and change the line 97 to:

if type(frame.args) == 'table' then

https://github.com/tatuylonen/wikitextprocessor/pull/89/commits/4bc34c4db234098c9542161d27ebb37afb5295da and https://github.com/tatuylonen/wikitextprocessor/pull/89/commits/e3614fd13215ab9cebf8b2dda6f5c50419165ec5 are also needed.
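
A rough sketch (not from the thread) of how that overwrite could be scripted. The method names Wtp.get_page / Wtp.add_page, the module namespace id 828, and the exact text of the original line 97 are assumptions to check against the current wikitextprocessor code:

# Hypothetical sketch: patch "Module:Arguments" in the page database so the
# type() check passes. Names, namespace id and the original line text below
# are assumptions, not verified against the repo or the module source.
page = wtp.get_page("Module:Arguments", 828)
patched = page.body.replace(
    "type(frame.getParent) == 'function'",  # approximate original check on line 97
    "type(frame.args) == 'table'",          # replacement suggested above
)
wtp.add_page("Module:Arguments", 828, body=patched)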

kristian-clausal commented 1 year ago

Thanks for agreeing to look into this @kristian-clausal -- FYI there will be a lot of people looking out for this because there's an ongoing Kaggle competition at the moment that needs clean wikipedia data!

Here's two pitfalls with wikitextprocessor you might want to keep in mind:

  1. The parser can create bad parse trees because of the way wikitextprocessor parses things piecemeal. For wiktextract (for which wikitextprocessor was really created) it is very useful to 'know' what template is being extracted, so that it can use the template name and arguments as data. However, this means that in general the parser is broken, because it can't rely on seeing the whole text at once; a template can easily contain HTML tags, for example, that are part of an element that started in another template, and without expanding both templates the two tags get separated and the parsing breaks down. Technically, this could all be fixed with a ridiculously complex string + character metadata system, where each character of a page would have associated metadata that tells how it came to be, but we have to use a ton of regexes and string manipulation to process those strings, and coupling manipulated strings with their original metadata sounds like an insanely complex process using just standard Python tools. So yeah, the parser can't be 100% trusted.
  2. There is a really weird bug with the Lua engine that causes it to do really weird things with memory that I haven't been able to tackle. It is not reproducible, affects single multiprocessing processes (if it manifests at all), and basically scrambles function references around so that a function name might suddenly refer to a completely different function... It might even be a Lupa bug, which is written in Cython. So basically, Lua parsing could break down at any time for unknown reasons, possibly as a result of modules interacting with the global namespace in a 'stars align' order.

In conclusion, if you don't need to do the kind of stuff we need for wiktextract (wikitext -> parse tree with template nodes that can be processed piecemeal), you might just be better off simply downloading a Wikipedia dump file with pre-expanded HTML pages. IIRC, there are even pages with metadata about the expansion of templates. If you're doing just expansion of templates (which would ultimately create a clean page without templates that the parser should be able to parse without problems, as long as problem #2 doesn't affect it), then it's almost the same as just downloading the HTML dump, ultimately. I don't know if there's a dump with template-less wikitext source anywhere.

kristian-clausal commented 1 year ago

@xxyzz how is it going with the stuff in the PR? If things might conflict, I could assign this to you to poke at. You know more about the namespace id stuff, too.

xxyzz commented 1 year ago

It's not about the namespace ids, because Wiktionary and Wikipedia use the same namespace ids for templates and modules, and if there are any differences, the namespace JSON file can be updated by running the get_namespaces.py file.

The two commits I made handle the letter case of page titles, but they won't fix the Lua error I posted above: https://github.com/tatuylonen/wikitextprocessor/issues/90#issuecomment-1693151176 . To fix this error, I think there are two options:

  1. remove the type-checking code in Module:Arguments; I have tested that the wikitext {{val|879.6|0.8|u=[[second|s]]}} can be expanded this way, so we could consider the issue resolved.
  2. wrap the Python function getParent inside a Lua function, or create a Lua function that returns a Python object, but I don't know how this could be implemented.
xxyzz commented 1 year ago

It's possible to fix the error with a change to make_frame() in "luaexec.py" like this:

-        frame["getParent"] = debugGetParent
+        frame["parent_frame"] = pframe
+        frame["getParent"] = lua.eval(
+            "function(frame) return frame.parent_frame end"
+        )

but I don't know whether adding a new key-value pair to the frame table will affect other Lua code, and it also removes the debug function you added before.

kristian-clausal commented 1 year ago

I was just looking at trying to wrap the 'debug' function inside a Lua function. The "debug" names are just because I wanted to replace the lambdas that were previously there with named functions (so the only reason why they're named functions is for debugging). It might be possible to pass debugGetParent to a Lua function that itself returns a Lua function that calls debugGetParent... 😮‍💨

xxyzz commented 1 year ago

~Those test functions can't be passed to Lua functions because the Lua function has to have the same parameters as the Python function.~ (I forgot the *args parameter in the Python code.) Those test (debug) functions just add some prints, which could be written in Lua. Do we still need those prints?

kristian-clausal commented 1 year ago

Yes, those prints are still needed because we haven't fixed the bug yet.

It might be possible to do something like:

        def debugGetParent(frame: "_LuaTable", *args) -> "_LuaTable":
            if args:
                print(
                    f"LAMBDA GETPARENT DEBUG (title: {title}): {repr(args)}"
                    f", process: {multiprocessing.current_process().name}"
                )
            if TYPE_CHECKING:
                assert isinstance(pframe, _LuaTable)
            return pframe

        lua_getp_generator = lua.eval("""
            function(py_func)
                wrapper_func = function(x)
                    return py_func(x)
                end
                return wrapper_func
            end
        """)

        wrappedDebugGetParent = lua_getp_generator(debugGetParent)
...
        frame["getParent"] = wrappedDebugGetParent
        frame["getTitle"] = debugGetTitle
   ...

Although I haven't yet figured out how to test this with the Wikipedia stuff. It passed the normal tests.

xxyzz commented 1 year ago

Why not just return the Python function py_func? That Lua code only accepts a single parameter.

kristian-clausal commented 1 year ago

Because if we return the py_func, it will have the type userdata; that was the original problem, wasn't it? EDIT: To specify, Lua's type() will unwrap the py_func and return userdata. If we wrap it in a Lua function, it will return function... hopefully. I think so, based on testing it in interactive mode.

EDIT: I don't really know why these functions are Python functions. I'll ask Tatu for the reason. It might just be simpler to turn them into Lua functions.

EDIT2: IIRC, Lua functions just ignore extra parameters if you feed them too many. The *args is there so that we can check whether the function is getting too many arguments and then print the debug line.
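
A quick standalone lupa sketch (not from the repo; behavior assumed from lupa's documented API) of the distinction being discussed: Lua's type() reports a raw Python callable as "userdata", while a Lua closure that forwards to it reports as "function":

from lupa import LuaRuntime  # the project itself uses the lua51 binding of lupa

lua = LuaRuntime()
lua_type = lua.eval("type")  # grab Lua's own type() function

def py_get_parent(*args):
    # stand-in for the Python debugGetParent function
    return "pframe placeholder"

print(lua_type(py_get_parent))  # prints: userdata

wrap = lua.eval("function(f) return function(x) return f(x) end end")
print(lua_type(wrap(py_get_parent)))  # prints: function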

xxyzz commented 1 year ago

Returning py_func directly doesn't work; type will still return userdata. And because Lua ignores extra parameters, the print in the Python function becomes useless...

kristian-clausal commented 1 year ago

The Lua wrapper function above + the xxyzz/node branch seems to print the templates:

== Bibliography ==

*{{cite journal
 |author=Ерозолимский, Б.Г.
 |year=1975
 |title=Бета-распад нейтрона
 |trans-title=Neutron beta decay
 |journal=Успехи физических наук
 |volume=116  |issue=1  |pages=145–164
 |url=http://ufn.ru/ru/articles/1975/5/e/
}}

[[Category:Neutron]]
[[Category:Radioactivity]]
[[Category:Physical phenomena]]

But running Wtp.parse() with expand_all=True is still broken. EDIT: Actually, most of the simple templates seem to be expanded:

The hard-to-observe {{math| {{SubatomicParticle|W boson-}} }} quickly decays into an [[electron]] ->

The hard-to-observe <span class="texhtml+"> <span style="white-space%3Anowrap%3B"><span style="display%3Ainline-block%3Bfont-size%3A80%25%3Bline-height%3A1.0em%3Bmargin-bottom%3A-0.3em%3Btext-align%3Aright%3Bvertical-align%3A0.8em%3B"><sup style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B"><br><sub style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B"></sub></sup></span>W<span style="display%3Ainline-block%3Bfont-size%3A80%25%3Bline-height%3A1.0em%3Bmargin-bottom%3A-0.3em%3Btext-align%3Aleft%3Bvertical-align%3A0.8em%3B"><sup style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B">−</sup><br><sub style="font-size%3Ainherit%3Bline-height%3Ainherit%3Bvertical-align%3Abaseline%3B"></sub></span></span> </span> quickly decays into an [[electron]]
kristian-clausal commented 1 year ago

Ignore the above, it works because of xxyzz's branch. MessageBox is still broken, which was the problem.

xxyzz commented 1 year ago

Some Lua modules are broken because the Scribunto code in the lua folder is too old and doesn't have the strict.lua file. Because it's not a git submodule, we don't know its version (git commit). I added a Scribunto git submodule on the "strict" branch: https://github.com/xxyzz/wikitextprocessor/tree/strict

This still has Lua errors but I don't have time to fix them today... and I haven't tested whether pip install works. The server install script also needs to be updated if a git submodule is used (just add two git clone options).

kristian-clausal commented 1 year ago

I went back and tried the code I posted above again, and it does actually remove the errors with userdata leaking through. It was just that Message box failed because it couldn't be loaded, which I'm trying to figure out. I haven't tried the strict branch yet because the tests seem to be failing.

xxyzz commented 1 year ago

Yeah, I'm using your Lua wrapper code to get around the Lua type check error. Updating the Scribunto code is for fixing import errors in the many Lua modules that import the strict module at the start, and my branch currently doesn't work...

And I'm not sure if you noticed that the Lua wrapper function ignores extra parameters, so the print lines in the debug Python functions will never be used.

kristian-clausal commented 1 year ago

I keep forgetting to change it to whatever Lua's version of *args is (... and args = {...}, etc.), except I can't remember which form Lua 5.1 supports, and then I sit down and forget again. In this spirit, I will leave it for tomorrow!
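
For reference, a minimal sketch (not from the thread) of the vararg form being discussed, assuming the lua runtime object and debugGetParent from the make_frame() code above; Lua 5.1 supports `...` in function definitions, so extra arguments would still reach the Python debug function:

# Sketch only: same generator as in the code above, but forwarding all
# arguments with Lua 5.1 varargs so the Python debug prints can still fire.
lua_getp_generator = lua.eval("""
    function(py_func)
        return function(...)
            return py_func(...)
        end
    end
""")
wrappedDebugGetParent = lua_getp_generator(debugGetParent)
frame["getParent"] = wrappedDebugGetParent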

xxyzz commented 1 year ago

I think I have fixed the Lua error caused by the "strict.lua" file, but there are many errors in "Module:Citation/CS1/Configuration", for example:

And after fixing the errors in Module:Citation/CS1/Configuration by overwriting this page, I get two new errors:

I keep forgetting to change it to whatever Lua's version of *args is (... and args = {...}, etc.)

I found this https://www.lua.org/pil/5.2.html document yesterday, maybe it'll be helpful.

qrdlgit commented 1 year ago

@xxyzz @kristian-clausal

Thanks folks, this is really awesome.

I looked for the template-expanded dumps that @kristian-clausal mentioned, but I couldn't find any. Do you have any links?

If they don't exist and aren't likely to exist, I'm thinking of writing some Lua-only code to do expansion of the pages-articles dumps.

My reasoning for Lua-only is my worry about lupa, given some of the comments above by @kristian-clausal about potential memory issues. The integration bug that @xxyzz observed also seems like something that could pop up elsewhere when using lupa.

I doubt the memory issues discussed above are in LuaJIT, as it's fairly simple and very broadly used. Lua is actually a nice and tidy framework, and can be quite performant. Its lack of 3rd-party support is actually a feature in some ways rather than a bug, as security is much easier to validate, which is probably why Wikipedia chose Lua in the first place.

Thoughts? Any tips / suggestions about gotchas I should consider before starting this would help. In particular, I don't want to reinvent something someone else has done or is working on.

kristian-clausal commented 1 year ago

It seems that the HTML dumps have been discontinued... For over a decade. They still have a silly big link on the dumps page to a directory with stuff from 2008. In that case, yeah, doing the expanding is necessary.

qrdlgit commented 1 year ago

I found this https://www.lua.org/pil/5.2.html document yesterday, maybe it'll be helpful.

I noticed you folks are using lua51 in the code. I changed it as I couldn't find lupa.lua51 easily.

edit: ok, lulz, that link refers to section 5.2 of Programming in Lua, not Lua 5.2; however, the comment is still relevant!

qrdlgit commented 1 year ago

OK, never mind, I see that Wikipedia is using Lua 5.1.

xxyzz commented 1 year ago

It seems that the HTML dumps have been discontinued... For over a decade. They still have a silly big link on the dumps page to a directory with stuff from 2008. In that case, yeah, doing the expanding is necessary.

New HTML dumps are here: https://dumps.wikimedia.org/other/enterprise_html/runs/ but they are currently missing many pages.

The integration bug also

The Lua errors I posted before are not related to Lupa or Lua; it's just that our code doesn't implement some MediaWiki features.

qrdlgit commented 1 year ago

@xxyzz

From above, this looks like a lupa integration issue:

"Module:Arguments" checks type(frame.getParent) is "function" at here: https://en.wikipedia.org/wiki/Module:Arguments#L-97 but because getParent is Python function, its type is userdata. An easy fix would be overwrite the "Module:Arguments" page in the sqlite db file and change the line 97 to:

if type(frame.args) == 'table' then

qrdlgit commented 1 year ago

I looked at the enterprise stuff only briefly; not very open source. :p Maybe I'm stepping on a profit center for Wikipedia, but given this is all user-sourced content, I think that's fair.

qrdlgit commented 1 year ago

If it would be easier to figure out how to get wikitextprocessor to support pages-articles as a first-class use case, I am happy to help with that. But you folks would have to agree that you're willing to provide that first-class support. In particular, performance issues would have to be addressed, as they have been a huge stumbling block whenever we do anything with potentially TB-sized files (usually just pages-articles, which is 60 GB though).

Multiprocessing support would be an absolute requirement.

I'd much prefer to leverage existing efforts, but the problem I see right now is that nobody is taking the problem seriously. I suspect the Lua is the primary stumbling block.

qrdlgit commented 1 year ago

Another effort is here -- https://github.com/spencermountain/wtf_wikipedia/blob/643add955f6cf5ed278ef123e7fd31b951842ce6/src/template/custom/text-only/functions.js#L386

Spencer has (sorta) rewritten the Lua modules in JavaScript. :p

qrdlgit commented 1 year ago

Also, one final question, have you folks been able to solve the original problem for this thread?

There is a dropbox file with the xml.bz2 export:

https://www.dropbox.com/scl/fi/ftbygha0eovyz5xl3rc6y/Wikipedia-20230825082710.xml.bz2?dl=0&rlkey=kd43093wgdd2sq08rvras32rf

I tried the suggestions @xxyzz made above, but they didn't seem to work. If it works for you, can you paste the output we should expect to see for the first few paragraphs where the val template is used?

xxyzz commented 1 year ago

Some "Citation/CS1" modules are still broken(see my post: https://github.com/tatuylonen/wikitextprocessor/issues/90#issuecomment-1696730596) because the code doesn't implement mw.message and mw.language:formatDate. Other templates in the page "Free neutron decay" should be expended without error.

kristian-clausal commented 1 year ago

I don't even know what "first class support" means. In any case, wiktextract (which is what wikitextprocessor was created for) runs using multiprocessing, separating out pages into chunks and processing them separately; data from other pages is accessed through the new SQLite database (thanks to xxyzz).

qrdlgit commented 1 year ago

Hmm, not sure what you mean by expanded.

Here's what I see: When embedded in an [[atomic nucleus]], [[neutrons]] are (usually) stable particles. Outside the [[atomic nucleus|nucleus]], free [[neutron]]s are unstable and have a [[mean lifetime]] of {{val|879.6|0.8|u=[[second|s]]}} (about {{val|14|u=minutes}}, {{val|39.6|u=seconds}}).<ref name="PDG-2020-n-life" /> Therefore, the [[half-life]] for this process (which differs from the mean lifetime by a factor of {{math|[[Natural logarithm|ln]](2) ≈ 0.693}}) is {{val|611|1|u=s}} (about {{val|10|u=minutes}}, {{val|11|u=seconds}}).<ref name="Beringer-etal-2012-PDG-010001" /><ref name="PDG-2007-baryons-LPL" /> (An article<ref name="Gonzalez-2021" /> published in October 2021, arrives at {{val|877.75|0.50|0.44|u=s}} for the mean lifetime).

qrdlgit commented 1 year ago

I don't even know what "first class support" means. In any case, wiktextract (which is what wikitextprocessor was created for) runs using multiprocessing, separating out pages into chunks and processing them separately; data from other pages is accessed through the new SQLite database (thanks to xxyzz).

Sounds great. I just wanted to make the offer, as you folks have done a lot of really great work here. I just think a solution will have to make it a primary priority to do a good job for Wikipedia.

kristian-clausal commented 1 year ago

We are currently working mainly (only) with Wiktionary stuff because that's where the project started; ideally, wikitextprocessor should of course work with Wikipedia, too, but as we have seen the problem is that even though Scribunto etc. can be the same underlying engine in different Wikimedia projects, the way different language editions of Wiktionary can differ is enough to make it a PITA to handle everything. I'm not sure it's possible to create a completely universal, self-contained wikitextprocessor that works with any kind of wikitext from any kind of official Wikimedia project without going mad.

"Expansion" and "expanded" just means that templates have been processed: {{template|stuff}} -> <template>stuff</template>.

qrdlgit commented 1 year ago

Agreed, I didn't think it made sense either, but wanted to confirm.

That's what I thought about expanded. I just wanted to confirm that you guys are seeing the val template fully expanded, and that I was just doing something wrong.

If you can post example working code I can run for the exported xml.bz2 file, that would be extremely helpful!

qrdlgit commented 1 year ago

I'm using the code the OP posted:

import os
os.chdir("/media/tb/linux/wikitextprocessor")
#!pip install wikitextprocessor
from functools import partial
from pathlib import Path
from typing import Any

from wikitextprocessor import Wtp
from wikitextprocessor.dumpparser import process_dump

def page_handler(page, wtp):
    wtp.start_page(page.title)

    node = wtp.parse(page.body, pre_expand=True)
    value = wtp.node_to_wikitext(node)
    print(value)

if __name__ == '__main__':
    wtp = Wtp()
    process_dump(
        wtp,
        "./wiki.xml.bz2",
        {0, 10, 110, 828},  # namespace id
        save_pages_path=Path('../debug')
    )

    print("Num pages:", wtp.saved_page_nums())

    for _ in map(
            partial(page_handler, wtp=wtp), wtp.get_all_pages([0])
    ):
        pass

This has the diffs @xxyzz mentioned applied, plus the hack on Module:Arguments:

if type(frame.args) == 'table' then

https://github.com/tatuylonen/wikitextprocessor/commit/4bc34c4db234098c9542161d27ebb37afb5295da and https://github.com/tatuylonen/wikitextprocessor/commit/e3614fd13215ab9cebf8b2dda6f5c50419165ec5 are also needed.

kristian-clausal commented 1 year ago

Your script looks about right; with a similar script I get

When embedded in an [[atomic nucleus]], [[neutrons]] are (usually) stable particles. Outside the [[atomic nucleus|nucleus]], free [[neutron]]s are unstable and have a [[mean lifetime]] of <span class="nowrap"><span data-sort-value="7002879600000000000%E2%99%A0">879.6<span style="margin-left%3A0.3em%3Bmargin-right%3A0.15em%3B">±</span>0.8&nbsp;[[second|s]]</span> (about <span class="nowrap"><span data-sort-value="7002840000000000000%E2%99%A0">14&nbsp;min</span>, <span class="nowrap"><span data-sort-value="7001396000000000000%E2%99%A0">39.6&nbsp;s</span>).<ref name="PDG-2020-n-life"> Therefore, the [[half-life]] for this process (which differs from the mean lifetime by a factor of <span class="texhtml+">[[Natural logarithm|ln]](2) ≈ 0.693</span>) is <span class="nowrap"><span data-sort-value="7002611000000000000%E2%99%A0">611<span style="margin-left%3A0.3em%3Bmargin-right%3A0.15em%3B">±</span>1&nbsp;s</span> (about <span class="nowrap"><span data-sort-value="7002600000000000000%E2%99%A0">10&nbsp;min</span>, <span class="nowrap"><span data-sort-value="7001110000000000000%E2%99%A0">11&nbsp;s</span>).<ref name="Beringer-etal-2012-PDG-010001"><ref name="PDG-2007-baryons-LPL"> (An article<ref name="Gonzalez-2021"> published in October 2021, arrives at <span class="nowrap"><span data-sort-value="7002877750000000000%E2%99%A0">877.75<span style="margin-left%3A0.3em%3B"><span style="display%3Ainline-block%3Bmargin-bottom%3A-0.3em%3Bvertical-align%3A-0.4em%3Bline-height%3A1.2em%3Bfont-size%3A85%25%3Btext-align%3Aright%3B">+0.50<br>−0.44</span></span>&nbsp;s</span> for the mean lifetime).

where the {{val}}s are expanded. This is the output of the print() statement in page_handler()! Not the file in debug/Words/Fr/...whateveritwas.txt; that's actually a debug text file of the original wikitext source taken from the xml.bz2 dump file for convenience (for copying, modifying, and processing modified pages).

EDIT: This is with the most current Wikitextprocessor commit.

xxyzz commented 1 year ago

I think you used the wrong method; please try this:

tree = wtp.parse("{{val|879.6|0.8|u=[[second|s]]}}", expand_all=True)  # or wtp.parse(page.body, expand_all=True)
text = wtp.node_to_text(tree)

and save_pages_path shouldn't be used.
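
For completeness, a sketch of the earlier script with this fix applied (expand_all=True plus node_to_text); the dump path and db path are placeholders taken from the posts above:

from functools import partial
from pathlib import Path

from wikitextprocessor import Wtp, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(page: Page, wtp: Wtp) -> None:
    wtp.start_page(page.title)
    # expand_all=True expands templates before parsing, so {{val|...}} becomes text
    tree = wtp.parse(page.body, expand_all=True)
    print(wtp.node_to_text(tree))

if __name__ == "__main__":
    wtp = Wtp(db_path=Path("../db/mydb"))
    process_dump(
        wtp,
        "./Wikipedia-20230825082710.xml.bz2",
        {0, 10, 110, 828},  # namespace ids
    )
    for _ in map(partial(page_handler, wtp=wtp), wtp.get_all_pages([0])):
        pass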

kristian-clausal commented 1 year ago

xxyzz is correct; I missed that expand_all=True was missing.

xxyzz commented 1 year ago

All Lua errors on the page "Free_neutron_decay" are fixed on the main branch now. But there will certainly be more errors on other pages.

qrdlgit commented 1 year ago

FWIW, after looking into this more deeply, the only thing that seems to work with any reasonable level of accuracy, and without requiring ongoing and continual massive engineering effort, is just importing into MediaWiki and rendering from there. There are pretty good scraping APIs that can be used to pull text from the page once rendered. Performance may be a bottleneck here, of course.

kristian-clausal commented 1 year ago

Are you using a complete Wikipedia dump as the source? I'm getting Lua errors, but then again I'm using the partial page dump file.

Num pages: 86
WIKIDATA QUERY succeded: item_id=None, result={'head': {'vars': ['itemLabel', 'itemDescription']}, 'results': {'bindings': [{'itemLabel': {'type': 'literal', 'value': 'None'}}]}}
Free neutron decay: ERROR: LUA error in #invoke('citation/CS1', 'citation\n', 'CitationClass=web\n') parent ('Template:Cite web', {'date': '2021-10-13', 'title': 'How Long Does a Neutron Live?', 'url': 'https://www.caltech.edu/about/news/how-long-does-a-neutron-live', 'access-date': '2021-10-14', 'website': 'California Institute of Technology', 'language': 'en'}) at ['Free neutron decay', 'Cite web', '#invoke']
[string "Module:Citation/CS1/Configuration"]:32: assign to undeclared variable 'uncategorized_namespaces_t'

etc.

xxyzz commented 1 year ago

I'm getting Lua errors

Please see my previous post of errors in "Module:Citation/CS1/Configuration": https://github.com/tatuylonen/wikitextprocessor/issues/90#issuecomment-1696730596

I fixed these errors by overwriting that page in the SQLite db: adding local and replacing undefined variables with nil.

xxyzz commented 1 year ago

I guess Wikipedia doesn't get the assign to undeclared variable Lua error because mw.loadData() doesn't run inside the same environment as the code that calls it, so the "strict" library doesn't affect "Module:Citation/CS1/Configuration". But I checked the Scribunto code and it looks like it only copies the environment variables and doesn't remove the metatable added by the "strict" library. I still don't know how Scribunto removes it; I must have missed something.