Output templates as AST directly

sweble / sweble-wikitext

The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaWiki.

http://sweble.org/sites/swc-devel/develop-latest/tooling/sweble/sweble-wikitext


Output templates as AST directly #88

Closed: xtexChooser closed this issue 2 years ago

xtexChooser commented 2 years ago

Hello,

I recently needed to parse some wikitext and found this library.

I want to extract data from infoboxes, but the library seems to only preprocess templates rather than parse them directly into AST nodes.

Is there a way to do this?

Thanks.

wetneb commented 2 years ago

I am pretty sure it is able to parse templates, there are classes in the AST for templates (WtTemplate), template arguments (WtTemplateArguments), argument names (WtName) and argument values (WtValue). Positional arguments are also supported. What else do you need?
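
To make the shape described above concrete, here is a minimal, self-contained sketch of a template node with named and positional arguments. The classes `Template` and `Arg` and the helper `toMap` are hypothetical stand-ins invented for illustration; they are not Sweble's `WtTemplate`, `WtTemplateArguments`, `WtName` or `WtValue`, and the real node API may look quite different. Positional arguments are keyed "1", "2", ... here, following the usual MediaWiki numbering.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the node types named above; NOT Sweble's classes.
final class TemplateModelSketch {

    /** One template argument; name is null for a positional argument. */
    static final class Arg {
        final String name;
        final String value;

        Arg(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** A template invocation such as {{Infobox|title=Example|2011}}. */
    static final class Template {
        final String name;
        final List<Arg> args = new ArrayList<>();

        Template(String name) {
            this.name = name;
        }

        /** Adds a named argument and returns this template for chaining. */
        Template arg(String argName, String value) {
            args.add(new Arg(argName, value));
            return this;
        }

        /** Adds a positional (unnamed) argument. */
        Template arg(String value) {
            return arg(null, value);
        }
    }

    /** Flattens the arguments into a map; positional ones get keys "1", "2", ... */
    static Map<String, String> toMap(Template t) {
        Map<String, String> out = new LinkedHashMap<>();
        int position = 0;
        for (Arg a : t.args) {
            String key = (a.name != null) ? a.name : String.valueOf(++position);
            out.put(key, a.value);
        }
        return out;
    }

    public static void main(String[] args) {
        Template infobox = new Template("Infobox")
                .arg("title", "Example")
                .arg("first positional")
                .arg("year", "2011");
        System.out.println(toMap(infobox));
    }
}
```

With a real parse tree, the same idea would presumably apply: find the template node whose name matches the infobox, then walk its arguments and read each name and value.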

xtexChooser commented 2 years ago

Oh, thanks, I will take a look at them!

Thanks!

xtexChooser commented 2 years ago

Hey, @wetneb

I tried to parse a wikitext document, but it seems that templates are not parsed correctly.

    println(
        WtAstPrinter.print(
            WikitextParser(WikitextParserConfig)
                .parseArticle(
                    """
                    {{About|123456}}
                    {{exclusive|java}}
                    {{History|infdev}}
                    {{History|alpha}}
                    {{History|java}}
                    {{reflist}}
                    """.trimIndent(),
                    "test"
                )
        )
    )

WtParsedWikitextPage(
    {P} entityMap = -
    {P} warnings = C[]
    [0] = WtImStartTag(
        {P} name = "@p"
        xmlAttributes = WtXmlAttributes[]
    ),
    [1] = "{{About|123456}}",
    [2] = WtNewline("\n"),
    [3] = "{{exclusive|java}}",
    [4] = WtNewline("\n"),
    [5] = "{{History|infdev}}",
    [6] = WtNewline("\n"),
    [7] = "{{History|alpha}}",
    [8] = WtNewline("\n"),
    [9] = "{{History|java}}",
    [10] = WtNewline("\n"),
    [11] = "{{reflist}}",
    [12] = WtImEndTag(
        {P} name = "@p"
    )
)

The templates were parsed as plain text rather than as WtTemplate nodes.

xtexChooser commented 2 years ago

Oops, they can be parsed with WikitextPreprocessor. What's the difference between WikitextParser and WikitextPreprocessor?

wetneb commented 2 years ago

No idea! The pipeline I use in OpenRefine is as follows:

    // Encoding validation
    WikitextEncodingValidator v = new WikitextEncodingValidator();

    String wikitext = CharStreams.toString(reader);
    String title = "Page title";
    ValidatedWikitext validated = v.validate(parserConfig, wikitext, title);

    // Pre-processing
    WikitextPreprocessor prep = new WikitextPreprocessor(parserConfig);
    WtPreproWikitextPage prepArticle = (WtPreproWikitextPage) prep.parseArticle(validated, title, false);

    // Parsing
    PreprocessedWikitext ppw = PreprocessorToParserTransformer.transform(prepArticle);
    WikitextParser parser = new WikitextParser(parserConfig);
    WtParsedWikitextPage parsedArticle = (WtParsedWikitextPage) parser.parseArticle(ppw, title);

All I can say is that this gives you parsed templates.

xtexChooser commented 2 years ago

Thanks!

hannesd commented 2 years ago

Wikitext parsing is designed as a two-stage process (at least it was at the time this library was written). First, pre-processing identifies and evaluates templates. This results in an altered wikitext that is then fed into the actual wikitext parser.
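
The two-stage idea can be illustrated with a toy sketch. It is not how Sweble is implemented: the class `TwoStageSketch`, the `Part` type and both methods are hypothetical, nested templates are ignored, and templates are only recognised, not evaluated. The point is only that the second stage treats whatever it receives as plain text unless the first stage has already marked it as a template, which matches the AST dump earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration only; all names here are hypothetical, not Sweble's API.
final class TwoStageSketch {

    /** A chunk of source text, flagged as a template span or plain text. */
    static final class Part {
        final boolean template;
        final String source;

        Part(boolean template, String source) {
            this.template = template;
            this.source = source;
        }
    }

    /** Stage 1: split raw text into plain-text chunks and non-nested {{...}} spans. */
    static List<Part> preprocess(String raw) {
        List<Part> parts = new ArrayList<>();
        int pos = 0;
        while (pos < raw.length()) {
            int open = raw.indexOf("{{", pos);
            int close = (open < 0) ? -1 : raw.indexOf("}}", open + 2);
            if (open < 0 || close < 0) {
                // No further complete template: the rest is plain text.
                parts.add(new Part(false, raw.substring(pos)));
                break;
            }
            if (open > pos) {
                parts.add(new Part(false, raw.substring(pos, open)));
            }
            parts.add(new Part(true, raw.substring(open, close + 2)));
            pos = close + 2;
        }
        return parts;
    }

    /** Stage 2: build node labels; only parts flagged by stage 1 become templates. */
    static List<String> parse(List<Part> parts) {
        List<String> nodes = new ArrayList<>();
        for (Part p : parts) {
            if (p.template) {
                // Drop the surrounding braces; the name is everything before the first '|'.
                String inner = p.source.substring(2, p.source.length() - 2);
                String name = inner.split("\\|", 2)[0].trim();
                nodes.add("Template(" + name + ")");
            } else {
                nodes.add("Text(" + p.source + ")");
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        String raw = "{{About|123456}} intro {{reflist}}";
        // Stage 2 alone: the whole input is handed over as unflagged text.
        System.out.println(parse(Collections.singletonList(new Part(false, raw))));
        // Stage 1 followed by stage 2: template spans are recognised first.
        System.out.println(parse(preprocess(raw)));
    }
}
```

In the pipeline quoted above, the hand-off between the two stages is done by `PreprocessorToParserTransformer.transform`, which is what this sketch approximates by passing a list of flagged parts.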