tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Newlines inside `includeonly` are expanded on our side, but not in wikitext. #314

Closed kristian-clausal closed 1 month ago

kristian-clausal commented 1 month ago

For example: Template:ti verb

(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')<includeonly>[[Category:Transitive verbs]]
[[Category:Intransitive verbs]]</includeonly><noinclude>
{{documentation}}
<!-- Categories, interwikis and TemplateData goes to the /doc subpage -->
</noinclude>

is used on one page, excrete. The includeonly element contains a newline, which shows up on our side as 'glosses': ['(transitive & intransitive)\n Foooo.']. The Template:biology on the same page doesn't get the newline, because there is no newline inside its includeonly.

I mixed up onlyinclude and includeonly for a while, and it took me some time to understand which is which... onlyinclude marks text that is the only thing you want to output when the template is expanded (anything outside of it is discarded). includeonly, which is what is at issue here, marks a piece of text, like a Category link, that you don't want to appear on the template's own display page. That is, you don't want Template:ti verb to appear in the Transitive verbs category, so includeonly only lets the category link be rendered when the template is being expanded on some other page.
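To make the distinction concrete, here is a minimal regex sketch of how the three inclusion-control tags behave. The function name `render_template_source` and the shortened template text are made up for illustration; this is neither MediaWiki's preprocessor nor wikitextprocessor's code, and it ignores nesting and edge cases.

```python
import re


def render_template_source(source: str, transcluded: bool) -> str:
    """Apply the inclusion-control tags to raw template source (simplified)."""
    if transcluded:
        # <onlyinclude>: if present, only its contents are transcluded.
        only = re.findall(
            r"<onlyinclude>(.*?)</onlyinclude>", source, flags=re.DOTALL
        )
        if only:
            source = "".join(only)
        # <noinclude> content never reaches the including page.
        source = re.sub(
            r"<noinclude>.*?</noinclude>", "", source, flags=re.DOTALL
        )
        # <includeonly> content is kept; only the tags themselves disappear.
        return re.sub(r"</?includeonly>", "", source)
    # Viewing the template page itself: <includeonly> content is hidden,
    # while the other two tags are transparent.
    source = re.sub(
        r"<includeonly>.*?</includeonly>", "", source, flags=re.DOTALL
    )
    return re.sub(r"</?(?:noinclude|onlyinclude)>", "", source)


# Shortened stand-in for Template:ti verb (hypothetical, not the real source).
TI_VERB = (
    "(''transitive'')<includeonly>[[Category:Transitive verbs]]\n"
    "[[Category:Intransitive verbs]]</includeonly><noinclude>\n"
    "{{documentation}}\n"
    "</noinclude>"
)

on_other_page = render_template_source(TI_VERB, transcluded=True)
on_template_page = render_template_source(TI_VERB, transcluded=False)
```

In the transcluded result the category links, and the newline between them, survive; on the template's own page they are gone and the documentation call is shown instead.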

However, I don't understand why the newline disappears in this case. Either this is so common in wikitext that they just went ahead and removed all newlines in includeonly, or something else weird is going on.

Has a newline after "transitive)" on Wiktionary:

(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')<includeonly>

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]
</includeonly><noinclude>
{{documentation}}
<!-- Categories, interwikis and TemplateData goes to the /doc subpage -->
</noinclude>

Does not have a newline after "transitive)" on Wiktionary:

(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')<includeonly>

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]</includeonly><noinclude>
{{documentation}}
<!-- Categories, interwikis and TemplateData goes to the /doc subpage -->
</noinclude>

kristian-clausal commented 1 month ago

<includeonly>

test

test

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]</includeonly>

results in "\n\n\ntest\n\ntest", but

<includeonly>test

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]</includeonly>

results in "test", with no trailing newlines, or rather, no trailing whitespace.

Here are the rules, I think:

  1. If there's a newline just before </includeonly>, render it.
  2. If there's non-whitespace text inside the element that isn't a category link (which wouldn't render), then include all the whitespace that appears from the start of the element until the end of that text.

I'm not sure this is documented anywhere.

kristian-clausal commented 1 month ago

A wrinkle: we remove includeonly tags when adding the page to the database. Just rip them out.
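A rough reproduction of why this matters, assuming the tags are dropped with a plain regex and that category links are later deleted without touching the whitespace around them. Both helper names are hypothetical and this is not the actual wikitextprocessor code path.

```python
import re

INCLUDEONLY_TAG = re.compile(r"</?includeonly\s*>", re.IGNORECASE)
CATEGORY_LINK = re.compile(r"\[\[Category:[^\]]*\]\]")


def strip_includeonly_tags(text: str) -> str:
    """Drop the tags but keep everything between them, newlines included."""
    return INCLUDEONLY_TAG.sub("", text)


def naive_drop_categories(text: str) -> str:
    """Delete category links only; surrounding whitespace is left alone."""
    return CATEGORY_LINK.sub("", text)


body = (
    "(transitive)<includeonly>[[Category:Transitive verbs]]\n"
    "[[Category:Intransitive verbs]]</includeonly>"
)
stored = strip_includeonly_tags(body)
gloss = naive_drop_categories(stored + " Foo")
print(repr(gloss))
```

The newline that sat between the two category links is the only thing left of the includeonly contents, and it ends up in the middle of the gloss.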

xxyzz commented 1 month ago

I think MediaWiki doesn't change newlines in <includeonly>. I have tested your examples on a sandbox page and on the Special:ExpandTemplates page, and newlines in <includeonly> are not removed. We should get the same expanded wikitext as the "Result" section of the Special:ExpandTemplates page.

--<includeonly>

test

test

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]</includeonly>--

expands to:

--

test

test

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]--

and

--<includeonly>test

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]</includeonly>--

expands to:

--test

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]--

I guess you only looked at the preview HTML? IMO that's not what we should imitate, at least not before or during expansion; it should happen when we convert wikitext to plain text. This also happens when <includeonly> is not used.

For the Simple English edition's extractor code, maybe use str.replace("\n", "") for now? We're supposed to process the tag template separately from the gloss text anyway.
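A minimal sketch of that stopgap, assuming the tag template is expanded on its own and its category links have already been removed. `flatten_tag_template` and the sample strings are hypothetical.

```python
def flatten_tag_template(expanded: str) -> str:
    """Remove every newline from an already-expanded tag template."""
    return expanded.replace("\n", "")


# What is left of the tag template once the category links are gone.
expanded_tag = "(transitive & intransitive)\n"
gloss = flatten_tag_template(expanded_tag) + " Foooo."
print(repr(gloss))
```

Because this deletes all newlines, it is only safe on the tag template's own expansion, not on a whole gloss where a line break may be meaningful.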

kristian-clausal commented 1 month ago

Currently, some templates generate newlines inside things like glosses. This is not acceptable:

Simple English Wiktionary:

# {{ti verb}} Foo

-> (transitive)\n Foo

This should be (transitive) Foo, like on the webpage.

I will ignore what the Expand templates page says, because there is something fucky going on. At some point, the contents of the includeonly gets rstrip()ped or whatever the PHP equivalent is, probably after the category links have been expanded. But currently we're not handling includeonly at all, just removing the tags!

If you can't come up with a better solution, I will merge this. The result is what matters, because our implementation definitely does not follow the wikitext implementation; at best we're approximating it. This IS just a hack, but it's better than nothing.

xxyzz commented 1 month ago

I'm not sure if you noticed that the newlines are also removed when <includeonly> is not used. I think this conversion happens when MediaWiki converts wikitext to HTML; it is not related to how <includeonly> is handled.

The same goes for our code: I think this is the same step as where we remove category links from the expanded wikitext, and I think MediaWiki also removes the newlines around these links at that step.

kristian-clausal commented 1 month ago

You are correct! Damn it. I thought it was the includeonly.

Simple English Wiktionary -> Template:ti verb -> edit and use "Preview page with this template" with "excrete"

(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]t

-> 1. ([biology](https://simple.wiktionary.org/wiki/biology)) ([transitive](https://simple.wiktionary.org/wiki/transitive) & [intransitive](https://simple.wiktionary.org/wiki/intransitive))t If your body excretes waste material,

but

(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')

[[Category:Transitive verbs]]

[[Category:Intransitive verbs]]
t

-> 1. ([biology](https://simple.wiktionary.org/wiki/biology)) ([transitive](https://simple.wiktionary.org/wiki/transitive) & [intransitive](https://simple.wiktionary.org/wiki/intransitive)) t If your body excretes waste material,

The newlines and whitespace before(?) the Category links are removed; it doesn't seem to have anything to do with trimming the end of the expanded value.
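The two previews are consistent with a rule like "drop each category link together with the whitespace in front of it, then let HTML collapse whatever whitespace remains". The sketch below only encodes that guess; it is not a documented MediaWiki rule, and the helper names and shortened inputs are made up.

```python
import re

# A category link plus any whitespace (including newlines) directly before it.
CATEGORY_WITH_LEADING_WS = re.compile(r"\s*\[\[Category:[^\]\n]*\]\]")


def drop_categories(text: str) -> str:
    """Remove category links and the whitespace that precedes them."""
    return CATEGORY_WITH_LEADING_WS.sub("", text)


def as_rendered(text: str) -> str:
    """Collapse whitespace runs to one space, roughly as HTML display does."""
    return re.sub(r"\s+", " ", text)


first = (
    "(transitive)\n\n[[Category:Transitive verbs]]"
    "\n\n[[Category:Intransitive verbs]]t"
)
second = (
    "(transitive)\n\n[[Category:Transitive verbs]]"
    "\n\n[[Category:Intransitive verbs]]\nt"
)

print(as_rendered(drop_categories(first)))
print(as_rendered(drop_categories(second)))
```

Under this guess the first input has nothing between ")" and "t", while the second keeps the newline that follows the last link, which then displays as a single space.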

xxyzz commented 1 month ago

Maybe we should ask a MediaWiki developer what the rules are for converting newlines around category links to HTML, if this is not documented...

In the meantime, I think we could temporarily get around this problem by extracting the link nodes in the expanded tag template and saving its category links by calling clean_node().
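As a standalone illustration of that idea, the sketch below separates the category names from the visible text of an expanded tag template. It uses a plain regex instead of parsed link nodes and does not call wiktextract's clean_node(); `split_categories` is a hypothetical helper.

```python
import re

# Category link with optional sort key, plus any whitespace in front of it.
CATEGORY_LINK = re.compile(
    r"\s*\[\[Category:([^\]|\n]+)(?:\|[^\]\n]*)?\]\]"
)


def split_categories(expanded: str) -> tuple[str, list[str]]:
    """Return (text without category links, list of category names)."""
    categories = [m.group(1).strip() for m in CATEGORY_LINK.finditer(expanded)]
    return CATEGORY_LINK.sub("", expanded), categories


expanded_tag = (
    "(transitive)[[Category:Transitive verbs]]\n"
    "[[Category:Intransitive verbs]]"
)
text, categories = split_categories(expanded_tag)
print(repr(text), categories)
```

The caller can then store the categories on the sense data and use the cleaned text as the tag, so the stray newline never reaches the gloss.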

kristian-clausal commented 1 month ago

Instead of fixing this here, PR 843 for wiktextract solves the problem on wiktextract's side, in clean_value.

There's no really convenient place to fix this on wikitextprocessor's side (we do have all the to_X functions in Wtp), and it might be better to have clean_value's and clean_node's functionality on the wikitextprocessor side. It's a bit weird now.