tatuylonen / wikitextprocessor

Python package for processing WikiMedia dumps (Wiktionary, Wikipedia, etc.): wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Presence of spurious text. #243

Open LeMoussel opened 4 months ago

LeMoussel commented 4 months ago

In some analyzed texts, spurious character sequences such as `{{!-}}`, `}}|`, `{{{blbla bla|}}}`, etc. are present.

For example, in the following articles:

xxyzz commented 4 months ago

Probably same as https://github.com/tatuylonen/wiktextract/issues/533, we will check this bug next week.

xxyzz commented 4 months ago

#250 should fix this bug, if it's the same bug as wiktextract issue 533.

LeMoussel commented 4 months ago

Alpes-de-Haute-Provence: the output is OK. No errors, warnings, or debug messages.

Akhenaton: the output is KO. The output contains:

    |
    |align="center" valign="middle"|
    |}
    | style="text-align:center;padding:2px;" |
    | style="text-align:left;padding:2px;" |
    |}
    |
    |align="center" valign="middle"|
    |}
    | style="text-align:center;padding:2px;" |
    | style="text-align:left;padding:2px;" |
    |}

Debug messages:

    Akhenaton: DEBUG: HTML tag not properly closed at ['Akhenaton'] parsing Akhenaton/Règne/Révolution religieuse/Période noire ? started on line 136, detected on line 470
    Akhenaton: DEBUG: HTML tag not properly closed at ['Akhenaton'] parsing Akhenaton/Règne/Révolution religieuse started on line 102, detected on line 470
    Akhenaton: DEBUG: HTML tag not properly closed at ['Akhenaton'] parsing Akhenaton/Règne/Révolution religieuse started on line 100, detected on line 470

Anubis: the output is OK. No errors, warnings, or debug messages.

Algèbre de Boole (logique): the output is OK. No errors, warnings, or debug messages.

Almost good :relaxed:, but for "Akhenaton" maybe it's another anomaly?

kristian-clausal commented 4 months ago

Currently downloading the French Wikipedia dump to make a new .db file with the updated data structure, so I'll be taking a look at this, maybe by tomorrow.

kristian-clausal commented 4 months ago

I'm getting mostly correct output for Akhenaton. I've found issues, but not the ones that you have here.

Some of the image links, like in the homonym template at the start of the page, don't have alt texts at all; instead they pick up the last argument, `class=noviewer`.

I also found a few broken table ends, `|}`, one after another, which is probably what is left of the broken tables in your post. I'll take a deeper look tomorrow.

kristian-clausal commented 3 months ago

There's a PR for wiktextract that should take care of the last of the fixes I've attempted here.

clean_value() is supposed to remove wikitext tables ({| ... |}), and it will continue to do so. However, HTML tables (<table>...</table>) will be left in, and their contents will simply be rendered one cell after another, linearly. There hasn't been any need for wiktextract to handle this better, and we'll keep it this way (at least for a while); if you do not want to see HTML tables or wikitext tables in the output, or if you want to see both, they need to be handled with a node handling function. This might change in the future if we change how clean_value is implemented, or if we create a separate thing (a library or a wikitextprocessor-specific cleaning function).
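A minimal sketch of such a node handling function, following the `node_handler_fn` pattern used in the test script later in this thread (the specific `NodeKind` checks are assumptions for illustration, not an official recipe):

```python
from typing import Optional

from wikitextprocessor import NodeKind, WikiNode


def drop_tables_handler(node: WikiNode) -> Optional[str]:
    """Skip both wikitext tables and HTML <table> elements.

    Returning "" tells clean_node to emit nothing for this node;
    returning None falls back to the default handling.
    """
    if node.kind == NodeKind.TABLE:
        # Wikitext table: {| ... |}
        return ""
    if node.kind == NodeKind.HTML and node.sarg == "table":
        # HTML table element: <table>...</table>
        return ""
    return None
```

Passed as `node_handler_fn=drop_tables_handler` to `clean_node`, this would drop both kinds of tables from the cleaned text.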

Other issues addressed: image links are rare in a Wiktionary context, so they weren't handled well, but now several issues should be fixed. There were other things but I can't actually remember what they were, so they couldn't have been that important!

Please pull the newest commits of wikitextprocessor and wiktextract sometime tomorrow and check if everything looks better now. The PR in wiktextract needs to be looked at by someone other than me because it involves negative lookahead in regex and I am already starting to see into the sixth dimension.

LeMoussel commented 3 months ago

- [ ] [Alpes-de-Haute-Provence](https://fr.wikipedia.org/wiki/Alpes-de-Haute-Provence) KO
Presence of `[Alt: icône décorative] Portail des Alpes`

- [ ] [Akhenaton](https://fr.wikipedia.org/wiki/Akhenaton) KO
Presence of `[Alt: Page d’aide sur l’homonymie]` and this (many times)

    | |align="center" valign="middle"| |} | style="text-align:center;padding:2px;" | | style="text-align:left;padding:2px;" | |}

- [x] [Anubis](https://fr.wikipedia.org/wiki/Anubis) OK
- [ ] [Algèbre de Boole (logique)](https://fr.wikipedia.org/wiki/Alg%C3%A8bre_de_Boole_(logique)) KO
Presence of `[Alt: Page d’aide sur l’homonymie]`

[wikitext_parse-Akhenaton.txt](https://github.com/tatuylonen/wikitextprocessor/files/14588399/wikitext_parse-Akhenaton.txt)
[wikitext_parse-Algèbre de Boole (logique).txt](https://github.com/tatuylonen/wikitextprocessor/files/14588400/wikitext_parse-Algebre.de.Boole.logique.txt)
[wikitext_parse-Alpes-de-Haute-Provence.txt](https://github.com/tatuylonen/wikitextprocessor/files/14588401/wikitext_parse-Alpes-de-Haute-Provence.txt)

**Python Test code:**
```python
import re
import requests
from typing import Optional

# https://github.com/tatuylonen/wikitextprocessor/
from wikitextprocessor import (
    Wtp,
    NodeKind,
    WikiNode,
)
# https://github.com/tatuylonen/wiktextract
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node

def clean_node_handler(node) -> Optional[str]:
    """Process nodes when encountering them.
    For example by filtering them or changing them if needed."""
    assert isinstance(node, WikiNode)

    if node.kind == NodeKind.TEMPLATE:
        if node.largs[0][0] in ['Semi-protection', 'Semi-protection longue', 'Confusion', 'coord']:
            return ""
        if re.match('Infobox', node.largs[0][0], re.I):
            return ""
        if re.match('Article', node.largs[0][0], re.I):
            return ""
        if re.match('Référence', node.largs[0][0], re.I):
            return ""

    if node.kind == NodeKind.LEVEL2:
        if node.largs[0][0] in ['Annexes', 'Notes et références', 'Voir aussi']:
            return ""

    if node.kind == NodeKind.LINK:
        if re.match('Fichier:', node.largs[0][0], re.I):
            return ""

    #if node.kind == NodeKind.HTML:
        #print(node.sarg)

    #if hasattr(node, 'largs') and len(node.largs) > 0:
    #    if node.largs[0][0] in  ['=== Langues ===']:

    return None

def template_handler(name, args_ht):
    if len(args_ht) == 0:
        return ""
    return None

if __name__ == '__main__':
    extension_tags = {
        "maplink": {"parents": ["phrasing"], "content": ["phrasing"]},
        "poem": {"parents": ["phrasing"], "content": ["phrasing"]},
        "gallery": {"parents": ["phrasing"], "content": ["phrasing"]},
        "graph": {"parents": ["phrasing"], "content": ["phrasing"]},
        "mapframe": {"parents": ["phrasing"], "content": ["phrasing"]},
        "timeline": {"parents": ["phrasing"], "content": ["phrasing"]},
    }
    wxr = WiktextractContext(
        wtp = Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
            extension_tags=extension_tags,
        ),
        config=WiktionaryConfig(),
    )

    wiki_page_title = 'Alpes-de-Haute-Provence'

    wiki_page = wxr.wtp.get_page(wiki_page_title)

    wxr.wtp.start_page(wiki_page.title)
    wxr.wtp.invoke_aliases = wxr.wtp.invoke_aliases | {"#invoque"}

    info_log = f"Analyse: '{wiki_page_title}'\n"

    wiki_nodes = wxr.wtp.parse(text=wiki_page.body)
    text = clean_node(
            wxr=wxr,
            sense_data={},
            wikinode=wiki_nodes,
            collect_links=False,
            node_handler_fn=clean_node_handler,
            template_fn=template_handler,
        )

    if len(wxr.wtp.errors) > 0:
        info_log += f"# Erreurs: {len(wxr.wtp.errors)}\n"
    if len(wxr.wtp.warnings) > 0:
        info_log += f"# Warnings: {len(wxr.wtp.warnings)}"

    print(info_log)

    with open(f'wikitext-{wiki_page_title}.txt', 'w', encoding='utf-8') as f:
        f.write(wiki_page.body)
    with open(f'wikitext_parse-{wiki_page_title}.txt', 'w', encoding='utf-8') as f:
        f.write(text)
```

xxyzz commented 3 months ago

`[Alt: something]` is the "alt" text of an image; this was added in a recent PR: https://github.com/tatuylonen/wiktextract/pull/539

I guess you only need the entire page text and not the wikitext node types or structure data; have you tried the HTML dump file or the ZIM dump file?

kristian-clausal commented 3 months ago

I've used Akhenaton as the text to test these changes on, so it should be fine. I can't recreate the specific error you have there. I will try using your specific code.

Also, a tip about regexes: using r-strings (r"like this") lets you write strings containing escape characters (like \n or \) without them being interpreted as in a normal Python string; it's a 'raw' string literal. The regex for 'Fichier:' would be r'Fichier:', or rather r'\s*Fichier\s*:', because it turns out there can be whitespace in those places.
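For example (a small illustration of the raw-string tip, not taken from the code above):

```python
import re

# The raw string keeps the backslashes literal, so \s reaches the regex engine intact.
pattern = r"\s*Fichier\s*:"

print(bool(re.match(pattern, "Fichier:Chat.jpg")))     # True
print(bool(re.match(pattern, "  Fichier :Chat.jpg")))  # True (leading/inner whitespace)
```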

kristian-clausal commented 3 months ago

Yeah, this is the table removal stuff that hadn't yet been merged when you tried it; there was still a pull request waiting, which is why I said to wait until tomorrow (it's now tomorrow morning in Europe). Please pull the newest commits and try again.

LeMoussel commented 3 months ago

Yes, I only need the entire text of the page, not the wikitext node types or structure data.

Just like the result from the Wikipedia API with `action=query&prop=extracts|revisions&explaintext`. Example: Alpes-de-Haute-Provence.
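For reference, fetching that plain-text extract with `requests` could look roughly like this (a sketch against the standard MediaWiki API; the output will not necessarily match what wiktextract produces):

```python
import requests

# Ask the French Wikipedia API for the plain-text extract of one page.
resp = requests.get(
    "https://fr.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": "Alpes-de-Haute-Provence",
        "format": "json",
    },
    timeout=30,
)
pages = resp.json()["query"]["pages"]
extract = next(iter(pages.values()))["extract"]
print(extract[:200])
```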

The HTML format is complicated to parse. I tried different tools like Trafilatura, but the results are not satisfactory.

I don't know the ZIM format, but from what I quickly saw, it is also HTML-based.

LeMoussel commented 3 months ago

Pull:

    dev@dev-B550M-DS3H:~/Python/WikiExtractor$ cd wiktextract
    dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ git pull
    Updating 5d6bb0e2..e992e954
    Fast-forward
     src/wiktextract/clean.py | 23 +++++++++++++++++++----
     tests/test_clean.py      | 12 ++++++++++++
     2 files changed, 31 insertions(+), 4 deletions(-)
    dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ cd ../wikitextprocessor
    dev@dev-B550M-DS3H:~/Python/WikiExtractor/wikitextprocessor$ git pull
    Already up to date.

Test on "Alpes-de-Haute-Provence". Same errors. Presence of [Alt: Page d’aide sur l’homonymie] and this [Alt: icône décorative]

Rem: following your advice, I modified all the regexes. NB: I'm away until Monday.

xxyzz commented 3 months ago

HTML is more complicated than wikitext, seriously? You could try an HTML/XML parser like lxml or Beautiful Soup to find the body element and use its methods or attributes to get the text.
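A minimal sketch of that approach with Beautiful Soup (assuming `html_page` holds the HTML of one article taken from the dump; lxml would work similarly):

```python
from bs4 import BeautifulSoup

# html_page would come from the HTML dump (or a ZIM entry) for one article.
html_page = "<html><body><p>Pour les articles homonymes, voir Aisne.</p></body></html>"

soup = BeautifulSoup(html_page, "html.parser")
body = soup.find("body")
text = body.get_text(separator="\n", strip=True)
print(text)
```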

kristian-clausal commented 3 months ago

The [Alt]s will remain. We can change the syntax if need be, but the alt-text needs to be distinguishable from 'normal' text somehow, and easily processed.

LeMoussel commented 3 months ago

Surprise! With this code:

        tree = self.wxr.wtp.parse(text="{{Voir homonymes|Aisne}}")
        text = clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
            collect_links=False,
            node_handler_fn=clean_node_handler,
        )

`text` is `'[Alt: Page d’aide sur l’homonymie]\nPour les articles homonymes, voir Aisne.'`, which according to your comments is correct.

But with this code:

        tree = self.wxr.wtp.parse(text="{{Voir homonymes|Aisne}}", expand_all=True)
        text = clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
            collect_links=False,
            node_handler_fn=clean_node_handler,
        )

`text` is `'Pour les articles homonymes, voir Aisne.'`. The [Alt] is gone, which suits me.

In the documentation, it would be interesting to have examples of the different results of using the pre_expand and expand_all parameters of the parse() function.

kristian-clausal commented 3 months ago

Thank you for pointing out this mismatch, I will take a look at it.

LeMoussel commented 3 months ago

But I note that using expand_all=True adds other information to the text which does not interest me. As you indicated in https://github.com/tatuylonen/wikitextprocessor/issues/225#issuecomment-1996623492, I will remove the [Alt]s in post-processing.
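A possible post-processing step (a sketch; the `[Alt: ...]` pattern is assumed from the examples above and may need adjusting if the syntax changes):

```python
import re

# Strip "[Alt: ...]" image alt-text markers and any whitespace left behind.
ALT_RE = re.compile(r"\[Alt:[^\]]*\]\s*")


def strip_alt_markers(text: str) -> str:
    return ALT_RE.sub("", text)


print(strip_alt_markers("[Alt: Page d’aide sur l’homonymie]\nPour les articles homonymes, voir Aisne."))
# -> "Pour les articles homonymes, voir Aisne."
```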