Open LeMoussel opened 4 months ago
Probably same as https://github.com/tatuylonen/wiktextract/issues/533, we will check this bug next week.
Alpes-de-Haute-Provence
The output is OK. No error, warning, or debug messages.
Akhenaton
The output is KO. The output contains:

```
|
|align="center" valign="middle"|
|}
| style="text-align:center;padding:2px;" |
| style="text-align:left;padding:2px;" |
|}
|
|align="center" valign="middle"|
|}
| style="text-align:center;padding:2px;" |
| style="text-align:left;padding:2px;" |
|}
```

and these debug messages:

```
Akhenaton: DEBUG: HTML tag not properly closed at ['Akhenaton'] parsing Akhenaton/Règne/Révolution religieuse/Période noire ? started on line 136, detected on line 470
Akhenaton: DEBUG: HTML tag not properly closed at ['Akhenaton'] parsing Akhenaton/Règne/Révolution religieuse started on line 102, detected on line 470
Akhenaton: DEBUG: HTML tag not properly closed at ['Akhenaton'] parsing Akhenaton/Règne/Révolution religieuse started on line 100, detected on line 470
```
Anubis
The output is OK. No error, warning, or debug messages.
Algèbre de Boole (logique)
The output is OK. No error, warning, or debug messages.
Almost good :relaxed:, but for "Akhenaton" maybe it's another anomaly?
Currently downloading the French Wikipedia dump to make a new .db file with the updated data structure, so I'll be taking a look at this, maybe by tomorrow.
I'm getting mostly correct output for Akhenaton. I've found issues, but not the ones that you have here.
Some of the image links, like in the homonym template at the start of the page, don't have alt texts at all; they take the last argument, `class=noviewer`.
I also found a few broken table ends, `|}` pairs one after another, which is probably what is left of the broken tables in your post. I'll take a deeper look tomorrow.
There's a PR for wiktextract that should take care of the last of the fixes I've attempted here.
clean_value() is supposed to remove wikitext tables (`{| ... |}`), which it will continue to do. However, HTML tables (`<table>...</table>`) will be left in, and their contents will just be rendered one cell after another, linearly. There hasn't been any need for wiktextract to handle this any better, and we'll keep it this way (at least for a while); if you do not want to see HTML tables or wikitext tables in the output, or do want to see both, they need to be handled with a node handling function. This might change in the future, if we change how clean_value is implemented, or create a separate thing (a library or a wikitextprocessor-specific cleaning function).
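A node handling function that drops both kinds of tables, as described above, could look roughly like the sketch below. This is an illustration, not wikitextprocessor's definitive API: it assumes the `node_handler_fn` contract used with clean_node elsewhere in this thread (return `""` to drop a node, `None` to keep default handling), and it compares `node.kind.name` instead of `NodeKind` members only so the snippet runs standalone.

```python
from typing import Optional


def drop_tables(node) -> Optional[str]:
    """node_handler_fn sketch: drop wikitext tables and HTML <table> elements.

    With wikitextprocessor imported, the checks below are equivalent to
    `node.kind == NodeKind.TABLE` and to `node.kind == NodeKind.HTML`
    with `node.sarg == "table"`; `.name` is compared here so the sketch
    stays importable on its own.
    """
    if node.kind.name == "TABLE":  # wikitext {| ... |} table
        return ""
    if node.kind.name == "HTML" and node.sarg == "table":  # <table> element
        return ""
    return None  # anything else: fall through to default cleaning
```

Passed as `node_handler_fn=drop_tables` to clean_node, this would suppress tables of either syntax while leaving the rest of the page untouched.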
Other issues addressed: image links are rare in a Wiktionary context, so they weren't handled well, but now several issues should be fixed. There were other things but I can't actually remember what they were, so they couldn't have been that important!
Please pull the newest commits of wikitextprocessor and wiktextract sometime tomorrow and check if everything looks better now. The PR in wiktextract needs to be looked at by someone other than me because it involves negative lookahead in regex and I am already starting to see into the sixth dimension.
`[Alt: Page d’aide sur l’homonymie]` and this:

```
[Alt: icône décorative] Portail des Alpes-de-Haute-Provence
[Alt: icône décorative] Portail des Alpes
```
- [ ] [Akhenaton](https://fr.wikipedia.org/wiki/Akhenaton) KO
Presence of `[Alt: Page d’aide sur l’homonymie]` and this (many times):

```
|
|align="center" valign="middle"|
|}
| style="text-align:center;padding:2px;" |
| style="text-align:left;padding:2px;" |
|}
```
- [x] [Anubis](https://fr.wikipedia.org/wiki/Anubis) OK
- [ ] [Algèbre de Boole (logique)](https://fr.wikipedia.org/wiki/Alg%C3%A8bre_de_Boole_(logique)) KO
Presence of `[Alt: Page d’aide sur l’homonymie]`
[wikitext_parse-Akhenaton.txt](https://github.com/tatuylonen/wikitextprocessor/files/14588399/wikitext_parse-Akhenaton.txt)
[wikitext_parse-Algèbre de Boole (logique).txt](https://github.com/tatuylonen/wikitextprocessor/files/14588400/wikitext_parse-Algebre.de.Boole.logique.txt)
[wikitext_parse-Alpes-de-Haute-Provence.txt](https://github.com/tatuylonen/wikitextprocessor/files/14588401/wikitext_parse-Alpes-de-Haute-Provence.txt)
**Python Test code:**
```python
import re
from typing import Optional

import requests

# https://github.com/tatuylonen/wikitextprocessor/
from wikitextprocessor import (
    Wtp,
    NodeKind,
    WikiNode,
)

# https://github.com/tatuylonen/wiktextract
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node


def clean_node_handler(node) -> Optional[str]:
    """Process nodes when encountering them,
    for example by filtering or changing them if needed."""
    assert isinstance(node, WikiNode)
    if node.kind == NodeKind.TEMPLATE:
        if node.largs[0][0] in [
            'Semi-protection',
            'Semi-protection longue',
            'Confusion',
            'coord',
        ]:
            return ""
        if re.match('Infobox', node.largs[0][0], re.I):
            return ""
        if re.match('Article', node.largs[0][0], re.I):
            return ""
        if re.match('Référence', node.largs[0][0], re.I):
            return ""
    if node.kind == NodeKind.LEVEL2:
        if node.largs[0][0] in ['Annexes', 'Notes et références', 'Voir aussi']:
            return ""
    if node.kind == NodeKind.LINK:
        if re.match('Fichier:', node.largs[0][0], re.I):
            return ""
    # if node.kind == NodeKind.HTML:
    #     print(node.sarg)
    # if hasattr(node, 'largs') and len(node.largs) > 0:
    #     if node.largs[0][0] in ['=== Langues ===']:
    return None


def template_handler(name, args_ht):
    if len(args_ht) == 0:
        return ""
    return None


if __name__ == '__main__':
    extension_tags = {
        "maplink": {"parents": ["phrasing"], "content": ["phrasing"]},
        "poem": {"parents": ["phrasing"], "content": ["phrasing"]},
        "gallery": {"parents": ["phrasing"], "content": ["phrasing"]},
        "graph": {"parents": ["phrasing"], "content": ["phrasing"]},
        "mapframe": {"parents": ["phrasing"], "content": ["phrasing"]},
        "timeline": {"parents": ["phrasing"], "content": ["phrasing"]},
    }
    wxr = WiktextractContext(
        wtp=Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
            extension_tags=extension_tags,
        ),
        config=WiktionaryConfig(),
    )

    wiki_page_title = 'Alpes-de-Haute-Provence'
    wiki_page = wxr.wtp.get_page(wiki_page_title)
    wxr.wtp.start_page(wiki_page.title)
    wxr.wtp.invoke_aliases = wxr.wtp.invoke_aliases | {"#invoque"}

    info_log = f"Analyse: '{wiki_page_title}'\n"
    wiki_nodes = wxr.wtp.parse(text=wiki_page.body)
    text = clean_node(
        wxr=wxr,
        sense_data={},
        wikinode=wiki_nodes,
        collect_links=False,
        node_handler_fn=clean_node_handler,
        template_fn=template_handler,
    )
    if len(wxr.wtp.errors) > 0:
        info_log += f"# Erreurs: {len(wxr.wtp.errors)}\n"
    if len(wxr.wtp.warnings) > 0:
        info_log += f"# Warnings: {len(wxr.wtp.warnings)}"
    print(info_log)

    with open(f'wikitext-{wiki_page_title}.txt', 'w', encoding='utf-8') as f:
        f.write(wiki_page.body)
    with open(f'wikitext_parse-{wiki_page_title}.txt', 'w', encoding='utf-8') as f:
        f.write(text)
```
`[Alt: something]` is the "alt" text of an image; this was added in a recent PR: https://github.com/tatuylonen/wiktextract/pull/539
I guess you only need the entire page text and don't need the wikitext node types or structure data. Have you tried the HTML dump file or the ZIM dump file?
I've used Akhenaton as the text to test these changes on, so it should be fine. I can't recreate the specific error you have there. I will try using your specific code.
Also, a tip with regexes: using r-strings (`r"like this"`) lets you write strings containing backslash sequences (like `\n` or `\\`) without them being interpreted as escapes, as they would be in a normal Python string; it's a 'raw' string literal. The regex for `'Fichier:'` would be `r'Fichier:'`, and better yet `r'\s*Fichier\s*:'`, because it turns out there can be whitespace in those places.
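For illustration, here is how that whitespace-tolerant raw-string pattern behaves (`is_file_link` is a hypothetical helper, not part of either library):

```python
import re

# Raw string: the backslashes in \s reach the regex engine untouched,
# instead of first being processed as Python string escapes.
FICHIER_RE = re.compile(r"\s*Fichier\s*:", re.I)


def is_file_link(target: str) -> bool:
    """True if a link target starts with 'Fichier:', tolerating whitespace."""
    return FICHIER_RE.match(target) is not None
```

So `is_file_link("  fichier : Chat.jpg")` matches even though the plain `'Fichier:'` pattern would not.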
Yeah, this is the table removal stuff that hadn't yet been merged when you pulled; there was still a pull request waiting, which is why I said to wait until tomorrow (it's now tomorrow morning in Europe). Please pull the newest commits and try again.
Yes, I only need the entire text of the page, but not the wikitext node type or structure data.
Just like the result from the Wikipedia API with `action=query&prop=extracts|revisions&explaintext`.
Example: Alpes-de-Haute-Provence
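That API call could be sketched like this (a hypothetical helper, not part of the code above; the parameters are those of the MediaWiki TextExtracts extension, and the `get` argument is injectable so the function can be exercised without network access):

```python
import requests


def fetch_plaintext(title: str, lang: str = "fr", get=requests.get) -> str:
    """Fetch a page's plain-text extract via the MediaWiki TextExtracts API."""
    resp = get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "format": "json",
            "titles": title,
        },
        timeout=30,
    )
    # The API keys pages by page id; take the single page requested.
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```

For example, `fetch_plaintext("Alpes-de-Haute-Provence")` would return the same kind of flat text as the `explaintext` link above.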
The HTML format is complicated to parse. I tried different tools like Trafilatura, but the results are not relevant.
I don't know the ZIM format, but from what I quickly saw, it's also in HTML format.
Pull:

```
dev@dev-B550M-DS3H:~/Python/WikiExtractor$ cd wiktextract
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ git pull
Updating 5d6bb0e2..e992e954
Fast-forward
 src/wiktextract/clean.py | 23 +++++++++++++++++++----
 tests/test_clean.py      | 12 ++++++++++++
 2 files changed, 31 insertions(+), 4 deletions(-)
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ cd ../wikitextprocessor
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wikitextprocessor$ git pull
Already up to date.
```
Test on "Alpes-de-Haute-Provence": same errors.
Presence of `[Alt: Page d’aide sur l’homonymie]` and of `[Alt: icône décorative]`.
Rem: following your advice, I modified all the regexes.
NB: I'm away until Monday.
HTML is more complicated than wikitext, seriously? You could try HTML/XML parsers like lxml or Beautiful Soup to find the `body` element and use an element method or attribute to get its text.
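As a minimal sketch of that idea using only the standard library (with BeautifulSoup it would be `soup.body.get_text()`, with lxml `tree.find("body").text_content()`):

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text found inside <body>, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_text(html: str) -> str:
    """Return the whitespace-normalized text content of the <body> element."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Real wiki HTML dumps would still need extra filtering (navigation boxes, footers, etc.), which is where dedicated extractors earn their keep.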
The [Alt]s will remain. We can change the syntax if need be, but the alt-text needs to be distinguishable from 'normal' text somehow, and easily processed.
Surprise! With this code:

```python
tree = self.wxr.wtp.parse(text="{{Voir homonymes|Aisne}}")
text = clean_node(
    wxr=self.wxr,
    sense_data={},
    wikinode=tree,
    collect_links=False,
    node_handler_fn=clean_node_handler,
)
```

`text` is `'[Alt: Page d’aide sur l’homonymie]\nPour les articles homonymes, voir Aisne.'`, which according to your comments is correct.
But with this code:

```python
tree = self.wxr.wtp.parse(text="{{Voir homonymes|Aisne}}", expand_all=True)
text = clean_node(
    wxr=self.wxr,
    sense_data={},
    wikinode=tree,
    collect_links=False,
    node_handler_fn=clean_node_handler,
)
```

`text` is `'Pour les articles homonymes, voir Aisne.'`

[Alt] is gone! Which suits me.
In the documentation, it would be interesting to have examples of the different results of using the `pre_expand` and `expand_all` parameters of the parse() function.
Thank you for pointing out this mismatch, I will take a look at it.
But I note that using `expand_all=True` adds other information to the text that doesn't interest me.
As you indicate in https://github.com/tatuylonen/wikitextprocessor/issues/225#issuecomment-1996623492, I will remove the `[Alt: ...]` markers in post-processing.
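That post-processing step could be as simple as the following sketch (it assumes the `[Alt: ...]` syntax shown earlier in the thread, with no `]` inside the alt text):

```python
import re

# Matches "[Alt: ...]" markers, plus one optional trailing newline,
# assuming the alt text itself never contains a ']' character.
ALT_RE = re.compile(r"\[Alt:[^\]]*\]\n?")


def strip_alt_markers(text: str) -> str:
    """Remove image alt-text markers emitted by clean_node."""
    return ALT_RE.sub("", text)
```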
In some analyzed texts, we see leftover character sequences such as `{{!-}}`, `}}|`, `{{{blbla bla|}}}`, etc. For example, in the following articles: