tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Error template tree for single page without cache #184

Closed — ilyafreer closed this 1 year ago

ilyafreer commented 2 years ago
# path = 'wikitext.txt'
# For example, wikitext.txt contains the wikitext from
# https://en.wiktionary.org/w/index.php?title=love&action=edit

def parse(term_name, path):
    with open(path) as f:
        text = f.read()

    ctx = Wtp(num_threads=None, cache_file=None, quiet=False, lang_code="en")  # does not work
    # ctx = Wtp(num_threads=None, cache_file=WTP_CACHE_FILENAME)  # this works
    ctx.add_page('wikitext', term_name, text, True)
    ctx.analyze_templates()

    wikt_config = WiktionaryConfig(capture_linkages=False)
    return parse_page(ctx, term_name, text, wikt_config)

This returns empty data.

kristian-clausal commented 2 years ago

I tried, but couldn't replicate this. Please provide more information about what you are doing, I'll put here what I did in the Python interactive interpreter:

EDIT: I ran the interactive interpreter in my wiktextract repository directory where wiktwords is; wikitextprocessor and wiktextract are in my Python path, so not using the ones installed with pip (which are always out of date).

$ python
Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from wikitextprocessor import Wtp
>>> from wiktextract import WiktionaryConfig
>>> from wiktextract.page import parse_page
>>> 
>>> 
>>> def parse(term_name, path):
...     with open(path) as f:
...         text = f.read()
...     ctx=Wtp(num_threads=None, cache_file=None, quiet=False, lang_code="en")
...     ctx.add_page('wikitext', term_name, text, True)
...     ctx.analyze_templates()
...     wikt_config = WiktionaryConfig(capture_linkages=False)
...     return parse_page(ctx, term_name, text, wikt_config)
... 
>>> parse("subscriber", "p/su/subscriber.txt")
subscriber/English/noun: DEBUG: skipping string in the middle of translations: Template:trans-mid at ['subscriber']
subscriber/English/noun: DEBUG: skipping string in the middle of translations: Template:trans-bottom at ['subscriber']
[{'senses': [{'raw_glosses': ['A person who subscribes to a publication or a service.'], 'glosses': ['A person who subscribes to a publication or a service.']}, {'raw_glosses': ['A system or component that subscribes to something, such as an event, made available by a publisher.'], 'examples': [{'text': "An event aggregator facilitates a fire-and-forget model of communication. The object triggering the event doesn't care if there are any subscribers.", 'ref': '2013, Addy Osmani, Developing Backbone.js Applications (page 175)'}], 'glosses': ['A system or component that subscribes to something, such as an event, made available by a publisher.']}], 'pos': 'noun', 'etymology_text': '', 'etymology_templates': [], 'categories': ['en:People'], 'word': 'subscriber', 'lang': 'English', 'lang_code': 'en'}]

First I thought you might be just missing a step, like ctx.start_page() or something similar, but then your original code just worked for me; so then I thought I might be accidentally using the cache, but renaming wikt-cache and the pickled file didn't change things. I tested with pages/Words/lo/love.txt (and then some others), and it worked.

ilyafreer commented 2 years ago

Yes, you're right, it works. But I need to get the synonym and antonym values for senses, like in this screenshot: Hq23Zf1.md.png

I made a fork for that: https://github.com/tatuylonen/wiktextract/commit/4a76ff41c5f0efa2f44d4e89794093b34b48feb1

Then I ran this command from Django:

import os
import environ
from django.conf import settings
from django.core.management.base import BaseCommand
from wikitextprocessor import Wtp
from wiktextract import WiktionaryConfig
from wiktextract.page import parse_page

WTP_CACHE_FILENAME = settings.PROJECT_ROOT + "/../cache/wtp-cache"

class Command(BaseCommand):
    def __init__(self):
        super().__init__()
        self.env = environ.Env()
        environ.Env.read_env()

    def handle(self, *args, **options):
        path = '/usr/src/app/wiktionary_parser/tests/sense_ralations/love.txt'
        term_name = 'love'
        self.stdout.write(self.style.NOTICE('Command started!'))

        with open(path) as f:
            text = f.read()

        ctx = Wtp(num_threads=None, cache_file=None, quiet=False, lang_code="en")
        # ctx = Wtp(num_threads=None, cache_file=WTP_CACHE_FILENAME)
        ctx.add_page('wikitext', term_name, text, True)
        ctx.analyze_templates()
        wikt_config = WiktionaryConfig(capture_linkages=False)
        parsed = parse_page(ctx, term_name, text, wikt_config)
        print(parsed)

I kept a log file. In the first run, with cache_file set, the log contains related_synonyms and related_antonyms nodes with the right values. In the second run, without cache_file, the output differs in many places, and the related_synonyms and related_antonyms nodes have empty values.

Part of the log from the run with cache_file:

"related_synonyms":[
         {
            "sense":"have a strong affection for",
            "synonyms":[
               "adore",
               "cherish",
               "love"
            ]
         },
         {
            "sense":"have sexual intercourse with",
            "synonyms":[
               "enjoy",
               "go to bed with",
               "sleep with",
               "copulate with"
            ]
         }
      ],

Part of the log from the run without cache_file:

"related_synonyms":[
         {
            "sense":[

            ],
            "synonyms":[
               "",
               "",
               "love"
            ]
         },

I committed example logs.

How can I get this data without a cache?

Thank you very much!

ilyafreer commented 1 year ago

Here, https://github.com/tatuylonen/wiktextract/blob/330124aaa2d8c0a0f8494c3b36102de6f0966a1d/wiktextract/page.py#L3027, parsing without a cache returns half-empty data for the Synonyms and Antonyms blocks of the level tree:

<LEVEL5(['Synonyms']){} '\n', <LIST(*){} <LIST_ITEM(*){} ' ', <HTML(strong){'class': 'error'} 'Template:sense'>, ' ', <HTML(strong){'class': 'error'} 'Template:l'>, ', ', <HTML(strong){'class': 'error'} 'Template:l'>, '; see also ', <LINK(['Thesaurus:love']){} >, '\n'>, <LIST_ITEM(*){} ' ', <HTML(strong){'class': 'error'} 'Template:sense'>, ' ', <HTML(strong){'class': 'error'} 'Template:l'>, ', ', <HTML(strong){'class': 'error'} 'Template:l'>, ', ', <HTML(strong){'class': 'error'} 'Template:l'>, '; see also ', <LINK(['Thesaurus:copulate with']){} >, '\n'>>, '\n'>, <LEVEL5(['Antonyms']){} '\n', <LIST(*){} <LIST_ITEM(*){} ' ', <LINK(['hate']){} >, ', ', <LINK(['despise']){} >, ', ', <LINK(['fear']){} >, '\n'>>, '\n'>, <LEVEL5(['Derived terms']){} '\n', <HTML(strong){'class': 'error'} 'Template:der4'>, '\n\n'>, <LEVEL5(['Related terms']){} '\n', <HTML(strong){'class': 'error'} 'Template:rel3'>, '\n\n'>,
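Those `<HTML(strong){'class': 'error'} 'Template:...'>` nodes are exactly the templates that failed to expand. A small stdlib helper (the name and approach are mine, not part of wiktextract) can pull the missing template names out of such a tree dump:

```python
import re

# Matches the error nodes wikitextprocessor emits for unexpanded templates,
# e.g. <HTML(strong){'class': 'error'} 'Template:sense'>
ERROR_NODE_RE = re.compile(r"<HTML\(strong\)\{'class': 'error'\} 'Template:([^']+)'>")

def missing_templates(tree_repr: str) -> list:
    """Return the names of unexpanded templates, deduplicated, in order."""
    seen = []
    for name in ERROR_NODE_RE.findall(tree_repr):
        if name not in seen:
            seen.append(name)
    return seen
```

Running it over the dump above would report `sense`, `l`, `der4`, and `rel3` as the templates whose definitions were unavailable.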

ilyafreer commented 1 year ago

The problem is the lack of template definitions when parsing without a cache: https://github.com/tatuylonen/wikitextprocessor/blob/bd361d75d2f0c5f1a19b7c97f64729a5dda31153/wikitextprocessor/core.py#L1174
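One way to see what would have to be supplied up front is to scan the raw wikitext for template invocations. This is only a rough sketch (a regex is not a full wikitext parser, and the helper name is hypothetical), but it catches the simple `{{name|...}}` cases while skipping parser functions like `{{#if:...}}`:

```python
import re

# First character must not be '#' (parser function) or other markup;
# capture up to the first '|' or '}}'.
TEMPLATE_RE = re.compile(r"\{\{\s*([^#|{}<>\[\]\n][^|{}\n]*?)\s*[|}]")

def referenced_templates(wikitext: str) -> set:
    """Return the set of template names a page's wikitext invokes."""
    return {m.group(1) for m in TEMPLATE_RE.finditer(wikitext)}
```

The resulting names tell you which `Template:` pages (and, transitively, which Lua modules they `#invoke`) are missing when only a single page's wikitext is loaded.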

ilyafreer commented 1 year ago

If I manually add the templates for the sense and l sections:

{"sense": "{{#if:{{{11|}}}|{{#invoke:qualifier/templates|qualifier_t}}|{{qualifier|{{{1|{{error|A parameter must be given to the sense template.}}}}}|{{{2|}}}|{{{3|}}}|{{{4|}}}|{{{5|}}}|{{{6|}}}|{{{7|}}}|{{{8|}}}|{{{9|}}}|{{{10|}}}}}}}<span class=\"ib-colon sense-qualifier-colon\">:</span>", "l": "{{#invoke:links/templates|l_term_t}}{{#ifeq:{{PAGENAME}}|RecentChanges||{{#ifeq:{{{1|}}}|und|[[Category:Undetermined language links]]}}}}{{redlink category|{{{1|}}}|{{{2|}}}|template=l}}"}

I get this error:

love: DEBUG: unexpected top-level node: <HTML(strong){'class': 'error'} 'Template:also'> at ['love']
love/English/verb: ERROR: LUA error in #invoke ('links/templates', 'l_term_t') parent ('Template:l', {1: 'en', 2: 'adore'}) at ['love', 'l', '#invoke']
[string "_sandbox_phase2"]:206: Could not find module links/templates: module not found
stack traceback:
        [C]: in function 'error'
        [string "_sandbox_phase2"]:206: in function <[string "_sandbox_phase2"]:140>
        Traceback (most recent call last):
        File "/pyroot/wikitextprocessor/wikitextprocessor/luaexec.py", line 625, in call_lua_sandbox
    ret = ctx.lua_invoke(modname, modfn, frame, ctx.title, timeout)
        File "lupa/_lupa.pyx", line 587, in lupa._lupa._LuaObject.__call__
        File "lupa/_lupa.pyx", line 1333, in lupa._lupa.call_lua
        File "lupa/_lupa.pyx", line 1359, in lupa._lupa.execute_lua_call
        File "lupa/_lupa.pyx", line 1295, in lupa._lupa.raise_lua_error
        lupa._lupa.LuaError: [string "_sandbox_phase2"]:206: Could not find module links/templates: module not found
stack traceback:
        [C]: in function 'error'
        [string "_sandbox_phase2"]:206: in function <[string "_sandbox_phase2"]:140>
xxyzz commented 1 year ago

I don't think it's possible to expand templates using just the wikitext of a single page; you have to provide the text/code of all the templates and modules it uses. In this case, that means the "sense" template and the "links/templates" module. The cache file is normally used when debugging a single page; if you're extracting the whole dump file, the cache file isn't needed.

kristian-clausal commented 1 year ago

I'll ask Tatu if it's feasible to create a toggle or to make it work without the cache.

kristian-clausal commented 1 year ago

Sorry for the late reply, circumstances circumstanced.

Tatu confirmed that without the pre-generated cache, it's not technically feasible to get anything from other pages, like templates or links. The cache is just a saved version of one of the first steps of the whole wiktextract process, which builds the data that links all of these pages together correctly. Without a cache, processing a single page would have to do all the work of generating the cache and then throw it away afterwards. Currently, that preprocessing cache-generating step is skipped when processing one page, which means that running it on one page takes seconds instead of tens of minutes, hours, or days.
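As a toy illustration of that description (this is NOT wikitextprocessor's actual cache format, just the idea of a saved preprocessing pass):

```python
import pickle

# The cache is conceptually a one-time pass over the dump that saves every
# page's content, so later single-page runs can resolve cross-page
# references (templates, modules, links) without re-reading the whole dump.

def build_cache(pages: dict, path: str) -> None:
    """Persist a title -> wikitext mapping to disk (the expensive step)."""
    with open(path, "wb") as f:
        pickle.dump(pages, f)

def load_cache(path: str) -> dict:
    """Load the saved title -> wikitext mapping (the cheap step)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The real cache also stores the results of template analysis, which is why a single-page run with the cache can expand `{{sense}}` and `{{l}}` while a run without it cannot.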

I guess one way to make this easier for users would be to just... make a version of wikt-cache and wikt-cache.pickle downloadable from kaikki.org? At least for en.wiktionary. The problem there is that wikt-cache is intimately tied to a very specific Wiktionary dump file, and you'd need to extract all the pages/ files anyhow (with wiktextract; I'm not sure whether pages/ has the same directory structure as what is inside the dump files)... I'll ask Tatu about this.

Closing this for now.