molybdenum-99 / infoboxer

Wikipedia information extraction library
MIT License

Parse templates in Wiktionary #88

Open Nakilon opened 3 years ago

Nakilon commented 3 years ago

I wonder how to use it to get Wiktionary data, for example the Etymology section for "Russian".

Infoboxer.wiktionary.get("Russian").sections.first.sections("Etymology").text

=> "{der:en|ML.|-} (11th century) {m:la|Russiānus}, the adjective of {m:la|Russia}, a Latinization of the {der:en|orv|Русь}. Attested in English (both as a noun and as an adjective) from the 16th century.\n\n"

How do I replace templates like {der:en|ML.|-} with their expanded text, to get:

Medieval Latin (11th century) Russiānus, the adjective of Russia, a Latinization of the Old East Slavic Русь (Rusĭ). Attested in English (both as a noun and as an adjective) from the 16th century.

zverok commented 2 years ago

There is (kind of) an answer, but you won't like it :( What you want is called "template expansion", and Infoboxer can't do it by itself (it was meant for "template extraction" instead), so you'll need to call the low-level API:

Infoboxer.wiktionary.api.expandtemplates.text('{{m|la|Russiānus}}').prop(:wikitext).response['wikitext']
# => <i class="Latn mention" lang="la">[[Russianus#Latin|Russiānus]]</i>
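For orientation, here is a minimal sketch of the HTTP request that call presumably boils down to, assuming it maps onto MediaWiki's `action=expandtemplates` API module. The endpoint constant and both helper names are illustrative and not part of Infoboxer; the sketch only builds the URL and parses a response body, so it runs without network access.

```ruby
require 'uri'
require 'json'

# Assumption: the public MediaWiki API endpoint of English Wiktionary.
API_ENDPOINT = 'https://en.wiktionary.org/w/api.php'

# Build the GET URL that asks MediaWiki to expand a piece of wikitext.
def expandtemplates_uri(wikitext)
  uri = URI(API_ENDPOINT)
  uri.query = URI.encode_www_form(
    action: 'expandtemplates',
    text: wikitext,
    prop: 'wikitext',
    format: 'json'
  )
  uri
end

# Pull the expanded wikitext out of a JSON response body.
def expanded_wikitext(json_body)
  JSON.parse(json_body).dig('expandtemplates', 'wikitext')
end

uri = expandtemplates_uri('{{m|la|Russiānus}}')
puts uri
# To actually perform the request (needs network access):
#   require 'net/http'
#   puts expanded_wikitext(Net::HTTP.get(uri))
```

Going through `Infoboxer.wiktionary.api` is still preferable where it works, since it reuses the client's existing configuration.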

...unfortunately, to expand a template you need its source wikitext, and Infoboxer, somewhat dumbly, doesn't provide a way to get it. The best workaround is to approximate the source by rebuilding it from the parsed template:

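# Monkey-patch: rebuild an approximation of the template's wikitext source.
class Infoboxer::Tree::Template
  def source
    [
      "{{#{name}",
      *unnamed_variables.map(&:text),                      # positional arguments
      *named_variables.map { |v| "#{v.name}=#{v.text}" },  # name=value arguments
    ].join('|') + '}}'
  end
end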

wiktionary = Infoboxer.wiktionary
section = wiktionary.get("Russian").sections.first.sections("Etymology")

section.templates.map(&:source).each { |t|
  puts t
  puts wiktionary.api.expandtemplates.text(t).prop(:wikitext).response['wikitext']
}

...this will print

{{der|en|ML.|-}}
<span class="etyl">[[w:Medieval Latin|Medieval Latin]][[Category:English terms derived from Medieval Latin|API]]</span>
{{m|la|Russiānus}}
<i class="Latn mention" lang="la">[[Russianus#Latin|Russiānus]]</i>
{{m|la|Russia}}
<i class="Latn mention" lang="la">[[Russia#Latin|Russia]]</i>
{{der|en|orv|Русь}}
<span class="etyl">[[w:Old East Slavic|Old East Slavic]][[Category:English terms derived from Old East Slavic|API]]</span> <i class="Cyrs mention" lang="orv">[[Русь#Old East Slavic|Русь]]</i> <span class="mention-gloss-paren annotation-paren">(</span><span lang="orv-Latn" class="mention-tr tr Latn">Rusĭ</span><span class="mention-gloss-paren annotation-paren">)</span>
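The source-rebuilding logic can be tried without Infoboxer or network access. The sketch below is a standalone approximation: `Var` and `template_source` are hypothetical stand-ins rather than Infoboxer classes, and it assumes positional variables are the ones whose names are purely numeric.

```ruby
# Stand-in for a parsed template variable: just a name and its text.
Var = Struct.new(:name, :text)

# Rebuild "{{name|arg1|arg2|key=value}}" from a template name and variables.
# Assumption: positional variables have purely numeric names ("1", "2", ...).
def template_source(name, variables)
  positional, named = variables.partition { |v| v.name.match?(/\A\d+\z/) }
  parts = [
    name,
    *positional.sort_by { |v| v.name.to_i }.map(&:text),
    *named.map { |v| "#{v.name}=#{v.text}" }
  ]
  "{{#{parts.join('|')}}}"
end

der = [Var.new('1', 'en'), Var.new('2', 'orv'), Var.new('3', 'Русь')]
puts template_source('der', der)
# prints: {{der|en|orv|Русь}}

# The named variable "t" here is made up purely for illustration.
m = [Var.new('1', 'la'), Var.new('2', 'Russia'), Var.new('t', 'gloss')]
puts template_source('m', m)
# prints: {{m|la|Russia|t=gloss}}
```

Two caveats apply to any reconstruction of this kind: a positional value that itself contains `=` would be read by MediaWiki as a named argument, and gaps in the positional numbering (say, `1` and `3` without `2`) are silently collapsed.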

...but, unfortunately again, you are mostly on your own when extracting readable text from that expanded wikitext. Infoboxer's parser can help a bit, though:

section.templates.map(&:source).each { |t|
  print "expanding `#{t}`: "
  expanded = wiktionary.api.expandtemplates.text(t).prop(:wikitext).response['wikitext']
  puts Infoboxer::Parser.inline(expanded).text
}

output:

expanding `{{der|en|ML.|-}}`: Medieval LatinAPI
expanding `{{m|la|Russiānus}}`: Russiānus
expanding `{{m|la|Russia}}`: Russia
expanding `{{der|en|orv|Русь}}`: Old East SlavicAPI Русь (Rusĭ)

(Yeah, those stray "API" strings are weird: they are the text after the pipe in the [[Category:...|API]] links, which the parser keeps as link text. But it is what it is.)
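One possible way around the stray "API" text is to drop category links from the expanded wikitext before handing it to the parser. This is a minimal sketch, not an Infoboxer feature; `strip_category_links` and the `CATEGORY_LINK` pattern are hypothetical helpers, and the pattern assumes category links contain no nested `]`.

```ruby
# Matches [[Category:...]] links, including an optional "|sortkey" part.
CATEGORY_LINK = /\[\[\s*Category\s*:[^\]]*\]\]/i

# Remove category links; ordinary [[target|label]] links are left untouched.
def strip_category_links(wikitext)
  wikitext.gsub(CATEGORY_LINK, '')
end

expanded = '<span class="etyl">[[w:Medieval Latin|Medieval Latin]]' \
           '[[Category:English terms derived from Medieval Latin|API]]</span>'
puts strip_category_links(expanded)
# prints: <span class="etyl">[[w:Medieval Latin|Medieval Latin]]</span>
```

In the loop above, the cleaned string would then go to `Infoboxer::Parser.inline(...)` in place of the raw `expanded` value.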