inarahd / jwktl

Automatically exported from code.google.com/p/jwktl

Etymology paragraph stripped of word hyperlinks #11

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Output the etymological information for a word.

What is the expected output? What do you see instead?

I did not know what to expect, so I output the etymology information for all 
words in the db. These are the first lines from the file created (most entries 
are like this):

English:dictionary::    , from , from , from , perfect past participle of + .
English:dictionary::    , from , from , from , perfect past participle of + .
English:free::  From , from , , from , . Compare West Frisian , Dutch , German 
, Danish .The verb comes from .
English:free::  From , from , , from , . Compare West Frisian , Dutch , German 
, Danish .The verb comes from .

What version of the product are you using? On what operating system?

I am using jwktl-1.0.1 as a Maven artifact on an Ubuntu 12.04 machine, with 
the Wiktionary dump enwiktionary-20141004-pages-articles.xml.

Please provide any additional information below.

The process runs smoothly, all other output information seems fine. This looks 
like an "overly eager" clean-up of the paragraphs, since etymological 
information is given in a slightly non-standard format. I am not sure if this 
format changed over time, or the etymological information was always provided 
like this.

Original issue reported on code.google.com by nast...@fbk.eu on 26 Jan 2015 at 11:28

GoogleCodeExporter commented 9 years ago
Hi. If you access the etymologies using IWiktionaryEntry.getWordEtymology(), 
you will obtain an IWikiString representation of the etymology. This class 
provides both a getPlainText() and a getText() method to obtain a string 
representation of the etymology. I assume that in your code you used the former 
(or an implicit toString(), which uses getPlainText(), too). The latter, 
however, allows you to work with the full markup encoded in Wiktionary. And 
yes: getPlainText() is too eager for etymology strings. I'm not sure if a plain 
text representation is necessary at all if you have the markup version. I have 
been experimenting with an EtymologyTemplateHandler for a while - you can find 
it in the api.util.TemplateParser file. Using this approach, it should be 
possible to analyze etymology strings. It's far from perfect, but probably a 
good starting point. If you make interesting changes to the JWKTL source code, 
I'm happy to integrate them. Just reopen this ticket or start a new one. Best 
wishes!
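
To illustrate the difference described above, here is a minimal, self-contained sketch that does not use JWKTL itself. It assumes the markup string is what getText() would return, and that the linked words sit inside {{...}} templates. The sample markup, the class name EtymologyMarkupSketch, and both helper methods are hypothetical and only for illustration. They are not the actual getPlainText() implementation, nor the EtymologyTemplateHandler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EtymologyMarkupSketch {
    // Matches a non-nested wiki template such as {{m|la|dictio}}.
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{([^{}]*)\\}\\}");

    /** Mimics an overly eager clean-up: every template is dropped entirely. */
    static String stripTemplates(String markup) {
        return TEMPLATE.matcher(markup).replaceAll("");
    }

    /** Replaces each template with its last positional parameter (assumed to be the word). */
    static String keepLastParameter(String markup) {
        Matcher m = TEMPLATE.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String[] parts = m.group(1).split("\\|");
            String word = parts[parts.length - 1];
            m.appendReplacement(out, Matcher.quoteReplacement(word));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented sample markup, not copied from a Wiktionary dump.
        String markup = "From {{m|la|dictionarium}}, from {{m|la|dictio}}.";
        System.out.println(stripTemplates(markup));    // prints "From , from ."
        System.out.println(keepLastParameter(markup)); // prints "From dictionarium, from dictio."
    }
}
```

The first method reproduces the kind of gaps seen in the reported output, while the second shows that the words are still recoverable when working from the markup. Real etymology templates vary in their parameter order, so taking the last parameter is only a rough heuristic.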

Original comment by chmeyer.de on 4 Feb 2015 at 3:07

GoogleCodeExporter commented 9 years ago
Thanks Christian! I wasn't planning to make changes, I was just trying to
see how much of the info in Wiktionary is formalized. I am allergic to
databases :)

all the best,
Vivi

Original comment by nast...@fbk.eu on 4 Feb 2015 at 3:16