Java API usage: how to parse a single article ?

idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

Java API usage: how to parse a single article ? #43

Closed thomasopsomer closed 7 years ago

thomasopsomer commented 7 years ago

Hi,

I'm trying to parse a single Wikipedia XML file, like the mercedes.xml in this repo's tests. Following the code in the test section, I tried something like:

import it.cnr.isti.hpc.wikipedia.article.Article
import it.cnr.isti.hpc.wikipedia.parser.ArticleParser

val parser = new ArticleParser("en")
val testXml = IOUtils.getFileAsUTF8String("./mercedes.xml")
val testArticle = new Article()
parser.parse(testArticle, testXml)

But the result is strange. Many properties are blank, such as title, wikiTitle, ... and the paragraphs and clean text are also wrongly parsed. I guess I'm doing something wrong ^^ If you could show how to use the API to process a single article in XML, that would be great :)

Thanks, Thomas

diegoceccarelli commented 7 years ago

@thomasopsomer could you please post the ./mercedes.xml file?

thomasopsomer commented 7 years ago

This file: https://github.com/idio/json-wikipedia/blob/development/src/test/resources/en/mercedes.xml

Looking again, I see that the tests use mercedes.txt. So do I need to give the text field of the XML to the parser?

tgalery commented 7 years ago

Hi @thomasopsomer, you answered your own question. Parsing a single article means parsing the value of the text field of the page node in the XML. So if you pass that value to the parser's parse method, you should get decent results. Let us know if you don't.
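For anyone landing here later, the step described above (pulling the text field out of the page node before calling the parser) could be sketched roughly as follows. This is only an illustration using the JDK's built-in XML parser; the class and method names (PageTextExtractor, extractWikitext) are hypothetical and not part of json-wikipedia, and the call into ArticleParser is shown only in a comment, copied from the snippet in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of json-wikipedia.
final class PageTextExtractor {

    /** Returns the content of the first <text> element, or null if there is none. */
    static String extractWikitext(String pageXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(pageXml)));
            // In a MediaWiki export, the wikitext lives in page/revision/text.
            NodeList textNodes = doc.getElementsByTagName("text");
            if (textNodes.getLength() == 0) {
                return null;
            }
            return textNodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read page XML", e);
        }
    }

    public static void main(String[] args) {
        String pageXml = "<page><title>Example</title><revision>"
            + "<text>'''Example''' is a [[sample]] page.</text>"
            + "</revision></page>";
        String wikitext = extractWikitext(pageXml);
        System.out.println(wikitext);
        // With json-wikipedia on the classpath, the wikitext (not the whole XML)
        // is what goes to the parser, as in the snippet from the question:
        //   Article article = new Article();
        //   new ArticleParser("en").parse(article, wikitext);
    }
}
```

The point of the sketch is simply that the parser receives the wikitext string, not the surrounding XML.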

thomasopsomer commented 7 years ago

Cool, it's working now! At first sight I didn't understand that all the information was in the text field. I've been using json-wikipedia for a while, but always as a black box! Anyway, thanks for the help, and thanks for implementing the "links with span offsets" feature :)

tgalery commented 7 years ago

Closing this issue then.