marytts / marytts-testing

Functional tests for MaryTTS

Replace python script with SAX (or similar) parser to extract wiki markup from compressed dumps #3

Open psibre opened 9 years ago

psibre commented 9 years ago

See also sax branch in my fork.

psibre commented 8 years ago

@attacomsian any updates?

psibre commented 8 years ago

(sorry, accidentally closed it for a moment there)

attacomsian commented 8 years ago

I am working on it. I understand about 70% of the Python code and have started writing new code in Java. So far, I have set up the project, read the input file via a SAX reader, and created multiple threads to process the wiki pages in parallel. I am now working on extracting plain text from the wiki pages, which is a little tricky. Since I have no prior experience with Python, progress may be slow, but it should not take a lot of time.

psibre commented 8 years ago

@attacomsian thanks for the update. I don't think that we need multithreading at this point, because it's invoked from Gradle, which already has some parallelization built in. Just extracting the raw MediaWiki content from the compressed dump tarball to individual files is enough for this task.

The conversion from MediaWiki markup to plain text would be a separate issue, but that can be done using some third-party library.

The Python script doesn't need to be ported to Java; we should just get rid of it.
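
For illustration, here is a minimal sketch of the SAX-based extraction described above. It is not the code from the sax branch: the class name `WikiDumpExtractor` and the method `extractPages` are hypothetical, and it assumes the usual MediaWiki export layout of `<page>` elements containing a `<title>` and a `<revision>` with a `<text>` element. It also assumes the input stream is already decompressed, and it collects pages in a map only to keep the example short; a real task would more likely write each page to its own file as it is encountered.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of marytts-testing.
class WikiDumpExtractor {

    /** Streams through the dump and records the raw wiki markup of each page by title. */
    static class PageHandler extends DefaultHandler {
        final Map<String, String> pages = new LinkedHashMap<>();
        private final StringBuilder buffer = new StringBuilder();
        private boolean capturing;
        private String title;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Only the character data of <title> and <text> is of interest.
            if (qName.equals("title") || qName.equals("text")) {
                buffer.setLength(0);
                capturing = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so append rather than assign.
            if (capturing) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                title = buffer.toString();
                capturing = false;
            } else if (qName.equals("text")) {
                // A real task would write this markup to a file here instead of keeping it in memory.
                pages.put(title, buffer.toString());
                capturing = false;
            }
        }
    }

    /** Parses an already decompressed MediaWiki XML stream and returns title -> raw markup. */
    static Map<String, String> extractPages(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PageHandler handler = new PageHandler();
        parser.parse(in, handler);
        return handler.pages;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki>"
                + "<page><title>Alpha</title><revision><text>'''Alpha''' is a letter.</text></revision></page>"
                + "<page><title>Beta</title><revision><text>[[Beta]] follows it.</text></revision></page>"
                + "</mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, String> page : extractPages(in).entrySet()) {
            System.out.println(page.getKey() + ": " + page.getValue());
        }
    }
}
```

Because SAX is a streaming API, memory use stays roughly constant regardless of dump size as long as each page is written out rather than accumulated. For a compressed dump, the `InputStream` passed to `extractPages` could be wrapped in a decompressing stream first.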

attacomsian commented 8 years ago

Okay. That is already done then.

On Tue, Nov 3, 2015, 12:42 PM psibre notifications@github.com wrote:

@attacomsian thanks for the update. I don't think that we need multithreading at this point, because it's invoked from Gradle, which already has some parallelization built in. Just extracting the raw MediaWiki content from the compressed dump tarball to individual files is enough for this task.

The conversion from mediawiki to plain text would be a separate issue, but that can be done using some third-party library http://medialab.di.unipi.it/wiki/Wikipedia_Extractor.

The python script doesn't need to be ported to java, we should just get rid of it.


attacomsian commented 8 years ago

I'll test it further and will finish it tomorrow.

attacomsian commented 8 years ago

Can we use a third-party library such as Apache Commons Compress to uncompress the dump tarball? I tried to do it using the Java native libraries, but they only support the .zip extension at the moment.

psibre commented 8 years ago

Sure! You may have to put the library on the buildscript classpath before you can use it in the task, i.e., put

buildscript {
  repositories {
    jcenter()
  }
  dependencies {
    classpath "org.apache.commons:commons-compress:1.10"
  }
}

at the top of the build.gradle. @seblemaguer can also help.