Open psibre opened 9 years ago
@attacomsian any updates?
(sorry, accidentally closed it for a moment there)
I am working on it. I understand about 70% of the Python code and have started writing new code in Java. So far, I have set up the project, read the input file with a SAX reader, and created multiple threads to process the wiki pages in parallel. I am now working on extracting plain text from the wiki pages, which is a little tricky. Since I have no prior experience with Python, progress may be a bit slow, but it should not take much longer.
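For anyone following along, the SAX reading step can be sketched roughly as below. This is a minimal illustration, not the code in the actual port: the class name WikiPageReader and its read method are made up for this example, and it assumes the usual MediaWiki export layout where each page element carries a title and a text element.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the raw wiki markup of every page, keyed by title.
class WikiPageReader extends DefaultHandler {
    private final Map<String, String> pages = new LinkedHashMap<>();
    private final StringBuilder buffer = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start collecting character data afresh for each element.
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one element's text in several chunks, so append.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            title = buffer.toString();
        } else if (qName.equals("text")) {
            // Assumption: the title element precedes the text element within a page.
            pages.put(title, buffer.toString());
        }
    }

    // Parses the whole stream and returns title -> raw markup in document order.
    static Map<String, String> read(InputStream in) throws Exception {
        WikiPageReader handler = new WikiPageReader();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.pages;
    }
}
```

Holding every page in a map is only reasonable for small inputs. For a full dump, each page would more likely be handed off or written out inside endElement so that memory use stays flat.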
@attacomsian thanks for the update. I don't think we need multithreading at this point, because this is invoked from Gradle, which already has some parallelization built in. Just extracting the raw MediaWiki content from the compressed dump tarball to individual files is enough for this task.
The conversion from MediaWiki to plain text would be a separate issue, but that can be done using some third-party library (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor).
The Python script doesn't need to be ported to Java; we should just get rid of it.
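To make the scope concrete, writing one page to its own file could look something like the sketch below. PageWriter, fileNameFor, and the .txt suffix are hypothetical choices for illustration, and the rule for turning a title into a file name is an assumption, not something we have agreed on.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores the raw wiki markup of a single page as one file.
class PageWriter {
    // Assumed naming rule: keep letters, digits, '.', '_' and '-', replace the rest with '_'.
    static String fileNameFor(String title) {
        return title.replaceAll("[^\\p{L}\\p{N}._-]", "_") + ".txt";
    }

    // Creates the output directory if needed and writes the markup as UTF-8.
    static Path write(Path outputDir, String title, String wikiText) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(fileNameFor(title));
        Files.write(target, wikiText.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

Note that two different titles can map to the same file name under this rule (for example "AC/DC" and "AC DC"), so a real task would need some way to keep them apart.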
Okay. That is already done then.
I'll test it further and will finish it tomorrow.
Can we use a third-party library such as Apache Commons Compress to uncompress the dump tarball? I tried to do it using Java's native libraries, but they only support the .zip extension at the moment.
Sure! You may have to put them on the buildscript classpath before you can use them in the task, i.e., put
    buildscript {
        repositories {
            jcenter()
        }
        dependencies {
            classpath "org.apache.commons:commons-compress:1.10"
        }
    }
at the top of the build.gradle.
@seblemaguer can also help.
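As for the decompression step itself, here is a rough sketch of how the compressed dump might be opened as a plain stream. To stay runnable without extra dependencies it uses the JDK's own GZIPInputStream as a stand-in. DumpStreams and open are hypothetical names, and the .bz2 branch only marks where a Commons Compress stream such as BZip2CompressorInputStream would presumably go once the library is on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: picks a decompressor based on the file name.
class DumpStreams {
    static InputStream open(String fileName, InputStream raw) throws IOException {
        InputStream buffered = new BufferedInputStream(raw);
        if (fileName.endsWith(".gz")) {
            // gzip is covered by the JDK itself.
            return new GZIPInputStream(buffered);
        }
        if (fileName.endsWith(".bz2")) {
            // The JDK has no bzip2 reader. Assumption: with commons-compress available,
            // a BZip2CompressorInputStream wrapping 'buffered' would be returned here.
            throw new IOException("bzip2 needs a third-party decompressor: " + fileName);
        }
        // Anything else is treated as uncompressed.
        return buffered;
    }
}
```

The returned stream can be passed straight to a SAX parser, so the dump never has to be unpacked to disk first.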
See also the sax branch in my fork.