maodun1978sohu / java-axp

Automatically exported from code.google.com/p/java-axp

Create an Apache Tika parser class #17

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Apache Tika (http://tika.apache.org/) is the leading open source text 
extraction framework, written in Java. It can extract text from many formats, 
including PDF, DOC, ODF and 30 more.

Tika is modular, and it only takes one Java class along with one property file 
to write a parser wrapper for Tika. I think java-axp could easily be exposed as 
a Tika parser plugin with just a few hours' work, which would enable all Tika 
users to parse the XPS format.

See this example of how the MS TNEF format is wrapped: 
http://github.com/jukka/jtnef/blob/master/src/net/freeutils/tnef/tika/TNEFParser.java
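
To give a rough idea of how little logic such a wrapper needs, here is a minimal sketch of the text-extraction core using only JDK classes. It is an illustration under stated assumptions, not the java-axp or Tika API: the class name `XpsTextSketch` and the method `extractText` are hypothetical, and neither Tika's parser interface nor java-axp's own classes appear in it because their signatures are not shown in this thread. It relies on the fact that an XPS file is a ZIP package whose `.fpage` parts carry text in the `UnicodeString` attribute of `Glyphs` elements. In an actual plugin, this logic would presumably be replaced by a call into java-axp and sit behind Tika's parser interface.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for the extraction step a Tika wrapper would delegate to. */
class XpsTextSketch {

    /** Media type a wrapper would declare as supported (registered type for XPS). */
    static final String XPS_MEDIA_TYPE = "application/vnd.ms-xpsdocument";

    /**
     * Collects the text of every Glyphs element found in the .fpage parts
     * of an XPS package, one line per Glyphs element.
     * Assumption: parts are visited in ZIP order, which is not guaranteed
     * to match the page order declared inside the package.
     */
    static String extractText(InputStream stream)
            throws IOException, SAXException, ParserConfigurationException {
        StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is populated

        ZipInputStream zip = new ZipInputStream(stream);
        for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
            if (!entry.getName().endsWith(".fpage")) {
                continue; // not a fixed-page part
            }
            // Buffer the current entry so the XML parser cannot close the ZIP stream.
            byte[] page = zip.readAllBytes();
            factory.newSAXParser().parse(new ByteArrayInputStream(page), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if ("Glyphs".equals(localName)) {
                        String run = attributes.getValue("UnicodeString");
                        if (run != null) {
                            text.append(run).append('\n');
                        }
                    }
                }
            });
        }
        return text.toString();
    }
}
```

Calling `XpsTextSketch.extractText(new FileInputStream("sample.xps"))` (with a file name of your choosing) would return the page text as a single string. The sketch needs Java 9 or later for `readAllBytes`, and it skips details a real parser would have to handle, such as page ordering, interleaved package parts, and escaped `UnicodeString` values.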

Original issue reported on code.google.com by cominv...@gmail.com on 4 Oct 2010 at 9:58