seanjensengrey / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Incorrect characters in Extractor output #53

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I have a one-liner that tries to extract text from a Hungarian site containing 
special characters like "ő" and "ű".

Command line query is this:
# java de/l3s/boilerpipe/demo/ExtractMe http://sportgeza.hu/2012/london/cikkek/nem_schmitt_pal_hagyta_jova_a_rossz_himnuszt

And here's my code:
# cat de/l3s/boilerpipe/demo/ExtractMe.java
package de.l3s.boilerpipe.demo;

import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ExtractMe {
    public static void main(final String[] args) throws Exception {
        final URL url = new URL(args[0]);
        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }
}

*(partial) Extracted content:
... megfelel? himnuszt játszák a magyar gy?ztesek tiszteletére, akikb?l 
remélik, hogy minél több lesz...

In the extracted text, each "?" should be an "ő" character, but in the final 
output all I get is the byte 3F in hex, which is a literal question mark.
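
A minimal sketch of where a literal 3F can come from, assuming the text is encoded with a charset that has no "ő" (whether that is what happens on the reporter's machine is an assumption, not something the report confirms). Java replaces characters it cannot encode with the charset's replacement byte, which is "?" for US-ASCII and ISO-8859-1. The class name QuestionMarkDemo and the helper toHex are made up for this illustration.

```java
import java.nio.charset.Charset;

public class QuestionMarkDemo {

    // Encode the text with the named charset and return the bytes as uppercase hex.
    static String toHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // "ő" (U+0151), written as an escape so the source file encoding does not matter.
        String text = "\u0151";
        System.out.println("US-ASCII:   " + toHex(text, "US-ASCII"));   // 3F
        System.out.println("ISO-8859-1: " + toHex(text, "ISO-8859-1")); // 3F
        System.out.println("UTF-8:      " + toHex(text, "UTF-8"));      // C591
    }
}
```

If the JVM's default charset is one of the first two, System.out.println would lose the character in the same way, regardless of what the terminal can display.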

I'm under 
#uname FreeBSD pdfgen 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 
UTC 2010     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
# java -version
java version "1.6.0_07"
Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02)
Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

Been working on a solution for days, but I can't seem to find a reason why it 
wouldn't work :/

BTW, curl outputs the characters correctly when called on a UTF-8 terminal, 
but boilerpipe fails to display even those special characters, which were 
fine in the original page.

I'd appreciate any help/ideas, best
M

Original issue reported on code.google.com by mihaly.k...@gmail.com on 31 Jul 2012 at 3:38

GoogleCodeExporter commented 9 years ago
..and I'm using boilerpipe 1.2.0

Original comment by mihaly.k...@gmail.com on 31 Jul 2012 at 3:39

GoogleCodeExporter commented 9 years ago
Hello, did you manage to solve it on your own?

Original comment by tsz...@gmail.com on 10 Sep 2012 at 4:08

GoogleCodeExporter commented 9 years ago
Hello, not really. I use PHP to analyze the output of boilerpipe and estimate 
the charset, but ideally I wouldn't have to do that.
I did find a shell wrapper for boilerpipe that seemed to work: 
https://github.com/theneubeck/boilerpipe-server
It didn't fit my needs, so I decided to use a PHP middle layer, but some might 
find it helpful.

Original comment by mihaly.k...@gmail.com on 10 Sep 2012 at 6:36
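
For readers who would rather do a charset check in Java than in a PHP layer, here is a rough sketch of one simple heuristic: try a strict UTF-8 decode and see whether it fails. This is not what the PHP code in the comment above does (that code is not shown); the class name Utf8Check and the method isValidUtf8 are invented for this example. The sample bytes assume that a non-UTF-8 Hungarian page would be ISO-8859-2, where "ő" is the single byte F5.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8Check {

    // Return true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "győz" encoded as UTF-8: ő is the two bytes C5 91.
        byte[] utf8 = {0x67, 0x79, (byte) 0xC5, (byte) 0x91, 0x7A};
        // "győz" encoded as ISO-8859-2: ő is the single byte F5.
        byte[] latin2 = {0x67, 0x79, (byte) 0xF5, 0x7A};
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin2)); // false
    }
}
```

A passing check only says the bytes are well-formed UTF-8; pure ASCII input passes too, so this is a heuristic rather than real charset detection.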

GoogleCodeExporter commented 9 years ago
Found the solution:
Here is the Java code needed to fix the special characters issue:

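import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ExtractMe {
    public static void main(final String[] args) throws Exception {
        // Read the HTML from standard input, decoding it explicitly as UTF-8.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        // Encode the output as UTF-8 instead of the platform default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(ArticleExtractor.INSTANCE.getText(in));
    }
}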

Original comment by mihaly.k...@gmail.com on 18 Sep 2013 at 1:20
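
The output half of the fix above can be tried without boilerpipe. The sketch below prints the same word through a PrintStream created with an explicit charset and captures the bytes in memory, so the result does not depend on the terminal. The class name Utf8OutputDemo and the helper printWith are hypothetical; US-ASCII stands in for a default charset that cannot represent "ő", which is an assumption about the reporter's setup.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8OutputDemo {

    // Print the text through a PrintStream using the given charset
    // and return the raw bytes that were written.
    static byte[] printWith(String text, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String text = "gy\u0151ztesek"; // "győztesek", with ő written as an escape

        // With a charset that lacks ő, the character is replaced by '?'.
        String viaAscii = new String(printWith(text, "US-ASCII"), "US-ASCII");
        // With UTF-8, the text survives a round trip unchanged.
        String viaUtf8 = new String(printWith(text, "UTF-8"), "UTF-8");

        System.out.println(viaAscii.equals("gy?ztesek")); // true
        System.out.println(viaUtf8.equals(text));         // true
    }
}
```

Since the fixed ExtractMe reads from System.in rather than from a URL, it is presumably meant to receive the page on standard input, for example piped from curl; that invocation is a guess, as the comment does not show it.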