sorenmacbeth / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

IDN <-> ACE Domain Names #3

Closed: GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

this module is incredibly good, but it cannot handle domain names with
(German) "Umlaute" (Ä, Ö, Ü, ...). Any ideas how to deal with this problem?

Thanks,
Felix.

Original issue reported on code.google.com by feliz...@gmx.de on 21 Jan 2010 at 2:32

GoogleCodeExporter commented 9 years ago
Hi Felix,

if I understand correctly, you called DefaultExtractor#getText(URL) with a URL
like "http://www.äöü.xyz/".
This seems to be unsupported by Java 6 (see
http://java.sun.com/docs/books/tutorial/i18n/network/iri.html).
Strictly speaking, what you passed was an IRI, not a URL.

A workaround for now could be to convert the host name to its ASCII (ACE) form when creating the URL:
URL u = new URL("http://"+IDN.toASCII("www.äöü.xyz")+"/");
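
A slightly fuller sketch of the same idea, using only the JDK (java.net.IDN has been available since Java 6). The class and method names (IdnUrls, toAsciiHost, httpUrl) are made up for illustration and are not part of boilerpipe. It assumes you have the bare host name at hand and want a plain "http://host/" URL:

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of boilerpipe.
final class IdnUrls {

    // Converts a Unicode host name (IDN) to its ASCII-compatible encoding (ACE).
    // Host names that are already plain ASCII are returned unchanged.
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    // Builds "http://<ace-host>/" as a string that java.net.URL can parse.
    static String httpUrl(String unicodeHost) {
        return "http://" + toAsciiHost(unicodeHost) + "/";
    }

    public static void main(String[] args) throws MalformedURLException {
        // "www.äöü.xyz", written with Unicode escapes so that the source
        // file encoding does not matter.
        String host = "www.\u00e4\u00f6\u00fc.xyz";
        URL u = new URL(httpUrl(host));
        System.out.println(u);                            // ACE form of the URL
        System.out.println(IDN.toUnicode(u.getHost()));   // back to the Unicode form
    }
}
```

Only the host part is converted here. Non-ASCII characters in the path or query of an IRI would need separate percent-encoding, which this sketch does not attempt.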

However, since the getText(URL) method is explicitly marked as "show case
only", you might also consider using a dedicated HTTP client library, such as
HttpClient (http://hc.apache.org/), and calling getText(InputSource) instead.
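
As a rough sketch of the second half of that approach: once the HTML has been downloaded by whatever HTTP client you prefer, it can be wrapped in an InputSource. This assumes that the InputSource accepted by getText is org.xml.sax.InputSource from the JDK. The class and method names (HtmlInputSources, fromHtml) are hypothetical, and the download step is left out:

```java
import java.io.StringReader;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of boilerpipe.
final class HtmlInputSources {

    // Wraps HTML that has already been fetched (by any HTTP client) in a SAX
    // InputSource. The system ID is optional and only records the origin.
    static InputSource fromHtml(String html, String systemId) {
        InputSource source = new InputSource(new StringReader(html));
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        // Stand-in for a response body obtained from an HTTP client.
        String html = "<html><body><p>Hello</p></body></html>";
        InputSource source = fromHtml(html, "http://www.example.com/");
        // With boilerpipe on the classpath, this object is what would be
        // handed to getText(InputSource); that call is omitted here so the
        // sketch stays runnable with the JDK alone.
        System.out.println(source.getSystemId());
    }
}
```

Fetching the page yourself also means that the IDN conversion, redirects, timeouts and character encoding are handled in your own code rather than inside getText(URL).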

I would not recommend using getText(URL) in a production setup. Sooner or
later you will run into problems that are out of scope for boilerpipe
(robots.txt, broken servers, proxies, ...).

Marking as WontFix.

Best,
Christian

Original comment by ckkohl79 on 24 Jan 2010 at 4:09