It seems that UTF-8 is currently hardcoded as the only charset. In edu.uci.ics.crawler4j.crawler.Page:
this.html += Charset.forName("utf-8").decode(this.bBuf);
Original comment by 1969yuri...@gmail.com
on 18 Apr 2010 at 8:59
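For context, the decode in question is java.nio's Charset API. A minimal, hypothetical illustration of what a configurable variant would look like (the field names mirror the snippet above, but this class is not crawler4j's):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class PageSketch {
    private final ByteBuffer bBuf;  // raw page bytes, as in the snippet above
    private String html = "";

    PageSketch(ByteBuffer bBuf) {
        this.bBuf = bBuf;
    }

    // Instead of hardcoding "utf-8", the charset becomes a parameter.
    void decode(String charsetName) {
        this.html += Charset.forName(charsetName).decode(this.bBuf);
    }
}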
Yeah, I saw that after I posted this issue; let's treat it as another enhancement then.
I did not find a way to crawl ISO-8859-1 pages correctly, even when I tried to re-encode the UTF-8 text as ISO-8859-1.
Example:
// field is the variable to "re-encode"
CharBuffer cb = Charset.forName("ISO-8859-1")
        .decode(ByteBuffer.wrap(field.getBytes()));
field = cb.toString();
Original comment by andreas....@googlemail.com
on 18 Apr 2010 at 10:14
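For what it's worth, the root cause is that by the time field exists, the page bytes have already been decoded as UTF-8, so byte sequences that were not valid UTF-8 were mangled at that point; re-decoding the resulting String cannot restore them. Any fix has to decode the original response bytes with the right charset, roughly along these lines (a sketch with hypothetical names, not crawler4j code):

import java.nio.charset.Charset;

public class DecodeOnce {
    // Decode the raw, undecoded response bytes exactly once, with the page's
    // actual charset, instead of re-converting an already-decoded String.
    static String decodeBody(byte[] rawBytes, String charsetName) {
        return new String(rawBytes, Charset.forName(charsetName));
    }
}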
In the ideal case, the crawler should detect the character set automatically. This is not a trivial task, since the character set is not always specified in the meta tag. The solution is to statistically compare the characters you see against the characters of known character sets. Mozilla has an open source implementation of this, and in the past I had it included. But it comes with a little overhead, and since most people (including me) only crawl UTF-8 pages, I removed it.
So, if you only need to crawl a specific character set, I can put it in the config file.
Original comment by ganjisaffar@gmail.com
on 18 Apr 2010 at 6:02
Great! Since I am using a rather specific setup with a database (the seeds to crawl are stored there), for my case (and I guess for many others) it would be most elegant to add an optional parameter to the CrawlController.addSeed() method.
Example:
controller.addSeed("http://www.foo.com", "iso-8859-1");
controller.addSeed("http://www.bar.com", "utf-8");
controller.addSeed("http://www.foobar.com"); // assumes that it is utf-8
JFI, in case you don't know it: there is a Java port of Mozilla's character detection, although I did not check the quality of this implementation.
http://jchardet.sourceforge.net/
Original comment by andreas....@googlemail.com
on 19 Apr 2010 at 6:14
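As an aside, jchardet's detector is byte-oriented. The sketch below shows roughly how its nsDetector class is driven in the library's bundled demo; I have not verified it against the current release, so treat the exact API (Init, DoIt, DataEnd and the Notify callback) as an assumption to check:

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

public class JchardetSketch {
    // Rough usage sketch based on jchardet's demo code; verify the API
    // against the library before relying on it.
    static String detect(byte[] pageBytes, String fallback) {
        final String[] detected = {null};
        nsDetector det = new nsDetector();
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                detected[0] = charset; // invoked once the detector is confident
            }
        });
        det.DoIt(pageBytes, pageBytes.length, false);
        det.DataEnd(); // flush; may trigger Notify()
        return detected[0] != null ? detected[0] : fallback;
    }
}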
Your case is very specific and I'm not going to implement it that way. But I will try to add the charset detection within a week (I'm very busy right now). If you need it sooner, you can check out the source code and work on it yourself.
Original comment by ganjisaffar@gmail.com
on 19 Apr 2010 at 6:41
Modifying resources/crawler4j.properties helps, although the charset is still hardcoded there.
Original comment by Qiuyan...@gmail.com
on 14 Dec 2010 at 2:11
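For anyone following this workaround: the property key below is a placeholder for illustration, not necessarily the key actually used in crawler4j.properties. The point is only that a default charset can be read from the properties file instead of being hardcoded:

import java.io.InputStream;
import java.util.Properties;

public class CharsetConfig {
    // Reads a default charset from the classpath properties file.
    // "crawler.default_charset" is a hypothetical key used for illustration.
    static String defaultCharset() throws Exception {
        Properties props = new Properties();
        try (InputStream in = CharsetConfig.class
                .getResourceAsStream("/crawler4j.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        return props.getProperty("crawler.default_charset", "UTF-8");
    }
}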
I think the charset should be fetched from the response header first; if no charset is found there, fetch it from the meta tag, like below:
private static final String HTML_META_CHARSET_REGEX =
    "(<meta\\s*http-equiv\\s*=\\s*(\"|')content-type(\"|')\\s*content\\s*=\\s*(\"|')text/html;\\s*charset\\s*=\\s*(.*?)(\"|')/?>)";

if (charset == null) {
    charset = scraper.getConfiguration().getCharset();
}
Original comment by wanxiang.xing@gmail.com
on 19 Mar 2011 at 3:49
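For illustration, here is a rough, self-contained sketch of that header-then-meta order. The method and variable names are mine, not crawler4j's, and the meta regex is deliberately simplistic:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetGuess {
    // charset=... inside a Content-Type header value
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w\\-]+)", Pattern.CASE_INSENSITIVE);
    // charset=... inside a <meta ...> tag
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]+charset\\s*=\\s*[\"']?([\\w\\-]+)", Pattern.CASE_INSENSITIVE);

    static String guessCharset(String contentTypeHeader, String html, String defaultCharset) {
        if (contentTypeHeader != null) {
            Matcher m = HEADER_CHARSET.matcher(contentTypeHeader);
            if (m.find()) {
                return m.group(1);   // 1. the HTTP header wins
            }
        }
        if (html != null) {
            Matcher m = META_CHARSET.matcher(html);
            if (m.find()) {
                return m.group(1);   // 2. then the meta tag
            }
        }
        return defaultCharset;       // 3. otherwise the configured default
    }
}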
I have modified your source to automatically detect the encoding. It tries the HTTP header first, then the meta tag, and finally the xml tag (http://en.wikipedia.org/wiki/Character_encodings_in_HTML). If none of these checks succeeds, the default encoding is used.
It can be disabled by setting the option crawler.detect_encoding to false.
Here is the patch; you are free to use it as you will, but of course you might first check whether the design suits yours.
Original comment by SasaVi...@gmail.com
on 28 Apr 2011 at 10:51
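The xml-tag fallback mentioned above refers to the <?xml ... encoding="..."?> prolog. A small sketch of how that check might look (the names are illustrative, not taken from the patch):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlPrologCharset {
    // Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w\\-]+)[\"']", Pattern.CASE_INSENSITIVE);

    static String fromXmlProlog(String html) {
        Matcher m = XML_ENCODING.matcher(html);
        return m.find() ? m.group(1) : null;  // null means: fall back to the default encoding
    }
}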
Yes, your patch works well for me. I had similar problems with German pages, but now it works. Thanks for this patch.
Original comment by frank.ro...@gmail.com
on 25 Sep 2011 at 4:47
As of version 3.0, crawler4j automatically detects the encoding.
-Yasser
Original comment by ganjisaffar@gmail.com
on 2 Jan 2012 at 3:54
Original issue reported on code.google.com by
andreas....@googlemail.com
on 17 Apr 2010 at 4:30