seantanwh / crawler4j

Automatically exported from code.google.com/p/crawler4j

Failure on Non-UTF-8 pages #7

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Add a seed with non-utf-8 content
2. Start the crawl

What is the expected output? What do you see instead?
Expected: correct output / loading of ISO-8859-1 pages.
Instead: wrong characters.

What version of the product are you using? On what operating system?
1.8.1, Windows XP

Please provide any additional information below.
The crawler seems to work only with UTF-8 pages.

Original issue reported on code.google.com by andreas....@googlemail.com on 17 Apr 2010 at 4:30
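
For illustration, a minimal, self-contained sketch of the symptom described above (the "Müller" sample text is just an assumed example): bytes served as ISO-8859-1 but decoded as UTF-8, which crawler4j 1.8.1 effectively does, come out mangled.

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // The bytes of "Müller" as an ISO-8859-1 page would serve them (ü = 0xFC).
        byte[] iso = "Müller".getBytes(StandardCharsets.ISO_8859_1);
        // 0xFC is not valid UTF-8, so decoding as UTF-8 yields a replacement character (U+FFFD).
        System.out.println(new String(iso, StandardCharsets.UTF_8));      // prints M�ller
        System.out.println(new String(iso, StandardCharsets.ISO_8859_1)); // prints Müller
    }
}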

GoogleCodeExporter commented 8 years ago
It seems that utf-8 is currently hardcoded as the only charset. 

edu.uci.ics.crawler4j.crawler.Page

this.html += Charset.forName("utf-8").decode(this.bBuf);

Original comment by 1969yuri...@gmail.com on 18 Apr 2010 at 8:59
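
A minimal sketch (not crawler4j code) of how that decode could take a configurable charset name instead of hardcoding "utf-8"; the PageDecoder helper and its fallback policy are assumptions.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageDecoder {
    private PageDecoder() {}

    // Decodes the raw page bytes with the configured charset name, falling back to UTF-8
    // when no charset is configured or the name is unknown.
    static String decode(ByteBuffer raw, String charsetName) {
        Charset cs = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                cs = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the UTF-8 default.
            }
        }
        return cs.decode(raw).toString();
    }
}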

GoogleCodeExporter commented 8 years ago
Yeah, I saw that after I posted this issue, so let's treat it as an enhancement.
I did not find a way to crawl ISO-8859-1 pages correctly, even when I try to re-encode the UTF-8 text as ISO-8859-1.

Example:
// field is the variable to "re-encode"
CharBuffer cb = Charset.forName("ISO-8859-1").newDecoder().decode(ByteBuffer.wrap(field.getBytes()));
field = cb.toString();

Original comment by andreas....@googlemail.com on 18 Apr 2010 at 10:14

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
In the ideal case, the crawler should detect the character set automatically. This is not a trivial task: the character set is not always specified in the meta tag. The solution is to statistically compare the characters you see with the characters of known character sets. Mozilla has an open-source implementation of this, and in the past I had it included. But it comes with a little overhead, and since most people (including me) only crawl UTF-8 pages, I removed it.

So, if you only need to crawl a specific character set, I can put it in the config file.

Original comment by ganjisaffar@gmail.com on 18 Apr 2010 at 6:02

GoogleCodeExporter commented 8 years ago
Great! Since I am using a quite specific implementation backed by a database (with the seeds to crawl stored there, of course), for my case (and I guess in many other cases) it would be most elegant to add an optional parameter to the CrawlController.addSeed() method.

Example:
controller.addSeed("http://www.foo.com", "iso-8859-1");
controller.addSeed("http://www.bar.com", "utf-8");
controller.addSeed("http://www.foobar.com"); // assumes that it is utf-8

JFYI, in case you don't know it: there is a Java port of Mozilla's character detection. I did not check the quality of this implementation.

http://jchardet.sourceforge.net/

Original comment by andreas....@googlemail.com on 19 Apr 2010 at 6:14
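
As a rough illustration of how such a Mozilla-style detector is typically driven, here is a sketch using the related juniversalchardet port (org.mozilla.universalchardet.UniversalDetector), which exposes a similar streaming API. This is not jchardet's exact API and not part of crawler4j.

import org.mozilla.universalchardet.UniversalDetector;

public class CharsetGuess {
    // Returns the detected charset name (e.g. "ISO-8859-1", "UTF-8"), or null if detection fails.
    public static String detect(byte[] content) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(content, 0, content.length);
        detector.dataEnd();
        String charset = detector.getDetectedCharset();
        detector.reset();
        return charset;
    }
}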

GoogleCodeExporter commented 8 years ago
Your case is very specific and I'm not going to implement it that way. But I will try to add the charset detection within a week (I'm very busy right now). If you need it sooner, you can check out the source code and work on it yourself.

Original comment by ganjisaffar@gmail.com on 19 Apr 2010 at 6:41

GoogleCodeExporter commented 8 years ago
Modifying resources/crawler4j.properties helps, although the charset is still hardcoded there.

Original comment by Qiuyan...@gmail.com on 14 Dec 2010 at 2:11

GoogleCodeExporter commented 8 years ago
I think the charset should be fetched from the response header; if no charset is found there, fetch it from the meta tag, for example like this:

private static final String HTML_META_CHARSET_REGEX =
        "(<meta\\s*http-equiv\\s*=\\s*(\"|')content-type(\"|')\\s*content\\s*=\\s*(\"|')text/html;\\s*charset\\s*=\\s*(.*?)(\"|')/?>)";

if (charset == null) {
    charset = scraper.getConfiguration().getCharset();
}

Original comment by wanxiang.xing@gmail.com on 19 Mar 2011 at 3:49
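
A minimal sketch of the order suggested above: HTTP header first, then the meta tag, then a configured default. The helper names and the simplified regex are assumptions, not crawler4j code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromHeaders {
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']content-type[\"']\\s+content\\s*=\\s*[\"']text/html;\\s*charset\\s*=\\s*(.*?)[\"']\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // contentTypeHeader is the raw Content-Type response header, e.g. "text/html; charset=ISO-8859-1".
    static String detect(String contentTypeHeader, String html, String defaultCharset) {
        // 1. HTTP response header
        if (contentTypeHeader != null) {
            int idx = contentTypeHeader.toLowerCase().indexOf("charset=");
            if (idx >= 0) {
                return contentTypeHeader.substring(idx + "charset=".length()).trim();
            }
        }
        // 2. <meta http-equiv="Content-Type" ...> tag in the document
        Matcher m = META_CHARSET.matcher(html);
        if (m.find()) {
            return m.group(1).trim();
        }
        // 3. configured default
        return defaultCharset;
    }
}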

GoogleCodeExporter commented 8 years ago
I have modified your source to automatically detect the encoding. It tries the HTTP header first, then the meta tag, and finally the XML declaration (http://en.wikipedia.org/wiki/Character_encodings_in_HTML). If none of these checks succeed, the default encoding is used.

It can be disabled by setting the option crawler.detect_encoding to false.

Here is the patch; you are free to use it as you wish, but of course you might first check whether the design suits yours.

Original comment by SasaVi...@gmail.com on 28 Apr 2011 at 10:51

Attachments:

GoogleCodeExporter commented 8 years ago
Yes, your patch works well for me. I had similar problems with German pages, but now it works. Thanks for this patch.

Original comment by frank.ro...@gmail.com on 25 Sep 2011 at 4:47

GoogleCodeExporter commented 8 years ago
As of version 3.0, crawler4j automatically detects the encoding.

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 3:54