saadii / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Create an auto encoding solution that detects the decoding of each page based on headers #112

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
- Create an auto-encoding solution that detects the encoding of each page from the HTTP headers sent or from meta encoding declarations (http://www.w3.org/International/questions/qa-html-encoding-declarations).
- Provide a configurable default encoding, which may be something other than utf-8, to use when no encoding information can be found in the page HTML or headers.

Original issue reported on code.google.com by sjdir...@gmail.com on 8 Jul 2013 at 2:55
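
The two requests above can be sketched roughly as follows. This is a minimal illustration, not Abot's implementation: `EncodingSniffer`, `Detect`, and their parameters are hypothetical names, and the sketch assumes the page body is already available as bytes. It checks the Content-Type header first, then scans the first 1024 bytes of the body (where the HTML specification expects the meta declaration to appear) for a charset value, and finally falls back to a caller-supplied default. Byte order marks and UTF-16 pages are not handled.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Abot: picks an encoding for a downloaded page.
public static class EncodingSniffer
{
    // Matches charset=... in a Content-Type header value or in a <meta> tag.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Precedence: HTTP header, then meta declaration, then the configured default.
    public static Encoding Detect(string contentTypeHeader, byte[] body, Encoding defaultEncoding)
    {
        Encoding fromHeader = TryGetEncoding(ExtractCharset(contentTypeHeader));
        if (fromHeader != null) return fromHeader;

        if (body != null && body.Length > 0)
        {
            // Assumption: the declaration is written in ASCII-compatible bytes,
            // so an ASCII decode of the prefix is enough to find it.
            int prefixLength = Math.Min(body.Length, 1024);
            string head = Encoding.ASCII.GetString(body, 0, prefixLength);
            Encoding fromMeta = TryGetEncoding(ExtractCharset(head));
            if (fromMeta != null) return fromMeta;
        }

        return defaultEncoding;
    }

    private static string ExtractCharset(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match match = CharsetPattern.Match(text);
        return match.Success ? match.Groups[1].Value : null;
    }

    private static Encoding TryGetEncoding(string name)
    {
        if (string.IsNullOrEmpty(name)) return null;
        try { return Encoding.GetEncoding(name); }
        catch (ArgumentException) { return null; } // unknown or malformed charset name
    }
}

public static class Program
{
    public static void Main()
    {
        byte[] body = Encoding.ASCII.GetBytes(
            "<html><head><meta charset=\"iso-8859-1\"></head></html>");

        // Header wins when it names a charset.
        Console.WriteLine(EncodingSniffer.Detect("text/html; charset=utf-8", body, Encoding.UTF8).WebName);
        // No charset in the header: the meta declaration is used.
        Console.WriteLine(EncodingSniffer.Detect("text/html", body, Encoding.UTF8).WebName);
        // Nothing found anywhere: the configured default is returned.
        Console.WriteLine(EncodingSniffer.Detect(null, new byte[0], Encoding.Unicode).WebName);
    }
}
```

Returning the default instead of throwing on an unknown charset name keeps a single badly labelled page from aborting a crawl; whether that is the right trade-off depends on how the crawler reports errors.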

GoogleCodeExporter commented 9 years ago
See the forum for the discussion of this encoding topic.

Original comment by sjdir...@gmail.com on 8 Jul 2013 at 2:56

GoogleCodeExporter commented 9 years ago
The title should read "encoding of each page".

Original comment by sjdir...@gmail.com on 8 Jul 2013 at 3:12

GoogleCodeExporter commented 9 years ago
miljours@.....com

I found this solution when I had a page with charset iso-8859-1.
// Fixes Latin charset detection when reading the response stream.
using (StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(response.CharacterSet)))

This is in PageRequester.cs, line 131. I don't know if it is useful to share.

Original comment by sjdir...@gmail.com on 18 Jul 2013 at 4:19

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 3 Sep 2013 at 1:49

GoogleCodeExporter commented 9 years ago
Added auto encoding and CrawledPage.Content.Bytes, which should allow data to be written to a file stream without corruption.

Original comment by sjdir...@gmail.com on 17 Sep 2013 at 2:43
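
To illustrate why access to the raw bytes matters, here is a small sketch under stated assumptions: the `downloaded` array stands in for whatever CrawledPage.Content.Bytes returns, and `RawBytesDemo` and the temporary file names are made up for the example. It compares writing the bytes as received with decoding them under the wrong encoding and re-encoding.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class RawBytesDemo
{
    public static void Main()
    {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
        // Stand-in for the raw page bytes; not fetched from a real crawl.
        byte[] downloaded = { 0x63, 0x61, 0x66, 0xE9 };

        string rawPath = Path.Combine(Path.GetTempPath(), "page-raw.bin");
        string textPath = Path.Combine(Path.GetTempPath(), "page-text.bin");

        // Writing the raw bytes keeps the file identical to what the server sent.
        File.WriteAllBytes(rawPath, downloaded);

        // Decoding with the wrong encoding is lossy: a lone 0xE9 is not valid
        // UTF-8, so it is replaced by U+FFFD and re-encoded as three bytes.
        string misdecoded = Encoding.UTF8.GetString(downloaded);
        File.WriteAllBytes(textPath, Encoding.UTF8.GetBytes(misdecoded));

        Console.WriteLine(File.ReadAllBytes(rawPath).SequenceEqual(downloaded));  // True
        Console.WriteLine(File.ReadAllBytes(textPath).SequenceEqual(downloaded)); // False

        File.Delete(rawPath);
        File.Delete(textPath);
    }
}
```

The same reasoning applies to binary downloads such as images, where any round trip through a string risks altering the data.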

GoogleCodeExporter commented 9 years ago
I'm getting errors with the encoding.
How can I tell my crawler to use Encoding.UTF7 as the encoding?

Original comment by TysH...@gmail.com on 16 Oct 2013 at 11:17

GoogleCodeExporter commented 9 years ago
This is not released yet. In the meantime, these two threads describe workarounds:

https://groups.google.com/forum/#!topic/abot-web-crawler/lIGxg0oPmTc
https://groups.google.com/forum/#!topic/abot-web-crawler/-U9MDiSBbGM

Original comment by sjdir...@gmail.com on 17 Oct 2013 at 6:57

GoogleCodeExporter commented 9 years ago
Issue 123 has been merged into this issue.

Original comment by sjdir...@gmail.com on 3 Jan 2014 at 3:06