stewartmckee / cobweb

Web crawler with very flexible crawling options. Can be used standalone or with Resque to perform clustered crawls.
MIT License

Encoding problems #21

Closed: wuiscmc closed 10 years ago

wuiscmc commented 10 years ago

Regardless of whether I use Sidekiq or Resque, I always get this error:

crawl_id: fdc9cd1655a54b3d303e2f38a916cc114c9be2c7
url: https://github.com/stewartmckee/cobweb/blob/master/.ruby-version
processing_queue: CrawlerResqueJob
crawl_finished_queue: CrawlerFinishedJob
internal_urls:
- https://github.com/stewartmckee/cobweb/blob/master/*
debug: true
raise_exceptions: true
redis_options:
  host: localhost
  port: '6379'
use_encoding_safe_process_job: false
follow_redirects: true
redirect_limit: 10
queue_system: resque
quiet: true
cache: 300
cache_type: crawl_based
timeout: 10
external_urls: []
seed_urls: []
first_page_redirect_internal: true
text_mime_types:
- text/*
- application/xhtml+xml
obey_robots: false
user_agent: cobweb/1.0.18 (ruby/1.9.3 nokogiri/1.6.0)
valid_mime_types:
- ! '*/*'
store_inbound_links: false
crawl_limit_by_page: false
parent: https://github.com/stewartmckee/cobweb/blob/master/
Exception
Encoding::UndefinedConversionError
Error
"\xC2" from ASCII-8BIT to UTF-8

The only workaround I have found to make this crawler work is to run it from inside Rails, which is a pity since I planned to build a service (without Rails) that integrates this crawler into my project.

Sidekiq doesn't work from inside Rails either...

On the other hand, with Resque this error does not occur when the encoding flag is set, but then the process job is not executed.
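
For reference, the error message quoted above can be reproduced in plain Ruby without cobweb, Resque, or Sidekiq. The following is a minimal sketch, assuming the page body reaches the job as a binary (ASCII-8BIT) string that something later transcodes to UTF-8; the sample bytes and variable names are illustrative and are not taken from cobweb's source.

```ruby
# Bytes of "© 2014" in UTF-8, but tagged as binary (ASCII-8BIT), which is
# how raw bytes read from a socket are usually labelled in Ruby.
body = "\xC2\xA9 2014".b

error = nil
begin
  # Transcoding from binary fails: bytes >= 0x80 have no defined mapping.
  body.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error = e
end
puts "#{error.class}: #{error.message}"

# Relabelling the same bytes as UTF-8 changes no bytes and raises nothing.
relabelled = body.dup.force_encoding("UTF-8")
puts relabelled.valid_encoding?
```

The difference is that `encode` converts bytes from one encoding to another, while `force_encoding` only changes the label attached to bytes that are already in the target encoding.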

wuiscmc commented 10 years ago

Seems to be working fine in https://github.com/stewartmckee/cobweb/pulls

hallmatt commented 9 years ago

I'm having the same issue, but within Rails. The error reads:

"\xEF" from ASCII-8BIT to UTF-8

The exception is:

Encoding::UndefinedConversionError

Any thoughts?

stewartmckee commented 9 years ago

This is usually because the character encoding specified by the server (either in the headers or in the content itself) does not match the content that is actually on the page. We should probably add a bit more logic around how this state is handled. Is the URL you are requesting public? Could you post it for me to have a look at?
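
One way the mismatch described above could be handled defensively is sketched below. This is only an illustration under stated assumptions: `to_utf8` is a hypothetical helper, not part of cobweb's API, and it assumes Ruby 2.1 or later (for `String#scrub`) and that the caller has already extracted the declared charset from the headers or markup. It prefers UTF-8 when the bytes are valid UTF-8, falls back to the declared charset, and finally replaces whatever cannot be decoded.

```ruby
# Hypothetical helper (not cobweb's API): return the body as valid UTF-8.
def to_utf8(body, declared_charset = nil)
  # 1. If the bytes are already valid UTF-8, trust them over the declaration.
  utf8 = body.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # 2. Otherwise try the charset the server declared, if Ruby knows it.
  if declared_charset
    begin
      declared = body.dup.force_encoding(Encoding.find(declared_charset))
      if declared.valid_encoding?
        return declared.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
      end
    rescue ArgumentError, EncodingError
      # Unknown charset label or no converter available: fall through.
    end
  end

  # 3. Last resort: keep what decodes as UTF-8 and replace the rest.
  utf8.scrub
end

# Latin-1 bytes with a correct declaration are transcoded.
puts to_utf8("caf\xE9".b, "ISO-8859-1")
# UTF-8 bytes with a wrong declaration are kept as UTF-8.
puts to_utf8("\xC2\xA9".b, "ISO-8859-1")
```

Checking UTF-8 first is a heuristic: single-byte charsets such as ISO-8859-1 accept every byte sequence, so validity under the declared charset alone cannot reveal a mismatch.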

hallmatt commented 9 years ago

Sounds great. I'll send you a link via email. It is a public site.

SimonBirrell commented 8 years ago

I'm getting this too:

"\xC3" from ASCII-8BIT to UTF-8

for the URL

http://www.segurocontraroubo.com.br/wp-content/themes/segurocontraroubo/javascripts/add.js