jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...
https://github.com/metainspector/metainspector
MIT License

ArgumentError: invalid byte sequence in UTF-8 #187

Closed: tak1n closed this issue 8 years ago

tak1n commented 8 years ago

Backtrace:

/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/request.rb:31:in `tr'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/request.rb:31:in `read'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/document.rb:104:in `document'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/document.rb:78:in `to_s'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/parser.rb:33:in `parsed'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/parser.rb:20:in `initialize'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/document.rb:39:in `new'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector/document.rb:39:in `initialize'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector.rb:20:in `new'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/metainspector-5.2.2/lib/meta_inspector.rb:20:in `new'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/bundler/gems/meta-3a774f370db8/lib/onlim/meta/parser.rb:26:in `initialize'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/bundler/gems/meta-3a774f370db8/lib/onlim/meta/parser.rb:9:in `new'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/bundler/gems/meta-3a774f370db8/lib/onlim/meta/parser.rb:9:in `extract'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/bundler/gems/meta-3a774f370db8/lib/onlim/meta.rb:16:in `parse'
/home/benny/dev/onlim/app/app/services/suggestions/rss/extract.rb:13:in `call'
/home/benny/dev/onlim/app/app/models/rss_content_source.rb:17:in `block (3 levels) in lookup'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/activerecord-4.2.6/lib/active_record/connection_adapters/abstract/connection_pool.rb:292:in `with_connection'
/home/benny/dev/onlim/app/app/models/rss_content_source.rb:16:in `block (2 levels) in lookup'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/safe_task_executor.rb:24:in `block in execute'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/synchronization/mri_lockable_object.rb:38:in `block in synchronize'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/synchronization/mri_lockable_object.rb:38:in `synchronize'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/synchronization/mri_lockable_object.rb:38:in `synchronize'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/safe_task_executor.rb:19:in `execute'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/ivar.rb:170:in `safe_execute'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/future.rb:52:in `block in execute'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:348:in `run_task'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:337:in `block (3 levels) in create_worker'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `loop'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (2 levels) in create_worker'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:319:in `catch'
/home/benny/dev/onlim/app/.gem/ruby/2.3.1/gems/concurrent-ruby-1.0.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:319:in `block in create_worker'

Reproducible with:

MetaInspector.new('http://www.dnevnik.bg/sviat/2016/08/11/2810035_konstitucionnata_kriza_v_polsha_se_zadulbochava/?ref=rss')

I think the best bet would be to allow an already parsed Nokogiri document as an argument to MetaInspector.new. Otherwise, if there is no way to inject a custom document or preprocess the HTML somehow, MetaInspector has to deal with all of these cases itself, which I think can get really annoying.

If allowing a Nokogiri document or an HTML string as an argument to MetaInspector.new is a desired option, I will happily put together a pull request for it.

tak1n commented 8 years ago

Whoops, I totally missed it:

You can also include the html which will be used as the document to scrape:

page = MetaInspector.new("http://sitevalidator.com",
                         :document => "<html>...</html>")

Sorry for the noise.
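
For anyone landing here with the same error, a minimal sketch of the "preprocess the HTML" idea combined with the `:document` option could look like the following. It assumes you fetch the raw body yourself and that replacing undecodable bytes with `?` is acceptable for your use case. `scrub_html` is a hypothetical helper name, not part of MetaInspector, and the sample markup is made up for illustration.

```ruby
# Hypothetical helper (not part of MetaInspector): returns a copy of the
# given HTML tagged as UTF-8, with every invalid byte sequence replaced by "?".
def scrub_html(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub("?")
end

# Made-up sample body: a lone "\xE9" byte is not valid UTF-8.
raw   = "<title>caf\xE9</title>".b
clean = scrub_html(raw)

puts clean.valid_encoding?   # the scrubbed copy is safe for String#tr and friends
puts clean

# With the gem installed, the cleaned string can then be handed over
# through the :document option quoted above, for example:
#   page = MetaInspector.new("http://sitevalidator.com", :document => clean)
```

`String#scrub` only discards the bytes it cannot decode. If the page is really served in another encoding, transcoding from that encoding with `String#encode` would preserve the text instead of replacing it.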