jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...
https://github.com/metainspector/metainspector
MIT License
1.03k stars, 165 forks

Unicode Normalization not appropriate for ASCII-8BIT (Encoding::CompatibilityError) #414

Open specter78 opened 10 months ago

specter78 commented 10 months ago

https://apps.apple.com/us/app/id1501495423 redirects to https://apps.apple.com/us/app/space-marshals-3/id1501495423.

However, running metainspector on https://apps.apple.com/us/app/id1501495423 gives the above error, whereas metainspector is able to parse the response of https://apps.apple.com/us/app/space-marshals-3/id1501495423.
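For reference, the exception named in the title can be reproduced in plain Ruby, independent of MetaInspector or Faraday. This is only a sketch of how the error class arises when Unicode normalization is applied to a binary (ASCII-8BIT) string. It makes no claim about where in the request path the error is triggered, and the helper name `normalize_or_error` is hypothetical.

```ruby
# Plain-Ruby illustration only; no HTTP request is made here.
# normalize_or_error is a hypothetical helper, not part of any library.
def normalize_or_error(str)
  str.unicode_normalize
rescue Encoding::CompatibilityError => e
  # Return the exception instead of raising so both cases can be compared.
  e
end

# String#b returns a copy of the string tagged as ASCII-8BIT (binary).
binary = "space-marshals-3".b
# The same bytes, re-tagged as UTF-8.
utf8 = binary.dup.force_encoding(Encoding::UTF_8)

binary_result = normalize_or_error(binary) # an Encoding::CompatibilityError
utf8_result   = normalize_or_error(utf8)   # the normalized string
```

The bytes are identical in both cases; only the encoding tag on the string differs.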

jaimeiniesta commented 10 months ago

Thanks for the report @specter78

I'm currently super busy but I'll be happy to review a PR.

divagueame commented 9 months ago

Hi, I was looking at this bug, and it seems to me that the issue stems from Faraday itself, not from MetaInspector. The same behaviour occurs when running this:


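        require 'faraday'
        require 'faraday/follow_redirects'
        require 'faraday-cookie_jar'

        # Build a connection that follows redirects and keeps cookies between hops
        session = Faraday.new(url: 'https://apps.apple.com/us/app/id1501495423') do |faraday|
          faraday.use Faraday::FollowRedirects::Middleware
          faraday.use :cookie_jar
        end

        session.get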

The same bug happens with any other URL that redirects, for example: Faraday.new(url: 'https://nytimes.com') do |faraday| ...

The same code runs in lib/metainspector/request.rb:

        if @allow_redirections
          follow_redirects_options[:limit] ||= 10
          faraday.use Faraday::FollowRedirects::Middleware, **follow_redirects_options
          faraday.use :cookie_jar
        end

Ideally this should be fixed in Faraday, but in MetaInspector it would be possible to catch the error and handle it appropriately when/if it happens.
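A minimal sketch of what catching the error could look like, assuming the request is wrapped in a small helper. Both `FetchEncodingError` and `fetch_with_encoding_guard` are hypothetical names used for illustration; they are not part of MetaInspector or Faraday, and this is not how MetaInspector handles errors today.

```ruby
# Hypothetical error class for illustration; not defined by MetaInspector.
class FetchEncodingError < StandardError; end

# Hypothetical helper: runs the given block (which would perform the request)
# and converts Encoding::CompatibilityError into a more descriptive error.
def fetch_with_encoding_guard(url)
  yield url
rescue Encoding::CompatibilityError => e
  raise FetchEncodingError, "could not process response for #{url}: #{e.message}"
end

# Usage sketch (assumes `session` is a Faraday connection as in the snippet above):
#   fetch_with_encoding_guard(url) { |u| session.get(u) }
```

Re-raising under a dedicated class lets callers distinguish this failure from network errors, while a fix in Faraday would make the guard unnecessary.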

jaimeiniesta commented 9 months ago

Thanks for shedding some light on this @divagueame

I'm still super busy but if someone could take care of a PR, for Faraday or MetaInspector, I'll do my best to review it.

tisba commented 8 months ago

This issue has been fixed upstream, thanks for reporting @divagueame 🙏