jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...
https://github.com/metainspector/metainspector
MIT License
1.03k stars 165 forks

Absolutify base href #240

Open jaimeiniesta opened 5 years ago

jaimeiniesta commented 5 years ago

Some pages like https://www.delta.com/us/en have a relative base href tag:

<base href="/">

This makes the scraping fail because we expect it to be an absolute URL.

To fix this, we should also absolutify this base href with the url of the scraped page. If the base href was already an absolute one, it won't get changed.
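As a rough illustration of the idea, here is a minimal sketch using only Ruby's standard `uri` library. The helper name `absolutify_base_href` is hypothetical and is not part of MetaInspector; it assumes that a blank base href falls back to the page URL, and that an absolute base href comes back unchanged because `URI.join` ignores the base when the reference is already absolute.

```ruby
require "uri"

# Hypothetical helper (not MetaInspector's API): resolve a <base href>
# value against the URL of the scraped page.
def absolutify_base_href(base_href, page_url)
  href = base_href.to_s.strip
  # Assumption: a missing or blank base href means "use the page URL".
  return page_url if href.empty?

  # URI.join resolves a relative reference against page_url and leaves
  # an already-absolute reference as it is.
  URI.join(page_url, href).to_s
end

puts absolutify_base_href("/", "https://www.delta.com/us/en")
puts absolutify_base_href("https://cdn.example.com/assets/", "https://www.delta.com/us/en")
```

With this approach the `<base href="/">` from the Delta page would resolve to the site root rather than being rejected as a non-absolute URL.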

navarasu commented 3 years ago

We could treat this / like an empty base href and handle it the same way as we did here. Something like this:

def base_url
   # Treat a missing, blank, or "/" base href as if there were none
   current_base_href = ['/', ''].include?(base_href.to_s.strip) ? nil : base_href
   current_base_href || url
end

Please share your thoughts.

jaimeiniesta commented 3 years ago

No, I don't think a base href of "/" should be treated as an empty one. They mean different things: if it is empty, it needs to be ignored, but if it says "/", the document author is saying that relative links should be built from the root directory. For example:

Let's say there's a page http://example.com/some/dir/first.html and it has a link:

<a href="second.html">Second page</a>

When there is no base href (or it is empty and we ignore it), this relative link will be absolutified as http://example.com/some/dir/second.html

Instead, if the base href is / it should be treated as if it was http://example.com/ so the absolutified link would be http://example.com/second.html

If the base href was /other/ (note the trailing slash) then the absolutified link would be http://example.com/other/second.html
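The cases above can be checked with Ruby's standard `uri` library. This is only a sketch: `resolve_link` is a hypothetical helper, not MetaInspector code, and it assumes a blank base href is ignored in favour of the page URL. One detail worth noting is that standard URL resolution replaces the last path segment of the base, so `/other` and `/other/` behave differently.

```ruby
require "uri"

# Hypothetical helper (not MetaInspector's API): resolve a link the way a
# browser would, taking an optional <base href> into account.
def resolve_link(page_url, base_href, link)
  href = base_href.to_s.strip
  # Assumption: a missing or blank base href is ignored.
  base = href.empty? ? page_url : URI.join(page_url, href).to_s
  URI.join(base, link).to_s
end

page = "http://example.com/some/dir/first.html"

puts resolve_link(page, nil, "second.html")       # http://example.com/some/dir/second.html
puts resolve_link(page, "/", "second.html")       # http://example.com/second.html
puts resolve_link(page, "/other/", "second.html") # http://example.com/other/second.html
puts resolve_link(page, "/other", "second.html")  # http://example.com/second.html
```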

navarasu commented 3 years ago

Yeah, that makes sense. In that case, I think the changes below will solve this:

def base_url
   current_base_href = base_href.to_s.strip.empty? ? nil : URL.absolutify(base_href, URL.new(url).root_url)
   current_base_href || url
end

navarasu commented 3 years ago

@jaimeiniesta Please check this PR. For the time being, I have overridden this method in my project to work around the failure.