jsumners / feedparser

Automatically exported from code.google.com/p/feedparser

Relative links are not resolved in a feed that contains no base URI information #415

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

>>> import feedparser as p 
>>> d = p.parse('http://intertwingly.net/blog/index.atom')
>>> d.feed.link
u'http://intertwingly.net/blog/'

And this is good. Now try this:

$ curl http://intertwingly.net/blog/index.atom > f.xml

And back to Python interpreter:

>>> d = p.parse('f.xml')
>>> d.feed.link
u'/blog/'

What is the expected output? What do you see instead?

I see:

u'/blog/'

It should be:

u'http://intertwingly.net/blog/'

What version of the product are you using? On what operating system?

5.1.3 

Please provide any additional information below.

I think feedparser should resolve the relative link regardless of the fact that 
the feed was loaded from a file. Am I missing something? 

Thanks!

Original issue reported on code.google.com by and...@passiomatic.com on 11 Oct 2013 at 9:21

GoogleCodeExporter commented 9 years ago
No, the XML file does not contain any explicit base URI that feedparser can use 
to resolve the relative links. I'm under the impression that it's poor form to 
use relative URIs, but as this is Sam Ruby's site I'm surprised by this -- it 
makes me think I'm missing something.

At any rate, my understanding is that the XML would need to include an explicit 
base URI to guarantee that the downloaded file's relative URIs get resolved 
correctly. The file doesn't appear to have a base URI set, and for that reason 
feedparser isn't resolving the relative URIs.

Original comment by kurtmckee on 10 Jul 2014 at 5:01
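
The difference between the two cases in the report can be illustrated with the standard library alone. This is a minimal sketch of generic relative-URI resolution using `urllib.parse.urljoin`, not feedparser's own code; the helper name `resolve_link` is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(base_uri, link):
    """Resolve `link` against `base_uri` using standard relative-URI rules."""
    return urljoin(base_uri, link)


# Fetched over HTTP: the retrieval URL can serve as the base URI.
remote = resolve_link("http://intertwingly.net/blog/index.atom", "/blog/")

# Loaded from a local file: the base carries no scheme or host,
# so there is nothing to make the link absolute.
local = resolve_link("f.xml", "/blog/")

print(remote)
print(local)
```

With an absolute base the link becomes absolute; with only a bare file name as the base it stays relative, which matches the `u'/blog/'` result reported above.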

GoogleCodeExporter commented 9 years ago
So, if I understand your comment correctly, when feedparser fetches the Sam 
Ruby feed over HTTP it falls back to this case: 

"...the URL used to retrieve the feed itself is the default base URI for all 
relative links within the feed. If the feed was retrieved via an HTTP redirect 
(any HTTP 3xx status code), then the final URL of the feed is the default base 
URI."

There are, admittedly, few blogs that insist on using relative links in posts. 
My hope was to fake the Content-Location header by passing it to the parse 
function via response_headers (or request_headers?). 

Basically, I'm in a scenario where I have a bunch of feeds already downloaded 
via the wonderful Requests package, which is far more robust when it comes to 
fetching web resources. 

I would like to know if the response_headers param is supposed to be used that 
way.

Thanks.

Original comment by and...@passiomatic.com on 10 Jul 2014 at 6:50

GoogleCodeExporter commented 9 years ago
It's supposed to be possible to use the `response_headers` parameter to pass in 
what the requests module returns. If that doesn't work as expected please open 
a ticket! =)

Original comment by kurtmckee on 10 Jul 2014 at 10:36
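
The approach discussed above might look like the following sketch. It assumes that feedparser uses a `Content-Location` entry in `response_headers` as the base URI when the feed is passed in as already-downloaded content; the thread does not confirm this, so it should be checked against the installed version. The helper names `with_content_location` and `parse_downloaded_feed` are hypothetical.

```python
def with_content_location(headers, final_url):
    """Return a copy of `headers` with lowercase keys and Content-Location set."""
    merged = {key.lower(): value for key, value in headers.items()}
    # Assumption: feedparser treats this header as the feed's base URI.
    merged["content-location"] = final_url
    return merged


def parse_downloaded_feed(body, headers, final_url):
    """Parse feed content that was fetched elsewhere (for example with Requests)."""
    # Third-party package, imported lazily so the helper above works without it.
    import feedparser

    return feedparser.parse(
        body,
        response_headers=with_content_location(headers, final_url),
    )
```

With a Requests response `r`, the call would be `parse_downloaded_feed(r.content, dict(r.headers), r.url)`, where `r.url` is the final URL after any redirects.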