lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

How to add custom headers to the reader? #336

Closed Scylla2020 closed 5 months ago

Scylla2020 commented 5 months ago

How do I add headers that can be used by the underlying requests session? I currently have:

from reader import make_reader

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
}

reader = make_reader("db.sqlite")

feed_url = "https://nitter.poast.org/elonmusk/rss"

# Add the feed before updating it; note that `headers` is not passed to reader anywhere yet.
reader.add_feed(feed_url, exist_ok=True)
reader.update_feeds()

feed = reader.get_feed(feed_url)
print(feed)

It fails, and I get a long response, part of which is:

Feed added successfully. Feeds updated successfully. Feed(url='https://nitter.poast.org/elonmusk/rss', updated=None, title=None, link=None, author=None, subtitle=None, version=None, user_title=None, added=datetime.datetime(2024, 6, 14, 21, 24, 3, 840331, tzinfo=datetime.timezone.utc), last_updated=None, last_exception=ExceptionInfo(type_name='reader.exceptions.ParseError', value_str="bad HTTP status code: 'https://nitter.poast.org/elonmusk/rss': requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://nitter.poast.org/elonmusk/rss",

When I use the requests library directly with those headers, I get the correct response, so I'm not sure how to add the headers to reader.

lemon24 commented 5 months ago

Check out SessionWrapper.request_hooks and RequestHook:

>>> from reader import make_reader
>>> reader = make_reader('')
>>> reader.add_feed('http://localhost:8080')
>>> 
>>> def hook(session, request, **kwargs):
...     request.headers.setdefault('custom', 'header')
... 
>>> reader._parser.session_factory.request_hooks.append(hook)
>>> reader.update_feed('http://localhost:8080')

Note the custom header received by nc:

$ echo 'HTTP/1.1 304' | nc -l localhost 8080
GET / HTTP/1.1
Host: localhost:8080
User-Agent: python-reader/3.13.dev0 (+https://github.com/lemon24/reader)
Accept-Encoding: gzip, deflate
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/feed+json,application/xml;q=0.9,application/json;q=0.9,text/xml;q=0.2,*/*;q=0.1
Connection: keep-alive
A-IM: feed
custom: header
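
Applied to the headers from the question, the same idea could look roughly like the sketch below. `make_header_hook` and `browser_headers` are hypothetical names, not part of reader. Unlike the `setdefault` call above, this version overwrites existing headers and matches names case-insensitively, on the assumption that the goal is to replace reader's defaults (such as its User-Agent) rather than only fill in missing ones. A stand-in object is used in place of a real request so the snippet runs without a network or a reader install.

```python
from types import SimpleNamespace


def make_header_hook(headers):
    """Return a request hook that forces the given headers onto each request.

    Hypothetical helper, not part of reader; the hook signature
    (session, request, **kwargs) follows the example above.
    """
    def hook(session, request, **kwargs):
        for name, value in headers.items():
            # Remove any existing header with the same name in another case,
            # so that e.g. 'User-Agent' and 'user-agent' do not coexist.
            for existing in [k for k in request.headers if k.lower() == name.lower()]:
                del request.headers[existing]
            request.headers[name] = value
        # Returning None, as the hook above does.

    return hook


# A subset of the headers from the question.
browser_headers = {
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
}
hook = make_header_hook(browser_headers)

# Stand-in for the request object the hook receives (assumption: it exposes
# a mutable `headers` mapping, as in the example above).
fake_request = SimpleNamespace(
    headers={'User-Agent': 'python-reader', 'Accept': 'application/rss+xml'}
)
hook(None, fake_request)
print(fake_request.headers)

# With a real reader, registration would mirror the example above:
# reader._parser.session_factory.request_hooks.append(hook)
# reader.update_feed(feed_url)
```

Note that `_parser` is a private attribute, so this registration path may change between reader versions.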

If you want to retry a request only for specific responses, check out ResponseHook; for example, the ua_fallback plugin uses it to retry with a different user agent if it gets a 403 the first time.
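
A minimal sketch of such a response hook follows, under two assumptions that should be checked against the reader documentation: that the hook is called as `(session, response, request, **kwargs)`, and that returning a request asks for a retry while returning `None` keeps the original response. `retry_with_fallback_headers` and `FALLBACK_HEADERS` are hypothetical names, and stand-in objects replace real requests and responses so the snippet runs on its own.

```python
from types import SimpleNamespace

# Hypothetical fallback headers; substitute whatever the server accepts.
FALLBACK_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
}


def retry_with_fallback_headers(session, response, request, **kwargs):
    """Ask for one retry with browser-like headers after a 403.

    Assumed contract: return a request to retry with, or None to keep
    the response as-is.
    """
    if response.status_code != 403:
        return None
    # Guard: if the fallback headers are already set, do not retry again.
    if all(request.headers.get(n) == v for n, v in FALLBACK_HEADERS.items()):
        return None
    request.headers.update(FALLBACK_HEADERS)
    return request


# Stand-ins for the response and request objects the hook would receive.
ok = SimpleNamespace(status_code=200)
forbidden = SimpleNamespace(status_code=403)
request = SimpleNamespace(headers={'Accept': 'application/rss+xml'})

# A non-403 response is left alone.
print(retry_with_fallback_headers(None, ok, request))

# A 403 yields the same request, now carrying the fallback headers.
retried = retry_with_fallback_headers(None, forbidden, request)
print(retried.headers['user-agent'])

# Assumed registration, by analogy with request_hooks above:
# reader._parser.session_factory.response_hooks.append(retry_with_fallback_headers)
```

Compared with a request hook, this keeps reader's default headers for servers that accept them and only switches to the browser-like ones when a server answers 403.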