lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License
438 stars 36 forks

How to use reader to parse string directly ? #255

Closed sorasful closed 1 year ago

sorasful commented 2 years ago

Hello !

Thanks for your work.

I was wondering how one could use your library to parse feeds directly from a str. I mean, I could pass any str containing an Atom, RSS, or JSON feed, and it would be parsed like it already does. Would that be possible? Imagine we don't want to fetch from a URL, or we want to use something other than Requests.

How could we achieve this ?

Thanks.

lemon24 commented 2 years ago

Hello, thank you for opening this! :)

First, some clarification questions (they overlap somewhat, but I'm trying to cover all the angles here).

  1. What problem are you trying to solve? That is, why do you need this?
  2. If I understand correctly, you already have the feed content (the actual XML / JSON), and want to parse that; is this correct?
  3. Where is the feed content coming from? A file? A Requests response?
  4. Do you need the result to be stored by Reader? That is, do you just need the resulting entries, or do you want to be able to call get_/search_entries() and filter them in various ways?

Assuming you have the content and just want the entries, the logic for that is in reader._parser.

This is not part of the public API, but it's relatively unlikely to change in the future; I've been planning to make part of it public, but I didn't have enough use cases and I didn't want to rush it (once it's public, it has to be backwards-compatible). Your use case is a new one :)

I'll explain briefly how it works (the actual logic is in the link above):

reader._parser.default_parser() returns a "meta-parser" object; you use it like this:

# meta_parser() also takes optional caching headers,
# but they're not relevant here
feed, entries, *_ = meta_parser(url)

Underneath, the meta-parser does this:

# the retriever knows how to get a file object and a MIME type from a URL;
# again, ignoring some caching-related details
retriever = meta_parser.get_retriever(url)
file, mime_type, *_ = retriever(url)
if not mime_type:
    mime_type, _ = mimetypes.guess_type(url)

# we get a parser corresponding to the MIME type,
# and ask it to parse the contents of the file object;
# it being a file object avoids keeping the entire content in memory
parser = meta_parser.get_parser_by_mime_type(mime_type)
feed, entries = parser(url, file)

Note that the URL can be the path to a local file; for this, you need to call default_parser(feed_root=...) (it has the same meaning as the make_reader() argument).


Also note that you must have a valid MIME type for the content you want to parse; otherwise, it'll always assume it's Atom/RSS and pass it to feedparser (that is, JSON Feed won't work).

The meta-parser uses the MIME type in headers, or guesses it from the URL if there are no headers (see linked code for details).
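As a minimal illustration of the URL-based fallback, the stdlib mimetypes module guesses the type from the URL's extension (this shows the general mechanism, not reader's exact code path):

```python
import mimetypes

# with a known extension, guess_type() returns the MIME type
print(mimetypes.guess_type('https://example.com/feed.json'))
# → ('application/json', None)

# with no extension there is nothing to go on
print(mimetypes.guess_type('https://example.com/feed'))
# → (None, None)
```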

If you just have the content, it is possible to guess the MIME type based on the actual content:

from io import BytesIO
from reader._parser import default_parser

# https://pypi.org/project/python-magic/
import magic

data: bytes
mime_type = magic.from_buffer(data, mime=True)
meta_parser = default_parser()
parser = meta_parser.get_parser_by_mime_type(mime_type)
# BytesIO is a (binary) file-like wrapper over data;
# note that if your data comes from any kind of stream
# (open file, socket, Requests response),
# it's more efficient to let the parser read() it directly
feed, entries = parser(url='', file=BytesIO(data))

If you want to use a Reader object, the easiest way is to use a temporary file (note it needs to have the proper extension):

import tempfile
from reader import make_reader

data: bytes
reader = make_reader(':memory:', feed_root='')
suffix = '.xml'  # or '.json' for JSON Feed

with tempfile.NamedTemporaryFile(suffix=suffix) as file:
    file.write(data)
    file.flush()
    reader.add_feed(file.name)
    reader.update_feeds()

# use reader normally
sorasful commented 2 years ago

Wow, thank you very much for this very detailed answer! I did not expect that much, TBH :)

My current use case is this:

from reader import Parser

entries = Parser.parse('<rss ..... content as string')

So, as I understood your answer: to be able to use Atom/RSS/JSON feeds, I'll need to use the magic library directly on the content to determine the MIME type, since I do not use the URL to determine the content.

Basically, my use case is just to use your work as a parser, while being agnostic of what kind of feed it is. I would only have to deal with errors.

I hope I was a bit clearer.

Again, thanks for your time.

lemon24 commented 2 years ago

You're welcome!

to be able to use Atom/Rss/Json feeds, I'll need to use the magic library directly on the content to determine the Mime

Since you do have both a URL and the response headers, you don't need to guess the MIME type based on the content. Instead, you can do exactly what the reader meta-parser is doing, but with httpx instead of Requests; that is (pseudocode):

import mimetypes
from io import BytesIO

import httpx
from reader._parser import default_parser

meta_parser = default_parser()

# httpx's API is close to Requests';
# the async part is sketched, in real code
# this runs inside an async function
async with httpx.AsyncClient() as client:
    response = await client.get(url)

# Content-Type may include parameters like "; charset=utf-8"
mime_type = response.headers.get('content-type', '').partition(';')[0].strip()
headers = response.headers
# content has to be bytes, not str
content = response.content

# the rest of the code is similar to
# https://github.com/lemon24/reader/blob/2.1/src/reader/_parser.py#L165-L188

if not mime_type:
    mime_type, _ = mimetypes.guess_type(url)
if not mime_type:
    mime_type = 'application/octet-stream'

parser = meta_parser.get_parser_by_mime_type(mime_type)
assert parser is not None

# the URL and headers allow the parser to
# resolve links, decode the content etc.
feed, entries = parser(url, BytesIO(content), headers)

Note that the last parser(...) call may take some time, so you should likely run it outside the event loop (e.g. with run_in_executor()).
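A minimal sketch of that run_in_executor() pattern; parse_blocking here is an illustrative stand-in, not reader API (in real code its body would be the parser(url, BytesIO(content), headers) call):

```python
import asyncio
from io import BytesIO

def parse_blocking(content: bytes):
    # stand-in for the potentially slow parse;
    # replace the body with the real parser(...) call
    return BytesIO(content).read()

async def main():
    loop = asyncio.get_running_loop()
    # None = the default ThreadPoolExecutor; the event loop
    # keeps serving other tasks while the parse runs in a thread
    return await loop.run_in_executor(None, parse_blocking, b'<rss/>')

print(asyncio.run(main()))
```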


I guess the duplicated logic above could be extracted into a method (but I'm not sure this is the best API):

meta_parser.parse_file(url, mime_type, BytesIO(content), headers)

If I get around to adding it, I'll add a comment in this issue.
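As a sketch, the duplicated MIME-type fallback above could be extracted into a small helper (choose_mime_type is a hypothetical name, not part of reader's API):

```python
import mimetypes

def choose_mime_type(mime_type, url):
    # the fallback chain from the snippet above:
    # header value, then URL extension, then a generic default
    if not mime_type:
        mime_type, _ = mimetypes.guess_type(url)
    if not mime_type:
        mime_type = 'application/octet-stream'
    return mime_type

print(choose_mime_type(None, 'https://example.com/feed.json'))
# → application/json
print(choose_mime_type(None, 'https://example.com/feed'))
# → application/octet-stream
```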

lemon24 commented 1 year ago

The parser internal API is now documented (although still unstable): https://reader.readthedocs.io/en/latest/internal.html#module-reader._parser

I added an HTTPX example at the end: https://reader.readthedocs.io/en/latest/internal.html#parsing-a-feed-retrieved-with-something-other-than-reader

Closing.