lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

JSON Feed support #205

Closed. lemon24 closed this issue 3 years ago.

lemon24 commented 3 years ago

https://en.m.wikipedia.org/wiki/JSON_Feed

https://jsonfeed.org/

Asked about in https://www.reddit.com/r/selfhosted/comments/kioq3g/comment/ggs3kuk?context=3


Question: Is this worth supporting, or a case of featuritis?

The Wikipedia page mentions NPR as a publisher that supports it, and the latest version of the spec mentions about 10 other websites.

Update: Here's some more users: https://indieweb.org/JSON_Feed

We could make it a plug-in.


Regardless of the support required, this is an interesting use case, since to implement it as a separate parser we'd need a way of delegating by extension and/or MIME type.

At the moment, we can only delegate to a parser by feed URL prefix (and making people add "json+http://..." to their feeds is not exactly user friendly).
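
For contrast, prefix-based delegation looks roughly like this (a hypothetical sketch, not reader's actual internals):

```python
# Hypothetical sketch of delegating by feed URL prefix; the parser is
# chosen by a prefix like "json+", which the user has to type as part
# of the feed URL.
def get_subparser_by_prefix(url, subparsers):
    for prefix, subparser in subparsers.items():
        if url.startswith(prefix):
            # strip the prefix before passing the URL on
            return url[len(prefix):], subparser
    return url, None  # no prefix matched; use the default parser

# e.g. with subparsers = {"json+": json_parser}, the user would have
# to add the feed as "json+http://example.com/feed.json"
```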

lemon24 commented 3 years ago

OK, to implement this in a modular way, we'll split the current "subparsers" (HTTPParser/FileParser) into a Retriever and a (Sub)Parser.

The Retriever:

* knows how to get the feed from one kind of location (HTTP, local file, ...);
* given the URL, the caching headers, and an HTTP Accept value, returns the file, its MIME type (if known), new caching headers, and the other response headers.

The (Sub)Parser:

* knows how to parse one kind of feed (XML via feedparser, JSON Feed, ...);
* declares the MIME types it accepts, used both to select it for a retrieved file and to build the merged HTTP Accept header;
* given the URL, the file, and the headers, returns the parsed feed.

Here's pseudo-code of how they all fit together in the (Meta)Parser (the current Parser class):


```python
import mimetypes

# input
url: str = ...
# currently http_etag and http_last_modified
caching_headers: dict = ...

# actually stored on a Parser instance
RETRIEVERS = [HTTPRetriever(), FileRetriever()]
PARSERS = [JSONFeedParser(), FeedparserParser()]

# actually a Parser method
retriever = get_retriever(url)

http_accept = merge_accept_headers(p.accept_headers for p in PARSERS)

file, mime_type, caching_headers, headers = retriever.get(
    url, caching_headers, http_accept
)
if not mime_type:
    # fall back to guessing from the URL;
    # guess_type() returns a (type, encoding) tuple
    mime_type, _ = mimetypes.guess_type(url)

# actually a Parser method
parser = get_parser(mime_type)

parsed_feed = parser(url, file, headers)

rv = parsed_feed, caching_headers
```
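
To make the split concrete, here is a minimal sketch of the two interfaces as typing.Protocols; the names follow the pseudo-code above, but the exact signatures are an assumption, not reader's final API:

```python
from typing import IO, Optional, Protocol, Tuple


class Retriever(Protocol):
    """Knows how to get a feed from one kind of location."""

    def get(
        self, url: str, caching_headers: dict, http_accept: str
    ) -> Tuple[IO[bytes], Optional[str], dict, dict]:
        """Return (file, mime_type, caching_headers, headers)."""


class SubParser(Protocol):
    """Knows how to parse one kind of feed."""

    # used both to select the parser for a retrieved file
    # and to build the merged HTTP Accept header
    accept_headers: str

    def __call__(self, url: str, file: IO[bytes], headers: dict):
        """Return the parsed feed."""
```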

Here's how (sub)parser selection works:

```python
from werkzeug.datastructures import MIMEAccept
from werkzeug.http import parse_accept_header, parse_options_header

# the accept headers come from parser.accept_headers,
# except for the wildcard, which is added manually;
# in practice, feedparser and feedparser (catch-all) are the same object
PARSERS = [
    (parse_accept_header(a, MIMEAccept), parser)
    for a, parser in [
        # everything in feedparser.http.ACCEPT, except the wildcard (*/*);
        # only a few included for brevity
        ("application/atom+xml,application/xml;q=0.9", "feedparser"),
        ("application/feed+json,application/json;q=0.9", "jsonfeed"),
        # for backwards compatibility
        ("*/*;q=0.1", "feedparser (catch-all)"),
    ]
]

def get_parser(mime_type):
    for accept, parser in PARSERS:
        if accept.best_match([mime_type]):
            return parser

def merge_accept_headers():
    values = []
    for accept, _ in PARSERS:
        values.extend(accept)
    return MIMEAccept(values).to_header()

print(merge_accept_headers())

content_types = [
    "application/xml; charset=ISO-8859-1",
    "application/xml",
    "application/whatever+xml",
    "application/json",
    "unknown/type",
]

for content_type in content_types:
    mime_type, _ = parse_options_header(content_type)
    print(content_type, '->', get_parser(mime_type))

"""
application/atom+xml,application/feed+json,application/xml;q=0.9,application/json;q=0.9,*/*;q=0.1
application/xml; charset=ISO-8859-1 -> feedparser
application/xml -> feedparser
application/whatever+xml -> feedparser (catch-all)
application/json -> jsonfeed
unknown/type -> feedparser (catch-all)
"""
```
lemon24 commented 3 years ago

To do:

lemon24 commented 3 years ago

OK, I added / updated all the feeds below:

```
http://shapeof.com/feed.json
http://flyingmeat.com/blog/feed.json
http://maybepizza.com/feed.json
https://daringfireball.net/feeds/json
http://hypercritical.co/feeds/main.json
http://inessential.com/feed.json
https://manton.org/feed/json
https://micro.blog/feeds/manton.json
http://timetable.manton.org/feed.json
http://therecord.co/feed.json
http://www.allenpike.com/feed.json
https://jsonfeed.org/feed.json
https://adactio.com/articles/feed.json
https://jonnybarnes.uk/blog/feed.json
https://matthiasott.com/articles/feed.json
https://ascraeus.org/jsonfeed/index.json
https://feeds.npr.org/1019/feed.json
https://feeds.npr.org/510317/feed.json
```

Most things look fine: authors, dates, attachments, HTML, titles.

The only issue is that feed.updated isn't set (the spec doesn't specify one); we should use the newest entry for that.

Update: This is not specific to JSON feeds; cut #214 for it.
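
If entries expose updated/published datetime attributes (an assumption about the entry model here), the fallback could look roughly like this:

```python
def fallback_feed_updated(feed, entries):
    """Return feed.updated, or the date of the newest entry if it's not set."""
    if feed.updated:
        return feed.updated
    dates = (e.updated or e.published for e in entries)
    return max((d for d in dates if d), default=None)
```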

lemon24 commented 3 years ago

Time spent:

```
thing        hours
design         2.5
refactoring    8.0
tests          2.5
json feed      5.0
cleanup        2.0
docs           0.5

total         20.5
```