JSON Feed support - Githubissues

lemon24 commented 3 years ago

https://en.m.wikipedia.org/wiki/JSON_Feed

Asked about in https://www.reddit.com/r/selfhosted/comments/kioq3g/comment/ggs3kuk?context=3

Question: Is this worth supporting, or a case of featuritis?

The Wikipedia page mentions NPR as a publisher that supports it, and the latest version of the spec mentions about 10 other websites.

Update: Here are some more users: https://indieweb.org/JSON_Feed

We could make it a plug-in.

Regardless of the support required, this is an interesting use case, since to implement it as a separate parser we'd need a way of delegating by extension and/or MIME type.

At the moment, we can only delegate to a parser by feed URL prefix (and making people add "json+http://..." to their feeds is not exactly user friendly).
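To make the limitation concrete, here is a minimal sketch of prefix-only delegation. The registry, the parser names, and `get_parser_by_prefix` are hypothetical and only illustrate the idea; they are not reader's actual code.

```python
# Hypothetical registry mapping a URL prefix to the name of a parser.
PREFIX_PARSERS = {
    "http://": "http",
    "https://": "http",
    "json+http://": "jsonfeed",
    "json+https://": "jsonfeed",
    "": "file",  # the empty prefix matches anything; last resort
}


def get_parser_by_prefix(url):
    """Return the parser registered for the longest prefix matching url."""
    matches = [prefix for prefix in PREFIX_PARSERS if url.startswith(prefix)]
    return PREFIX_PARSERS[max(matches, key=len)]


print(get_parser_by_prefix("https://example.com/feed.json"))
print(get_parser_by_prefix("json+https://example.com/feed.json"))
```

With this scheme a JSON feed served from a plain https:// URL lands on the generic HTTP parser, which is why the user would have to spell out the "json+" prefix.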

lemon24 commented 3 years ago

OK, to implement this in a modular way, we'll split the current "subparsers" (HTTPParser/FileParser) into a Retriever and a (Sub)Parser.

The Retriever:

Is selected by URL prefix (like subparsers are now).
Arguments:
- URL
- optional caching headers
- Accept headers from all the known parsers
Returns:
- file-like object
- optional MIME type
- optional caching headers
- optional response HTTP headers
If no MIME type is returned, it's guessed from the URL using the mimetypes stdlib module.
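The retriever contract above could look roughly like the sketch below. `RetrieveResult`, `FileRetriever`, and `retrieve` are hypothetical names chosen for illustration; only `mimetypes.guess_type` is a real stdlib call.

```python
import mimetypes
from typing import IO, NamedTuple, Optional


class RetrieveResult(NamedTuple):
    file: IO[bytes]
    mime_type: Optional[str] = None
    caching_headers: Optional[dict] = None
    headers: Optional[dict] = None


class FileRetriever:
    """Hypothetical retriever for local paths; it has no MIME type or headers."""

    def get(self, url, caching_headers=None, http_accept=None):
        return RetrieveResult(open(url, "rb"))


def retrieve(retriever, url, caching_headers=None, http_accept=None):
    """Call the retriever, then guess the MIME type from the URL if needed."""
    result = retriever.get(url, caching_headers, http_accept)
    if not result.mime_type:
        # guess_type() returns a (type, encoding) tuple; type may be None
        guessed, _ = mimetypes.guess_type(url)
        result = result._replace(mime_type=guessed)
    return result
```

An HTTP retriever would fill in `mime_type` from the Content-Type response header instead, so the guess only kicks in for sources that carry no type information.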

The (Sub)Parser:

Is selected by the MIME type returned by the retriever. (We should probably have feedparser as a fallback when no MIME type can be guessed, for backwards compatibility.)
- How? Exact match? Do we support type/* and */*? Should application/unknown+xml fall back to application/xml?
- feedparser uses the following Accept headers at the moment: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1 (note the */* catch-all).
- JSON Feed uses application/json (v1) and application/feed+json.
- Should there be a way to special-case a URL (prefix)? How do we support plugins like sqlite_releases?
Arguments:
- URL
- file object
- response HTTP headers
Returns: the parsed feed.
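As a rough illustration of the (sub)parser contract, here is a sketch of a JSON Feed parser that takes the URL, a file object, and the response headers. The class name and the shape of the returned data are assumptions made for this example; the JSON keys (`version`, `home_page_url`, `items`, `content_html`, `date_published`, and so on) come from the JSON Feed spec.

```python
import io
import json


class JSONFeedParser:
    """Hypothetical sub-parser; returns plain dicts for simplicity."""

    accept_header = "application/feed+json,application/json;q=0.9"

    def __call__(self, url, file, headers=None):
        data = json.load(file)
        version = data.get("version", "")
        if not version.startswith("https://jsonfeed.org/version/"):
            raise ValueError(f"not a JSON Feed: {url}")
        feed = {
            "url": url,
            "title": data.get("title"),
            "link": data.get("home_page_url"),
        }
        entries = [
            {
                "id": str(item["id"]),
                "title": item.get("title"),
                "link": item.get("url"),
                "published": item.get("date_published"),
                "updated": item.get("date_modified"),
                "content": item.get("content_html") or item.get("content_text"),
            }
            for item in data.get("items", [])
        ]
        return feed, entries


document = b"""{
    "version": "https://jsonfeed.org/version/1",
    "title": "Example",
    "items": [{"id": 1, "content_text": "hello"}]
}"""
feed, entries = JSONFeedParser()("https://example.com/feed.json", io.BytesIO(document))
print(feed["title"], entries[0]["id"], entries[0]["content"])
```

Dates are left as the raw strings from the document here; a real parser would also need to convert them to datetime objects.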

Here's pseudo-code of how they all fit together in the (Meta)Parser (the current Parser class):


# input
url: str = ...
# currently http_etag and http_last_modified
caching_headers: dict = ...

# actually stored on a Parser instance
RETRIEVERS = [HTTPRetriever(), FileRetriever()]
PARSERS = [JSONFeedParser(), FeedparserParser()]

# actually a Parser method
retriever = get_retriever(url)

http_accept = merge_accept_headers(p.accept_headers for p in PARSERS)

file, mime_type, caching_headers, headers = retriever.get(
    url, caching_headers, http_accept
)
if not mime_type:
    # mimetypes.guess_type() returns a (type, encoding) tuple
    mime_type, _ = mimetypes.guess_type(url)

# actually a Parser method
parser = get_parser(mime_type)

parsed_feed = parser(url, file, headers)

rv = parsed_feed, caching_headers

Here's how (sub)parser selection works:

from werkzeug.datastructures import MIMEAccept
from werkzeug.http import parse_accept_header, parse_options_header

# the accept headers come from parser.accept_header,
# except for the wildcard, which is added manually;
# in practice, feedparser and feedparser (catch-all) are the same object
PARSERS = [
    (parse_accept_header(a, MIMEAccept), parser)
    for a, parser in [
        # everything in feedparser.http.ACCEPT, except the wildcard (*/*);
        # only a few included for brevity
        ("application/atom+xml,application/xml;q=0.9", "feedparser"),
        ("application/feed+json,application/json;q=0.9", "jsonfeed"),
        # for backwards compatibility
        ("*/*;q=0.1", "feedparser (catch-all)"),
    ]
]

def get_parser(mime_type):
    for accept, parser in PARSERS:
        if accept.best_match([mime_type]):
            return parser

def merge_accept_headers():
    values = []
    for accept, _ in PARSERS:
        values.extend(accept)
    return MIMEAccept(values).to_header()

print(merge_accept_headers())

content_types = [
    "application/xml; charset=ISO-8859-1",
    "application/xml",
    "application/whatever+xml",
    "application/json",
    "unknown/type",
]

for content_type in content_types:
    mime_type, _ = parse_options_header(content_type)
    print(content_type, '->', get_parser(mime_type))

"""
application/atom+xml,application/feed+json,application/xml;q=0.9,application/json;q=0.9,*/*;q=0.1
application/xml; charset=ISO-8859-1 -> feedparser
application/xml -> feedparser
application/whatever+xml -> feedparser (catch-all)
application/json -> jsonfeed
unknown/type -> feedparser (catch-all)
"""
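In the output above, application/whatever+xml ends up at the catch-all. If we wanted unknown +xml types to fall back to application/xml instead, one option is to try progressively less specific patterns. The sketch below uses only plain string handling and is an alternative to the werkzeug matching, not a description of it; `candidates` and the dict-based registry are hypothetical.

```python
def candidates(mime_type):
    """Yield the patterns to try for mime_type, most specific first."""
    type_, _, subtype = mime_type.partition("/")
    yield mime_type
    if "+" in subtype:
        # structured syntax suffix: application/whatever+xml -> application/xml
        yield f"{type_}/{subtype.rpartition('+')[2]}"
    yield f"{type_}/*"
    yield "*/*"


# Hypothetical registry keyed by exact type or wildcard pattern.
PARSERS = {
    "application/atom+xml": "feedparser",
    "application/xml": "feedparser",
    "application/feed+json": "jsonfeed",
    "application/json": "jsonfeed",
    "*/*": "feedparser (catch-all)",
}


def get_parser(mime_type):
    for pattern in candidates(mime_type):
        if pattern in PARSERS:
            return PARSERS[pattern]
    return None


for mime_type in ["application/whatever+xml", "application/json", "unknown/type"]:
    print(mime_type, "->", get_parser(mime_type))
```

The trade-off is that quality values are ignored: the most specific registered pattern always wins, regardless of the q= weights in the Accept headers.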

lemon24 commented 3 years ago

To do:

- [x] decide how parser matching works
- [x] refactor current code
- [x] implement JSON Feed parser
- [x] documentation
  - ~~[x] werkzeug dependency~~
  - [x] changelog
  - [x] index
  - [x] docstrings (which?)
- [x] fix sqlite_releases
- [x] clean up _parser.py code
  - [x] use type aliases
  - [x] maybe move URL stuff into a module
  - [x] reorder
  - [ ] docstrings
  - ~~[ ] maybe get rid of caching_get~~
  - [ ] maybe get rid of _NotModified and use feed=None instead
- [x] manual test

lemon24 commented 3 years ago

OK, I added / updated all the feeds below:

```
http://shapeof.com/feed.json
http://flyingmeat.com/blog/feed.json
http://maybepizza.com/feed.json
https://daringfireball.net/feeds/json
http://hypercritical.co/feeds/main.json
http://inessential.com/feed.json
https://manton.org/feed/json
https://micro.blog/feeds/manton.json
http://timetable.manton.org/feed.json
http://therecord.co/feed.json
http://www.allenpike.com/feed.json
https://jsonfeed.org/feed.json
https://adactio.com/articles/feed.json
https://jonnybarnes.uk/blog/feed.json
https://matthiasott.com/articles/feed.json
https://ascraeus.org/jsonfeed/index.json
https://feeds.npr.org/1019/feed.json
https://feeds.npr.org/510317/feed.json
```

Most things look fine: authors, dates, attachments, HTML, titles.

The only issue is that feed.updated isn't set (the spec doesn't specify one); we should use the newest entry for that.

Update: This isn't specific to JSON Feed; opened #214 for it.
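Deriving feed.updated from the newest entry could be as simple as the sketch below. It assumes entries are dicts whose "updated" and "published" values are datetime objects or None; the function name and the entry shape are made up for illustration.

```python
from datetime import datetime, timezone


def newest_entry_date(entries):
    """Return the most recent updated/published date, or None if there is none."""
    dates = [entry.get("updated") or entry.get("published") for entry in entries]
    return max((d for d in dates if d is not None), default=None)


entries = [
    {"published": datetime(2021, 1, 1, tzinfo=timezone.utc)},
    {
        "published": datetime(2021, 1, 2, tzinfo=timezone.utc),
        "updated": datetime(2021, 1, 3, tzinfo=timezone.utc),
    },
    {},  # an entry with no dates at all is skipped
]
print(newest_entry_date(entries))
```

Mixing naive and timezone-aware datetimes would make `max()` raise, so the dates should be normalized before they get here.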

lemon24 commented 3 years ago

Time spent:

thing        hours
design         2.5
refactoring    8.0
tests          2.5
json feed      5.0
cleanup        2.0
docs           0.5
total         20.5

lemon24 / reader

JSON Feed support #205