Closed lemon24 closed 3 years ago
OK, to implement this in a modular way, we'll split the current "subparsers" (HTTPParser/FileParser) into a Retriever and a (Sub)Parser.
The Retriever:
The (Sub)Parser:
type/* and */*? Should application/unknown+xml fall back to application/xml?
application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1 (note the */* catchall).
application/json (v1) and application/feed+json.
Here's pseudo-code of how they all fit together in the (Meta)Parser (the current Parser class):
import mimetypes

# input
url: str = ...
# currently http_etag and http_last_modified
caching_headers: dict = ...

# actually stored on a Parser instance
RETRIEVERS = [HTTPRetriever(), FileRetriever()]
PARSERS = [JSONFeedParser(), FeedparserParser()]

# actually a Parser method
retriever = get_retriever(url)
http_accept = merge_accept_headers(p.accept_headers for p in PARSERS)
file, mime_type, caching_headers, headers = retriever.get(
    url, caching_headers, http_accept
)
if not mime_type:
    # guess_type() returns a (type, encoding) tuple
    mime_type, _ = mimetypes.guess_type(url)

# actually a Parser method
parser = get_parser(mime_type)
parsed_feed = parser(url, file, headers)
rv = parsed_feed, caching_headers
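The two interfaces implied by the pseudo-code above could be written down roughly as follows. This is a minimal sketch, not the actual implementation: the Protocol classes, InMemoryRetriever, TitleParser and parse() are hypothetical names; only the get() arguments, its four return values, the accept_header attribute and the parser call signature are taken from the pseudo-code.

```python
import io
from typing import IO, Optional, Protocol, Tuple


class Retriever(Protocol):
    def get(
        self, url: str, caching_headers: dict, http_accept: str
    ) -> Tuple[IO[bytes], Optional[str], dict, dict]:
        """Return (file, mime_type, caching_headers, headers)."""
        ...


class SubParser(Protocol):
    accept_header: str

    def __call__(self, url: str, file: IO[bytes], headers: dict) -> dict:
        ...


class InMemoryRetriever:
    """Hypothetical retriever serving canned bytes instead of doing I/O."""

    def __init__(self, data: bytes, mime_type: Optional[str] = None):
        self.data = data
        self.mime_type = mime_type
        self.seen_accept = None

    def get(self, url, caching_headers, http_accept):
        # remember what the caller asked for, so it can be inspected later
        self.seen_accept = http_accept
        return io.BytesIO(self.data), self.mime_type, caching_headers, {}


class TitleParser:
    """Hypothetical parser: treats the whole body as the feed title."""

    accept_header = "text/plain"

    def __call__(self, url, file, headers):
        return {"url": url, "title": file.read().decode("utf-8")}


def parse(url: str, retriever: Retriever, parser: SubParser, caching_headers: dict):
    # the retriever only moves bytes; the parser only interprets them
    file, mime_type, caching_headers, headers = retriever.get(
        url, caching_headers, parser.accept_header
    )
    return parser(url, file, headers), caching_headers
```

Keeping the retriever ignorant of feed formats is what lets the same parser work for both HTTP and local files.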
Here's how (sub)parser selection works:
from werkzeug.datastructures import MIMEAccept
from werkzeug.http import parse_accept_header, parse_options_header

# the accept headers come from parser.accept_header,
# except for the wildcard, which is added manually;
# in practice, feedparser and feedparser (catch-all) are the same object
PARSERS = [
    (parse_accept_header(a, MIMEAccept), parser)
    for a, parser in [
        # everything in feedparser.http.ACCEPT, except the wildcard (*/*);
        # only a few included for brevity
        ("application/atom+xml,application/xml;q=0.9", "feedparser"),
        ("application/feed+json,application/json;q=0.9", "jsonfeed"),
        # for backwards compatibility
        ("*/*;q=0.1", "feedparser (catch-all)"),
    ]
]

def get_parser(mime_type):
    for accept, parser in PARSERS:
        if accept.best_match([mime_type]):
            return parser

def merge_accept_headers():
    values = []
    for accept, _ in PARSERS:
        values.extend(accept)
    return MIMEAccept(values).to_header()

print(merge_accept_headers())

content_types = [
    "application/xml; charset=ISO-8859-1",
    "application/xml",
    "application/whatever+xml",
    "application/json",
    "unknown/type",
]
for content_type in content_types:
    mime_type, _ = parse_options_header(content_type)
    print(content_type, '->', get_parser(mime_type))
"""
application/atom+xml,application/feed+json,application/xml;q=0.9,application/json;q=0.9,*/*;q=0.1
application/xml; charset=ISO-8859-1 -> feedparser
application/xml -> feedparser
application/whatever+xml -> feedparser (catch-all)
application/json -> jsonfeed
unknown/type -> feedparser (catch-all)
"""
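In the output above, application/whatever+xml only reaches feedparser through the catch-all. One possible answer to the open question about +xml fallback is to try the structured suffix before the wildcards. The following is a stdlib-only sketch of that idea, not something the design above commits to; candidate_mime_types, get_parser_with_fallback and the PARSERS_BY_MIME_TYPE table are hypothetical.

```python
def candidate_mime_types(mime_type):
    """Yield the type itself, then its suffix fallback, then wildcards."""
    mime_type = mime_type.strip().lower()
    yield mime_type
    type_, _, subtype = mime_type.partition("/")
    if "+" in subtype:
        # application/whatever+xml -> application/xml
        yield f"{type_}/{subtype.rsplit('+', 1)[1]}"
    yield f"{type_}/*"
    yield "*/*"


# hypothetical lookup table; parser names mirror the example above
PARSERS_BY_MIME_TYPE = {
    "application/atom+xml": "feedparser",
    "application/xml": "feedparser",
    "application/feed+json": "jsonfeed",
    "application/json": "jsonfeed",
    "*/*": "feedparser (catch-all)",
}


def get_parser_with_fallback(mime_type):
    # first candidate with a registered parser wins
    for candidate in candidate_mime_types(mime_type):
        parser = PARSERS_BY_MIME_TYPE.get(candidate)
        if parser is not None:
            return parser
    return None
```

With this ordering an unknown +json type would go to the JSON parser instead of the catch-all, which may or may not be desirable.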
To do:
Most things look fine: authors, dates, attachments, HTML, titles.
The only issue is that feed.updated isn't set (the spec doesn't specify one); we should use the newest entry for that.
Update: This isn't specific to JSON feeds; opened #214 for it.
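Deriving feed.updated from the newest entry could look roughly like this. It is a minimal sketch under assumptions: entries are represented as plain dicts with an optional "updated" datetime, and newest_entry_updated is a hypothetical helper, not part of any existing API.

```python
from datetime import datetime
from typing import Iterable, Optional


def newest_entry_updated(entries: Iterable[dict]) -> Optional[datetime]:
    """Return the most recent 'updated' value among entries, or None."""
    # skip entries that have no usable date
    dates = [e["updated"] for e in entries if e.get("updated") is not None]
    return max(dates, default=None)
```

The dates must be all naive or all timezone-aware; comparing a mix of the two raises TypeError.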
Time spent:
thing        hours
design         2.5
refactoring    8.0
tests          2.5
json feed      5.0
cleanup        2.0
docs           0.5
total         20.5
https://en.m.wikipedia.org/wiki/JSON_Feed
https://jsonfeed.org/
Asked about in https://www.reddit.com/r/selfhosted/comments/kioq3g/comment/ggs3kuk?context=3
Question: Is this worth supporting, or a case of featuritis?
The Wikipedia page mentions NPR as a publisher that supports it, and the latest version of the spec mentions about 10 other websites.
Update: Here are some more users: https://indieweb.org/JSON_Feed
We could make it a plug-in.
Regardless of the support required, this is an interesting use case, since to implement it as a separate parser we'd need a way of delegating by extension and/or MIME type.
At the moment, we can only delegate to a parser by feed URL prefix (and making people add "json+http://..." to their feeds is not exactly user friendly).
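To make the contrast concrete, here is a rough sketch of both approaches. All names are hypothetical: parser_by_url_prefix imitates prefix-based delegation in general (not the current code), and guess_mime_type shows how a MIME type could be obtained from the Content-Type header or, failing that, from the URL's extension.

```python
import mimetypes
from urllib.parse import urlparse


def parser_by_url_prefix(url, prefixes):
    """Prefix delegation: the user has to spell out the parser in the URL."""
    for prefix, parser in prefixes.items():
        if url.startswith(prefix):
            return parser, url[len(prefix):]
    return None, url


def guess_mime_type(url, content_type=None):
    """Prefer the server's Content-Type; fall back to the URL path extension."""
    if content_type:
        # drop parameters such as "; charset=utf-8"
        return content_type.split(";", 1)[0].strip().lower()
    mime_type, _ = mimetypes.guess_type(urlparse(url).path)
    return mime_type
```

With the second approach the feed URL stays exactly as published, and the parser is picked from what the server (or the file name) says about the content.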