Open konklone opened 10 years ago
Can/should we have a folder in this repo called `examples` for language-specific examples of the spec? Seeing and working with live code might be a solid way to move this forward.
Personally, I think this is great. I'll spend time thinking about how government entities might use this. One thing I want to give a :thumbsup: to is the differentiation between a URL and an id. Within my main line of sight--the Code--I can see how the URL may be the same for two distinctly different data records.
So a full stack implementation of this would look like what exactly?
My default position is to not write unnecessary code. So, let's evaluate whether this project is necessary.
First, I think we should be talking about Atom instead of RSS, because RSS is deficient and not the XML format of choice for representing this data.
The URL discussion is a distraction. Atom has an `id` field, which must be an IRI, which is more general than a URL and can look as simple as:

    scheme:rest

So, you can stick to Atom without having URLs.

> without losing data fidelity (which is what RSS as written would do)
Atom is easily and frequently extended with namespaces. Of the four other motivations, the first three can be done with Atom. So, the only unwritten motivation that might stand up is "we just want JSON."
While there are things to gain from a JSON format (smaller payloads on mobile devices, etc.), there's a large ecosystem that exists specifically around Atom: tools and services for pushing feeds, pulling feeds, caching items, federating, identifying duplicates, etc. i.e. the full stack already exists - and more.
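As a sketch of how little code the Atom side needs, here is a pull of an entry's IRI-valued `id`, title, link, and date using only Python's standard library (the feed snippet and helper name below are hypothetical, and real-world feeds can be messier):

```python
# Minimal Atom entry extraction with the standard library only.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def entries(atom_xml):
    """Parse an Atom document string into plain dicts."""
    root = ET.fromstring(atom_xml)
    items = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        items.append({
            "id": entry.findtext(ATOM + "id"),  # any IRI, e.g. "tag:..."
            "title": entry.findtext(ATOM + "title"),
            "url": link.get("href") if link is not None else None,
            "published_at": entry.findtext(ATOM + "published"),
        })
    return items
```

Note that the `id` here need not be a URL at all; a `tag:` or `urn:` IRI round-trips just as well.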
It would help to be clear about what's so awful about simply using Atom.
A code sketch:

`scraper.py`:

```python
# Sketch only: requests and BeautifulSoup are real libraries, but the
# URLs, CSS class, and cron_run() scheduler are placeholders.
import requests
from bs4 import BeautifulSoup

def subscribers():
    return ["http://kon.clone/jss-subs"]

def to_jss(item):
    # Illustrative values; a real scraper would pull these from `item`.
    return {
        "title": "An Insightful Blog Post",
        "description": "The beginning of my post is as follows...",
        "url": "https://someonegreat.com/insightful-blog-post",
        "published_at": "2014-04-10T19:00:00Z",
    }

def notify_subscribers(subscribers, data):
    for sub in subscribers:
        requests.post(sub, json=data)

def scrape():
    response = requests.get("http://myurl.com")
    soup = BeautifulSoup(response.text)
    items = soup.find_all("a", {"class": "item"})
    return [to_jss(i) for i in items]

def scrape_and_notify():
    notify_subscribers(subscribers(), scrape())

cron_run("once per day", scrape_and_notify)  # pseudocode scheduler
```
`ingestor.py`:

```python
# Pseudocode: a web route that stores whatever JSON it receives.
@post("/jss-subs")
def handle_notification(params):
    db.save(params)
```
Anyway, if it does go forward, here is some prior art:
Of course, there are other feed specs in JSON that go outside the RSS/Atom box, like activitystrea.ms (spec). And there are hypermedia projects like Collection+JSON which is based on Atom, though it's more generic (examples, tutorial, an active mailing list (I see Yehuda Katz in there, who is considering it for ember-data), they even have an IANA-registered media type).
If this project is to in fact come up with a JSON representation of Atom/RSS, then it's necessary to study prior initiatives (or join the active ones). If the goal is to just come up with something that people in this thread and a few others can work with, we should change the name and make it clear that the project is not trying to respond to the use cases and requirements of all the people who would care about a JSON mapping of Atom/RSS (which is what the current name commits this project to).
@jpmckinney those are some very good points, and the prior art looks tremendously useful. My one question is, if Atom can accomplish all/most of what @konklone laid out, then why aren't members of the scraper community (not a real community) like Courtlistener using Atom? I know you obviously can't speak for the Courtlistener folks, but I'm trying to understand if there's some larger cultural thing preventing adoption of Atom in some places, or if there is a legit technical issue that could be addressed with this project.
My understanding is that very few people who are scraping data are using any sort of standard data specification. People are using their own SQL table definitions, their own MongoDB schema, their own specifications of JSON files, etc. (JSON (on its own) is simply a file format. It doesn't tell you to use "name" for a person's name instead of "full_name", "fullName", "fn", etc.)
So, the broader question is: why aren't people using any common data spec? The answer to that is simply that:
For example, if I want to scrape data and have it appear in my app, there's no benefit to putting that in Atom, or Schema.org RDF, or whatever else. Dumping data to a database and loading it into an app doesn't benefit from a standard data specification like what we're talking about here.
The interest expressed here is to simplify tools like Scout, which instead of being monolithic projects can be broken into smaller, more reusable, more maintainable parts. Scout pulls feeds, identifies duplicates, syndicates content, etc. It has a lot in common with RSS systems. RSS systems have clearly benefited from having a common interchange format. The idea here is that a common format would help Scout's use cases. I'm saying that format can just as well be Atom, from what I understand of the problem definition.
@jpmckinney pretty well hit the nail on the head with regard to CourtListener's data format, but leaves out one factor and underplays another. The left-out factor is that when we went to build an API, the biggest determinant of our format was the API library we used, Tastypie. It has a decent format and we did little to change it.
The underplayed issue in @jpmckinney's comment is that I (and I suspect many others) am very bearish on standards like this one... I've just seen a lot of talk about making them, but it's much easier to talk than build. If there was a single great standard, we'd use it, but lacking one, I fear our consumers will need to make adaptors. I'm reminded of xkcd: http://xkcd.com/927/
The other thought I have is that even with a standard like this, you're often just pushing the mess around. For example, say we all use this or we all use Atom. That'll just mean more complication from our API where the standard allows customization (e.g. in your data node), and I fear that's not much of an improvement over just rolling our own data format.
I'm happy to be wrong about all this. These are just my observations, and I generally eschew discussions of data formats when I can.
I'm also bearish on standards, but maybe for a different reason: standards are simply incredibly long and hard to develop; therefore, you need to have really good reasons before you start down that road. Also, there are no shortcuts, even if you avoid standards bodies. Many people don't believe this last part. In most cases, it's better to not have a standard, and to just have some format that you and a few people you care about agree on. Maybe it will evolve into a standard, but don't aim for standardization early unless you must in order to accomplish your goal.
Tastypie offers some formatting (e.g. with respect to pagination and a few data discovery endpoints), but the values inside objects are entirely up to each project and are therefore not standard. This project would be attempting to at least offer a few standard fields for each of those objects (e.g. `id`, `title`, `description`, etc.).
> In most cases, it's better to not have a standard, and to just have some format that you and a few people you care about agree on. Maybe it will evolve into a standard, but don't aim for standardization early unless you must in order to accomplish your goal.
That's exactly what I plan to do -- this thread is me CCing some people I care about, to see if they're interested in joining me. @adelevie and I have put down a few scrapers we could use as test beds in #2.
I drafted this after talking with @adelevie about a way for merging the work of scrapers and getting that work downstream in a sane way. Maybe the primary benefit is just scraper output. But consider the benefits of loading, say, IG reports and court opinions into an intermediary API whose sole purpose is to let multiple downstream APIs sync with them and serve them in a product-oriented way.
So it may be that a spec like this is more useful for those sort of workhorse APIs than it is for highly tailored, branded, flagship APIs like the Sunlight Congress API or CourtListener API.
It might be more obvious what I'm going for here, and why this differs from the prior art linked above, if I invert my proposed item structure.
```json
{
  "home": "https://someonegreat.com",
  "author": {
    "email": "someone@someonegreat.com",
    "twitter": "https://twitter.com/someonegreat"
  },
  "tags": ["insight", "please post me to hacker news"],
  "rss": {
    "title": "An Insightful Blog Post",
    "description": "The beginning of my post is as follows...",
    "url": "https://someonegreat.com/insightful-blog-post",
    "published_at": "2014-04-10T19:00:00Z",
    "source": "someonegreat.com",
    "type": "article",
    "id": "insightful-blog-post"
  }
}
```
I could add an `@` sign in front of `rss` and it'd feel a bit like JSON-LD. The above might not be useful as written, though, as it's only item-level and doesn't specify a way of getting at whatever array of results is in a given response. But that could be worked out separately.
Anyway, I'm all for Just Shipping and seeing how people react. But I don't believe XML based formats, even ones with a strong support ecology like Atom, can avoid complicating the process of data syndication compared to JSON. I don't say this because JSON is the New Thing, but because every popular high level language has a JSONic primitive that is often how the data is stored and shuttled around internally. "We just want JSON" is more than an aesthetic preference for me; it's a fundamental design choice.
I don't know why you'd add an `@` in front of `rss` - that's not how JSON-LD works. Not that we were discussing JSON-LD anyway.
Writing an example doesn't spell out the difference with prior art. Here's what I gather: one of the prior formats uses `type` and `value` properties, presumably to preserve markup and specificity from XML; another uses `value` and `$t` properties (probably in order to convert any XML to JSON), and otherwise converts the XML to objects and arrays, keeping the same XML tags as JSON keys.

I'm not at all clear on why JSON-C is different from your example (example). You use `published_at` instead of Atom's term `published` and `url` instead of `link`. I wouldn't say those changes are improving your spec substantially - better to reuse the existing terms unless there is good reason not to. You add `author.twitter`, but I'm not sure who actually wants to use that. `author.uri` seems to cover that use case, unless the Twitter use case is very popular. I'm not sure what the semantics of `home` are. You add `type`, which I'm surprised isn't in Atom - maybe I missed it. All the others are in Atom.
Building on something that already exists means that you can avoid making mistakes or omitting things that you didn't realize were important earlier, but that someone else has thankfully figured out. So, at least use the same terms as Atom (why not?), and use the prior art for inspiration and guidance.
I still think changing the project name to something that better represents the scope of the project will avoid unnecessary acrimony, as the current name makes the project purport to be a true translation of RSS to JSON, which is not the goal. Maybe "Scraper Syndication Specification" or "SSS".
I don't have particularly strong feelings about any of this, but it's worth noting that Dave Winer seems to have started a conversation around it two years ago:
http://scripting.com/stories/2012/09/10/rssInJsonForReal.html http://rssjs.org/
The comments there show that folks have been talking about (and in some cases, building) this for nearly a decade. @jpmckinney has unearthed some useful parts of it in this thread already.
> I'm not at all clear on why JSON-C is different from your example (example).

I don't see the relationship at all, actually - did you mean Google's feed JSON? That's a lot closer.
> You use `published_at` instead of Atom's term `published` and `url` instead of `link`. I wouldn't say those changes are improving your spec substantially - better to reuse the existing terms unless there is good reason not to.
Sure, agreed. I'm not trying to finalize specific field names here.
> You add `author.twitter`, but I'm not sure who actually wants to use that. `author.uri` seems to cover that use case, unless the Twitter use case is very popular. I'm not sure what the semantics of `home` are.
Actually, those are not what I'm proposing -- those are examples of fields outside the spec, that would be specific to whatever you're syndicating. I'm suggesting that the spec make obvious space for completely unspecified fields. The never-shipped Twitter Annotations is a conceptual model here.
> I don't have particularly strong feelings about any of this, but it's worth noting that Dave Winer seems to have started a conversation around it two years ago:
>
> http://scripting.com/stories/2012/09/10/rssInJsonForReal.html http://rssjs.org/
Yeah, I think he's on to something, and it got some attention. But I don't think it's gone anywhere since (the last update to rssjs.org is still Sep 2012). Google's result JSON spec is the closest mirror to RSS/Atom that I've seen shipped anywhere, but even then, some quick GitHub searching didn't turn up much of an ecology dedicated to that spec as a general-purpose format.
I feel confident that I'm suggesting something which is not covered by any successful prior art. I don't think I'm throwing away any existing ecology of support libraries that meets my needs.
How this "something" is actually shaped could be very different from what I've laid out in examples, and I'm interested in finding the most attractive and least disruptive shape for it. Ideally, this is something like "the JSON format I already have, plus the fields that RSS identified as helpful for generic syndication", which are basically `title`/`description`/`link`/`date`.
If it helps to coalesce the discussion, my most immediate question is: how do I alter the following JSON output that my IG report scrapers create, to make it easier for others to ingest:
```json
{
  "agency": "dhs",
  "agency_name": "Department of Homeland Security",
  "file_type": "pdf",
  "inspector": "dhs",
  "inspector_url": "http://www.oig.dhs.gov",
  "published_on": "2014-04-01",
  "report_id": "OIG-14-60",
  "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
  "type": "report",
  "url": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
  "year": 2014
}
```
Any proposal is welcome!
The 19 page slideshow for Twitter Annotations didn't really give me a sense of where they were going, but I guess the idea is: publish the same JSON object as you had before, but add a key (`rss` in your example) whose value is an object that has standard fields (like `title`, etc.).
Re: JSON-C: look inside the `items` array. It is a JSON version of an Atom document with `title`, `description` (and other fields that are specific to Google's use case, which you can safely ignore). Google's Feed API is not as close a mapping as JSON-C, whose goal is to be a mapping.

Anyway, moving on! If the Twitter Annotations method is as I described, then we just need to agree on what term to use for that key, and agree on some terms within the object. There are plenty of generic metadata terms in Atom and of course Dublin Core which we can choose from for the object's keys. As for the key itself, HAL adds underscore keys like `_links` and `_embedded`. We can maybe use `_item` as a nod to RSS/Atom.
So:
```json
{
  "agency": "dhs",
  "agency_name": "Department of Homeland Security",
  "file_type": "pdf",
  "inspector": "dhs",
  "inspector_url": "http://www.oig.dhs.gov",
  "published_on": "2014-04-01",
  "report_id": "OIG-14-60",
  "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
  "type": "report",
  "url": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
  "year": 2014,
  "_item": {
    "id": "something-ending-in-OIG-14-60",
    "type": "report",
    "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
    "link": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
    "format": "application/pdf",
    "published": "2014-04-01",
    "published$year": 2014
  }
}
```
I don't know what to do with `agency`, `agency_name`, `inspector`, or `inspector_url`, or if it's even necessary to have those in the `_item`. I used `published$year` because the year is really a property of `published` and can be derived from it, and `$` allows you to use JavaScript dot-notation for accessing the property. All terms are from Atom and Dublin Core. Using MIME types is a good idea where possible.
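A downstream consumer of an `_item` envelope like the one above could ignore every free-form key and read only the standard fields. A minimal Python sketch (the helper name is hypothetical, and the field list assumes the `_item` layout from the example):

```python
def read_standard_fields(record):
    # Ignore free-form top-level keys (agency, report_id, etc.) and
    # read only the standard "_item" envelope, if present.
    item = record.get("_item", {})
    return {
        "id": item.get("id"),
        "title": item.get("title"),
        "link": item.get("link"),
        "published": item.get("published"),
    }
```

The point of the envelope is exactly this: an ingestor can stay ignorant of each scraper's domain-specific keys.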
This is actually the same effort required to add a JSON-LD `@context` block to a JSON document. Seems odd to not follow that approach.
> The 19 page slideshow for Twitter Annotations didn't really give me a sense of where they were going, but I guess the idea is: publish the same JSON object as you had before, but add a key (rss in your example) whose value is an object that has standard fields (like title, etc.)

Sort of, but their annotations would have been totally free-form (like the `data` object in my original example), and the Twitter standard fields would be the standard. Either way, the goal is to combine standard with free-form.
> and $ allows you to use JavaScript dot-notation for accessing the property

Really? :o Is that a JavaScript/JSON thing I'm not familiar with?
Maybe JSON-LD has something here. Would this approach work:
```json
{
  "agency": "dhs",
  "agency_name": "Department of Homeland Security",
  "file_type": "pdf",
  "inspector": "dhs",
  "inspector_url": "http://www.oig.dhs.gov",
  "published_on": "2014-04-01",
  "report_id": "OIG-14-60",
  "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
  "type": "report",
  "url": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
  "year": 2014,
  "@id": "something-ending-in-OIG-14-60",
  "@type": "report",
  "@title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
  "@link": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
  "@published": "2014-04-01"
}
```
or, here's an alternate idea, putting this in a full "channel"-like context, where you provide a small top-level channel object which points you to the array where the `@item`s are.
```json
{
  "results": [
    {
      "agency": "dhs",
      "agency_name": "Department of Homeland Security",
      "file_type": "pdf",
      "inspector": "dhs",
      "inspector_url": "http://www.oig.dhs.gov",
      "published_on": "2014-04-01",
      "report_id": "OIG-14-60",
      "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
      "type": "report",
      "url": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
      "year": 2014,
      "@item": {
        "id": "something-ending-in-OIG-14-60",
        "type": "ig_report",
        "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
        "link": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
        "published": "2014-04-01"
      }
    }
  ],
  "@rss": {
    "items": "results",
    "channel": "name of my API"
  }
}
```
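A consumer of this channel shape would treat the `@rss` block as a pointer to wherever the item array lives. A small Python sketch (the helper name is hypothetical, and it assumes the `@rss.items` convention above):

```python
def channel_items(doc):
    # "@rss.items" names the top-level key that holds the item array,
    # so each API can keep whatever key it already uses ("results",
    # "reports", etc.) and still be discoverable.
    meta = doc.get("@rss", {})
    return doc.get(meta.get("items", "items"), [])
```

This is the appeal of the indirection: existing response shapes stay untouched, and only the small pointer block is standardized.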
> Either way, the goal is to combine standard with free-form.
This is precisely what JSON-LD solves (in its particular linked data way, which may or may not be a deal breaker).
> Really? :o Is that a JavaScript/JSON thing I'm not familiar with?
Try it in the browser:

```js
o = JSON.parse('{"foo$bar":1}')
o.foo$bar // returns 1
```
For "full channel" I think simply adopting an existing hypermedia spec would be best, whether HAL, Collection+JSON or Siren. At least one of them looks a lot like your example, so you may as well have people adopt something that exists than adopt something new. Adopting one of these doesn't mean using 100% of its features - you can just adopt its way of expressing collection/item relationships if that's all you need, and drop the optional features. Anyway, to start, I think we can just focus on the item and leave channels out for now.
For the item, JSON-LD isn't just adding `@` in front of words :) Here is the JSON-LD version of my `_item` example:
```json
{
  "agency": "dhs",
  "agency_name": "Department of Homeland Security",
  "file_type": "pdf",
  "inspector": "dhs",
  "inspector_url": "http://www.oig.dhs.gov",
  "published_on": "2014-04-01",
  "report_id": "OIG-14-60",
  "title": "Management Letter for the FY 2013 DHS Financial Statements and Internal Control over Financial Reporting Audit",
  "type": "report",
  "url": "http://www.oig.dhs.gov/assets/Mgmt/2014/OIG_14-60_Mar14.pdf",
  "year": 2014,
  "@context": {
    "id": "http://purl.org/dc/elements/1.1/identifier",
    "type": "http://purl.org/dc/terms/type",
    "title": "http://purl.org/dc/terms/title",
    "link": "http://www.w3.org/ns/dcat#downloadURL",
    "format": "http://purl.org/dc/terms/format",
    "published": "http://purl.org/dc/terms/issued"
  }
}
```
Instead of basically creating a new JSON document with new, standard keys (which is what has been proposed so far), JSON-LD maps whatever keys you are currently using to canonical URLs. So, if one document maps `full_name` to http://xmlns.com/foaf/0.1/name and another maps `fn` to http://xmlns.com/foaf/0.1/name, then I know that both terms mean the same thing, and that the term is defined at http://xmlns.com/foaf/0.1/name.
Writing this `@context` JSON-LD version was as much work as writing my previous `_item` version.
With JSON-LD, you don't need to look up, "What properties does Eric's spec support?" and if the property you want to use isn't supported, you don't need to convince the maintainer to add it. You just use any of the available terms on the internet (defined by RDF).
The disadvantage is that, instead of simply reading `_item.title`, you need to look up which property is mapped to http://purl.org/dc/terms/title in the `@context` and then access that property. This is very little work, though, and there are JSON-LD libraries that will do it for you.
The JSON-LD system allows you to define terms for the other properties like `agency`, set up a permanent URL for each, and as such mark up every property in your JSON document. The alternative of explicitly defining everyone's needed properties in one spec is infeasible.
After reading some of the comments on http://rssjs.org/, it does seem that something like Atom might actually fit the bill here. Someone on that site said "JSON is not just XML with curly braces". Specifically, supporting namespaces in XML allows you to merge two documents trivially, and Atom seems to be built with extensibility in mind.
Of course, with my semantic web background I'm seriously considering telling you to use RDF. There's even RDF JSON!
Okay, so I took a shot at an RSS to RSS-JSON converter. It feels a bit clunky and unnatural. I have it hosted on Heroku; you can try it here: http://hiss.0-z-0.com/

And here's the GitHub repo for the code: https://github.com/audiodude/hiss/
> Writing this @context JSON-LD version was as much work as writing my previous _item version.
>
> With JSON-LD, you don't need to look up, "What properties does Eric's spec support?" and if the property you want to use isn't supported, you don't need to convince the maintainer to add it. You just use any of the available terms on the internet (defined by RDF).
>
> The disadvantage is that, instead of simply reading _item.title, you need to lookup which property is mapped to http://purl.org/dc/terms/title in the @context and then access that property. This is very little work though, and there are JSON-LD libraries that will do it for you.
While writing this may have been easy enough, parsing is more of a challenge. Having to depend on a JSON-LD library to resolve external URIs gives away a lot of the benefit I described above of using straight JSON - a one-to-one mapping between high-level language primitives and the data format. I'm not saying that all great specs have this property; that's not possible. But for something that aims for RSS-like simplicity for a JSON audience, I do believe ease of use without any support libraries is possible and desirable.
To explain my perspective a bit, I wrote a couple of libraries for publishing WebFinger endpoints: sinatra-webfinger and jekyll-webfinger. For both of them, I made a small YAML file that mapped bare string keys to the fully qualified URNs the spec demands. That way, anyone using jekyll-webfinger or sinatra-webfinger can specify the fields they want to publish without having to research or remember any formal URNs, because that's tough for people and would discourage them from setting up an endpoint.
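The same friendly-keys-to-formal-URIs idea can be sketched in Python rather than YAML (the two rel URIs below are real WebFinger link relations, but the mapping and helper are hypothetical illustrations, not the actual sinatra-webfinger/jekyll-webfinger code):

```python
# Users write short, memorable keys; the library expands them to the
# formal URIs the WebFinger spec demands.
FRIENDLY_TO_REL = {
    "avatar": "http://webfinger.net/rel/avatar",
    "profile": "http://webfinger.net/rel/profile-page",
}

def expand_links(config):
    """Expand friendly keys into JRD-style "links" entries."""
    return [
        {"rel": FRIENDLY_TO_REL[key], "href": href}
        for key, href in config.items()
        if key in FRIENDLY_TO_REL
    ]
```

The design choice is the point: nobody setting up an endpoint should need to research or remember URNs.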
I want anything that comes out of this to be extremely easy to drop in place, and extremely easy to parse. I actually really like the last solution I proposed that came out of your pushing, @jpmckinney - I'm going to start a new thread, with just the people who've participated so far, and see where it goes.
If using JSON-LD, you don't need to resolve URIs, and you don't need a JSON-LD library. All you need is to be able to compare URIs in the data to the known URIs that you care about. If I care about titles, then I will look for the URI that equals http://purl.org/dc/terms/title to find the property that contains the title.
WebFinger is meant for much greater adoption than what we're talking about here. While the goal for WebFinger may be for anyone to be able to set up an endpoint, I think it's unrealistic to expect everyone to be running scrapers and APIs. So, let's not hold this to your WebFinger test. Programmers are used to looking up API URLs; looking up term URLs is not a special challenge. That's actually what WebFinger makes you do if you aren't using your Ruby Jekyll or Sinatra apps, and I assume most WebFinger adopters aren't using those - and yet they succeed in adopting it.
Anyway, I'm not entirely convinced that adding `@` in front of standard terms is better than putting all the standard terms inside a new object like `_item` in my previous, non-JSON-LD example: https://github.com/konklone/rss-json/issues/1#issuecomment-40386555
I've sketched out my proposal for a JSON version of RSS, extended to support additional arbitrary metadata. So far it's just called RSS-JSON.
I am envisioning this as:
If you remember Twitter's beautiful (and now dead) proposal for Annotations, this is kind of what I'm going for here.
This proposal is partly inspired by writing (and seeing others write) many different scrapers for documents and associated metadata, and no easy way to get them all syndicated together without losing data fidelity (which is what RSS as written would do).
It's also partly inspired by my experience building Scout, a search engine over federated APIs, which would benefit from this syndication. When Scout searches across everything, it converts various external APIs into an RSS-JSON-like mini-format, then renders search results from that. This is highly flexible, though having to manage all those adapters is a pain.
I'm copying some of it from the README below. I want to emphasize that I am winging this, and don't claim to be an expert on prior art in JSON syndication. I am looking to build rough consensus, not declare something I know to be the best.
The `source`/`type`/`id` fields may be weighty, but here's my first pass at what a proposed item in RSS-JSON might look like. Another one might look like:
In the above examples, there are enough top-level fields to support typical RSS-like uses, but anyone interested in processing specific kinds of documents can dig into the `data` object and go for it. `data` is fair game for anyone. Create sub-specs for your use cases, start shipping them places. Your community wants to put JSON-LD in there for their work? Sounds great.

This proposal doesn't address the idea of a "channel", and my thoughts on IDs vs URLs may be a distraction. I do think it'd help to take a markedly more lightweight approach to namespacing than XML does.
What do people think of the core idea? Is this something you could see using? Worth taking it further?
/cc @adelevie @jpmckinney @vzvenyach @benbalter @seanherron @waldoj @maxogden @joshdata