onli closed this issue 7 years ago
A consistent internal format that is capable enough to handle any schema may be nice. E.g. JSON-LD (maps to/from RDF). That gives blocks the capability of describing the data as it passes through the pipeline, in a standard way.
Selectors would then use (e.g.) JSONiq rather than CSS/XPath.
There could be a range of ingestion blocks that transform from freetext/CSV/JSON/RSS/Atom/iCal/... to JSON-LD, with all processing blocks operating on JSON-LD. And then output blocks to transform back to the desired output format.
That seems easier than writing e.g. extract to operate on text/CSV/RSS/...
Something like goodtables may be useful for ingestion of CSV.
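To make the ingestion/output idea a bit more concrete, here is a minimal sketch of a CSV ingestion block that emits JSON-LD. It is only an illustration: the function name `csv_to_jsonld`, the column names, and the mapping to schema.org terms are all assumptions, and nothing like this exists in the project as far as this thread shows.

```python
import csv
import io

# Hypothetical context: maps the CSV column names used below to schema.org IRIs.
CONTEXT = {
    "title": "https://schema.org/name",
    "link": "https://schema.org/url",
}


def csv_to_jsonld(text, item_type="https://schema.org/Thing"):
    """Ingestion block sketch: one JSON-LD node per CSV row."""
    rows = csv.DictReader(io.StringIO(text))
    # Each row becomes a node; the column names are kept as JSON-LD terms.
    graph = [{"@type": item_type, **row} for row in rows]
    return {"@context": CONTEXT, "@graph": graph}


doc = csv_to_jsonld("title,link\nFirst post,https://example.com/1\n")
```

Processing blocks would then only ever see the `@graph` nodes, whatever the original input format was.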
Hey, thanks for thinking with me through this. That's really helpful.
I do not think a new internal format would be the solution. What we currently have is RSS as an internal format (both Atom and jsonfeed get transformed into it). Every block except the download block produces it. The main enhancement I'd like to support is manipulating data (e.g. extracting it) from the content field of the RSS items, not just from the RSS feed as a whole.
If we had a new format that wrapped content, we'd still have the exact same problem - varying data in the content field that we can't directly address.
So, why not just operate with all blocks on the content? Because in some cases, one wants to extract data from the RSS feed itself, like the link in #7.
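A small sketch of the two cases, using Python's standard `xml.etree.ElementTree`. The sample feed and the plain `content` element name are assumptions for illustration only; real feeds may carry the payload in a different element.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the item content happens to hold JSON.
FEED = """<rss version="2.0"><channel><title>Example</title>
<item>
  <title>First</title>
  <link>https://example.com/1</link>
  <content>{"price": 12.5, "tags": ["a", "b"]}</content>
</item>
</channel></rss>"""

root = ET.fromstring(FEED)

# Feed-level extraction: a path expression can address the link directly.
links = [el.text for el in root.findall(".//item/link")]

# Content-level extraction: the path stops at the element boundary, so the
# payload comes back as one opaque string that the selector cannot look into.
contents = [el.text for el in root.findall(".//item/content")]
```

So a selector over the feed gets at the link, but whatever sits inside the content needs a second step that understands the format of that content.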
I see two routes currently:
I currently lean toward alternative 2, maybe with some additions, like the flow control blocks (the duplicate and combine blocks) also supporting non-RSS input.
The newest version addresses this. Several changes together improve the situation, and I think they remove the blocker I saw before. A list:
//item/content
, without any XML.item.content
on. That way, one can give a JSONPath to an extract block, enable the checkbox, and connect an RSS feed that contains JSON data in its items to it. The result is a new RSS feed, but with the extracted JSON data.
I will close this issue with that comment, but that is not to say this should get no further work or discussion.
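As a rough illustration of the workflow described above (a path applied to the JSON inside each item's content, yielding a new feed), here is a hedged sketch. It is not the project's implementation: `resolve_path` is a tiny stand-in that only handles dotted paths such as `$.offer.price`, not full JSONPath, and the function names, the sample feed, and the plain `content` element are assumptions.

```python
import json
import xml.etree.ElementTree as ET


def resolve_path(data, path):
    """Resolve a dotted path like '$.offer.price' (a small JSONPath subset)."""
    current = data
    for key in (k for k in path.lstrip("$.").split(".") if k):
        # List steps are numeric indexes, object steps are keys.
        current = current[int(key)] if isinstance(current, list) else current[key]
    return current


def extract_from_content(feed_xml, path):
    """Return a new feed whose item content is the value found at `path`."""
    root = ET.fromstring(feed_xml)
    for content in root.iterfind(".//item/content"):
        value = resolve_path(json.loads(content.text), path)
        # Strings are kept as-is, everything else is serialized back to JSON.
        content.text = value if isinstance(value, str) else json.dumps(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input feed with JSON in the item content.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><content>{"offer": {"price": 12.5}}</content></item>
</channel></rss>"""

new_feed = extract_from_content(SAMPLE, "$.offer.price")
```

The rest of each item (title and so on) stays untouched; only the content is replaced by the extracted value.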
Currently, all blocks except the download block generate RSS/XML and expect it as input. It would be helpful for some use cases if blocks could also work with less structured data. For example, each POST to a webhook could be an array item that the other blocks can work with. Downloaded text files could be treated line by line.
For this to work, all blocks would need to accept non-RSS input, and implicit RSS conversion (like with the webhook and extract block) would need to be disabled.
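One way to picture this is a plain list of items as the common shape, with no RSS conversion in between. The sketch below is only an assumption about how that could look; `items_from_text`, `WebhookBuffer`, and `filter_block` are hypothetical names, not existing blocks.

```python
def items_from_text(text):
    """Treat a downloaded text file line by line, skipping empty lines."""
    return [line for line in text.splitlines() if line.strip()]


class WebhookBuffer:
    """Collects each POST body as one array item."""

    def __init__(self):
        self.items = []

    def post(self, body):
        self.items.append(body)


def filter_block(items, needle):
    """Example downstream block that works on plain items instead of RSS."""
    return [item for item in items if needle in item]


hook = WebhookBuffer()
hook.post("build failed")
hook.post("build passed")
failed = filter_block(hook.items, "failed")
```

The same `filter_block` would work on the output of `items_from_text`, which is the point: blocks only depend on getting a list of items.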