Work with non-RSS data - Githubissues

onli commented 7 years ago

Currently, all blocks but the download block generate and expect RSS/XML as input. It would be helpful for some use cases if blocks could also work with less structured data. For example, each POST to a webhook could be an array item that the other blocks can work with. Downloaded text files could be treated line by line.

For this to work, all blocks would need to accept non-RSS input, and implicit RSS conversion (like with the webhook and extract block) would need to be disabled.

jean commented 7 years ago

A consistent internal format that is capable enough to handle any schema may be nice. E.g. JSON-LD (maps to/from RDF). That gives blocks the capability of describing the data as it passes through the pipeline, in a standard way. Selectors would then use (e.g.) JSONiq rather than CSS/XPath. There could be a range of ingestion blocks that transform from freetext/CSV/JSON/RSS/Atom/iCal/... to JSON-LD, with all processing blocks operating on JSON-LD. And then output blocks to transform back to the desired output format. That seems easier than writing e.g. extract to operate on text/CSV/RSS/... Something like goodtables may be useful for ingestion of CSV.

onli commented 7 years ago

Hey, thanks for thinking with me through this. That's really helpful.

I do not think a new internal format would be the solution. What we have currently is RSS as an internal format (and both Atom and jsonfeed get transformed into it). Everything but the download block produces it. The main enhancement I'd like to support is that we might want to manipulate (like extract) data from the content field of the RSS data, not just from the RSS feed as a whole.

If we had a new format that wrapped content, we'd still have the exact same problem - varying data in the content field that we can't directly address.

So, why not just operate with all blocks on the content? Because in some cases, one wants to extract data from the RSS feed itself, like the link in #7.

I see two routes currently:

1. Drop RSS as internal format. Have special blocks, or automatic detection, for additional supported formats, like pure JSON and CSV, and just not convert them into RSS. Blocks would need to support them as input everywhere, where applicable. All blocks that so far assume the content field as the one they get data from would need a field selection (similar to what Yahoo Pipes had).
2. Continue with RSS after input blocks. To add support for additional formats, add those at the input block level: pure text gets transformed into RSS feeds line by line, JSON by field extraction (as with HTML downloads currently), CSV probably by a mix of those. Everything after that stays RSS. We add additional blocks to manipulate data in the RSS items' content field, like regex-enabled string replacement.

I currently lean toward alternative 2, maybe with some additions like flow control blocks (the duplicate and combine block) also supporting non-RSS input.
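
To make alternative 2 concrete, here is a minimal Python sketch of the idea, not the actual pipes code. It assumes the content field is a plain `<content>` element inside each `<item>`; the function names `text_to_rss` and `replace_in_content` are hypothetical and only serve the illustration. An input step turns pure text into a feed line by line, and a later block does regex-enabled string replacement on each item's content.

```python
import re
import xml.etree.ElementTree as ET


def text_to_rss(text, title="Untitled feed"):
    """Hypothetical input step: wrap each non-empty line of text in its own RSS item."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        item = ET.SubElement(channel, "item")
        # Assumption: the content field is modelled as a plain <content> element.
        ET.SubElement(item, "content").text = line
    return rss


def replace_in_content(rss, pattern, replacement):
    """Hypothetical processing block: regex replacement inside every item's content."""
    regex = re.compile(pattern)
    for content in rss.findall("./channel/item/content"):
        content.text = regex.sub(replacement, content.text or "")
    return rss


feed = text_to_rss("first line 1\n\nsecond line 2\n")
replace_in_content(feed, r"\d+", "#")
print(ET.tostring(feed, encoding="unicode"))
```

Everything after the input step only ever sees a feed, which is the point of this route: new formats need one new input conversion each, not changes to every block.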

onli commented 7 years ago

The newest version addresses this. Several changes together improve the situation, and I think they remove the blocker I saw before. A list:

- You can now give downloaded pure text to the rss builder block, and it will put each line into the content of a new feed item. This is supposed to enable the workflow of treating everything as a feed.
- For the other direction, if one appends .txt to the feed url, pipes will only output all //item/content, without any XML.
- There was still the problem of data being trapped in RSS feeds. But I realized the only block where that is problematic so far is the extract block. The solution was to give it an option to target the data in item.content. That way, one can give a JSONPath to an extract block, enable the checkbox, and connect to it an RSS feed that contains JSON data in its items. The result is a new RSS feed containing the extracted JSON data.
- Because yes, the extract block now can get a JSONPath expression :)
- And finally, to make all this a bit easier, in the UI the download block can only be connected to the pipe output, the extract block or a feed builder block.
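
The extraction from item content and the text output described in the list can be sketched roughly as follows. This is an illustration in Python, not how pipes implements it. The helper names `lookup`, `extract_from_content` and `to_text` are hypothetical, the dotted key path is a simplified stand-in for a real JSONPath expression, and the feed is assumed to keep its data in a plain `<content>` element per item.

```python
import json
import xml.etree.ElementTree as ET


def lookup(data, path):
    """Follow a dotted key path such as 'user.name' (simplified stand-in for JSONPath)."""
    for key in path.split("."):
        # Lists are indexed by number, objects by key.
        data = data[int(key)] if isinstance(data, list) else data[key]
    return data


def extract_from_content(rss, path):
    """Parse each item's content as JSON and build a new feed from the extracted values."""
    out = ET.Element("rss", version="2.0")
    channel = ET.SubElement(out, "channel")
    for content in rss.findall("./channel/item/content"):
        value = lookup(json.loads(content.text), path)
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "content").text = str(value)
    return out


def to_text(rss):
    """Text output: only the item contents, one per line, without any XML."""
    return "\n".join(c.text or "" for c in rss.findall("./channel/item/content"))


source = ET.fromstring(
    '<rss version="2.0"><channel>'
    '<item><content>{"user": {"name": "ada"}}</content></item>'
    '<item><content>{"user": {"name": "bob"}}</content></item>'
    "</channel></rss>"
)
print(to_text(extract_from_content(source, "user.name")))
```

The input and the output of the extraction are both feeds, so the extract step can sit anywhere in a pipe, and the text output is only a final rendering of the same items.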

I will close this issue with that comment, but that is not to say this should get no further work or discussion.

pipes-digital / pipes

Work with non-RSS data #8