huginn / huginn

Create agents that monitor and act on your behalf. Your agents are standing by!
MIT License

Should Data Output Agent use CDATA for XML feeds? Or not escape at all? #456

Closed spdustin closed 8 years ago

spdustin commented 10 years ago

In the Data Output Agent, when receiving an XML feed of events, special HTML characters are escaped in the body of an XML element. Should this show the raw HTML, or perhaps wrap it in a CDATA section?

cantino commented 10 years ago

Interesting. What would you expect?

spdustin commented 10 years ago

I guess, if I had the choice, I'd like the choice. I know, I know, too many choices are bad for systems, but there are times when I would want to use this to construct the precise XML for the feed when given a partial bit of XML output from another event. Others when I'd want it escaped.

As for which escaping method: when I'm working with XML data, I prefer content that shouldn't be parsed for entities to be in a CDATA block. My assertion is that, when not including the schema or DTD of the resultant XML, any entities (even the normal sane ones: <, >, &, ' and ") should be declared. The spec suggests this for those predefined entities:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">

It's pedantic, sure, but some processors are pedantic (or poorly written) enough to complain. Maybe that's just because I had to work with such a poorly written quirky processor in a past project :grimacing:

The simplest way to avoid running afoul of such things is just to stick the whole content into a <![CDATA[ ]]> section and be done with it, whether or not there are special characters in the payload that would otherwise need escaping. Let the processor handle the content (with whatever logic the user has set up in their code or XSL stylesheets).

I don't know of the correct way to handle this in Ruby, though. My instinct is just to use Nokogiri's builder.cdata method to wrap the whole payload up after liquid interpolation of each element, but I don't know if there's a helper for Rails to just do that for you - by specifying a formatter maybe?
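For illustration, the two options discussed here (entity-escaping versus CDATA wrapping) can be sketched in plain Ruby without any gems. This is only a sketch under assumptions: `escape_text` and `cdata_wrap` are hypothetical helper names, not part of Huginn, Rails, or Nokogiri, and the sample payload is made up.

```ruby
# Hypothetical helpers for illustration; neither is part of Huginn or Nokogiri.

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Entity-escape markup characters so the payload is read back as literal text.
def escape_text(text)
  text.to_s.gsub(/[&<>]/, XML_ESCAPES)
end

# Wrap the payload in a CDATA section instead of escaping it.
# A CDATA section ends at the first "]]>", so any "]]>" inside the payload
# is split across two adjacent sections.
def cdata_wrap(text)
  "<![CDATA[" + text.to_s.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

payload = "<p>Fish & chips</p>" # made-up sample content
puts "<description>#{escape_text(payload)}</description>"
puts "<description>#{cdata_wrap(payload)}</description>"
```

The one case where wrapping is not a pure pass-through is a payload that itself contains `]]>`, which is why the sketch splits that sequence rather than emitting it verbatim.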

cantino commented 10 years ago

Assuming RSS/ATOM work well with CDATA, if you want to try and add the CDATA blocks, that would be fine with me. You could also add an option to the DataOutputAgent to skip escaping, if you'd like.

lienas commented 9 years ago

Hi, I am currently testing Huginn as a replacement for Yahoo Pipes. I want to fetch different RSS feeds, filter them for relevant keywords, and create a merged RSS feed to publish on our website. So far, so good. My problem is the following: the description of the fetched feeds is inside a CDATA block. When using the Data Output Agent, the CDATA tag gets escaped. Instead of <![CDATA[ I get &lt;![CDATA[. How do I get that fixed? Thanks in advance! Thomas

cantino commented 9 years ago

You could either strip them out in an EventFormattingAgent, or we could try to remove them automatically in the RSSAgent. Does #957 work better for you by any chance?
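As a rough picture of what "stripping them out" means, here is a plain-Ruby sketch that removes CDATA wrappers from a string and keeps the inner content. It is an assumption-laden illustration: `strip_cdata` is a hypothetical helper name, and this is not the EventFormattingAgent's configuration or the change made in #957.

```ruby
# Hypothetical helper for illustration; not part of Huginn.
# Removes every <![CDATA[ ... ]]> wrapper and keeps what was inside it.
def strip_cdata(text)
  # Non-greedy match so that adjacent sections are unwrapped one by one;
  # the /m flag lets "." span line breaks inside a section.
  text.to_s.gsub(/<!\[CDATA\[(.*?)\]\]>/m) { Regexp.last_match(1) }
end

description = "<![CDATA[<p>Hello</p>]]>" # made-up sample content
puts strip_cdata(description)
```

Once the wrapper is gone, the content is escaped like any other text when it is written back out, so the feed reader sees the inner HTML rather than a literal CDATA marker.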

lienas commented 9 years ago

Hi Andrew,

RSS Agent works perfectly for me. Thanks.

Another question: how can I filter on more than one field? Currently my filter works for the title; I would also like to filter on description and categories. The filter should work with a logical OR.

Thomas


cantino commented 9 years ago

Did you use that pull request, or just the normal RSSAgent directly?

What does your filter look like now?

lienas commented 9 years ago

I used the RSS Agent. Currently the filter (TriggerAgent) looks like this:

{
  "expected_receive_period_in_days": "1",
  "keep_event": "true",
  "rules": [
    {
      "type": "regex",
      "value": "cloud|saas|iaas|sourcing|outsourcing|provider|offshore|nearsshore|apple|IT-Services",
      "path": "title"
    }
  ],
  "message": "{{description}}"
}

I would also like to filter inside description and category, so that the filter works this way: find feeds where title or description or category contains cloud or saas or ...

lienas commented 9 years ago

Hello, I found a solution, but I am not sure if it is best practice. I use multiple TriggerAgents and forward all the events filtered by these agents to the "De Duplication Agent".

Is there a better way ?

cantino commented 9 years ago

Sounds reasonable. What does the Diagram look like?

lienas commented 9 years ago

Here we are: [screenshot: agent outsourcing de 2015-10-01 09-50-27]

This approach works fine, but I think the ability to filter with the same set of keywords inside the TriggerAgent would be the better approach. This is because the keywords for "Titel" and "Beschreibung" (title and description) are the same here. When I decide to modify the set of keywords, I must change them in multiple places.

I am missing the links in this diagram. It is a production environment, hosted on our own server.

cantino commented 9 years ago

Thanks for sharing, that makes sense. I've added support for partial matches by setting must_match to something less than the total number of rules (like 1, for OR) in #1056.
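To make the OR behaviour concrete, here is a small plain-Ruby model of "trigger when at least must_match rules match". It is a sketch, not the TriggerAgent's actual code: `rule_matches?` and `triggered?` are hypothetical names, case-insensitive matching is an assumption, and the sample event is made up. The keyword list is the one from the filter posted above, reused for all three fields.

```ruby
# A model of the behaviour described above, not Huginn's implementation.

KEYWORDS = "cloud|saas|iaas|sourcing|outsourcing|provider|offshore|nearsshore|apple|IT-Services"

# One regex rule per field, all sharing the same keyword list.
RULES = %w[title description category].map do |path|
  { "type" => "regex", "value" => KEYWORDS, "path" => path }
end

# True when the event field named by the rule matches the rule's regex.
# Case-insensitive matching is an assumption made for this sketch.
def rule_matches?(event, rule)
  Regexp.new(rule["value"], Regexp::IGNORECASE).match?(event[rule["path"]].to_s)
end

# Fires when at least must_match rules match; the default requires all of them.
def triggered?(event, rules, must_match = rules.length)
  rules.count { |rule| rule_matches?(event, rule) } >= must_match
end

event = { # made-up sample event
  "title" => "Weather report",
  "description" => "New SaaS offering announced",
  "category" => "news"
}

puts triggered?(event, RULES, 1) # OR: one matching field is enough
puts triggered?(event, RULES)    # AND: every field must match
```

With a single agent holding all three rules, the keyword list lives in one place, which addresses the maintenance concern about editing several TriggerAgents.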

cantino commented 8 years ago

I'm going to close this issue. Please reopen if you want to discuss further or try building this!