Marking filters as escape-safe

fabpot commented 14 years ago

Hello,

I wrote a custom filter that, in addition to doing the basic htmlspecialchars() type conversion, converts any \n to
. I named the filter "custom_br".

I'm using the autoescape functionality, and I'm pondering whether it was a good idea that certain filters could be marked as "escape-safe".

That is: currently I have to write in the template:

I feel that the "|safe" part is kind of unneeded, or somehow "wrong".

Comments?

Implementation:

Currently the getFilters() hook returns an array of filters like this:

return array( 'custom_br' => array('my_function_name', false) );

I suggest an alternative syntax:

return array( 'custom_br' => array( 'function' => 'my_function_name' # required 'safe' => true, # optional 'passEnv' => false, # optional ) ) );

This would allow adding further features, such as a 'deterministic' => true for adding performance.

AFAIK:

this change could be made backward-compatible (the old syntax would be deprecated).
this change would not affect runtime performance

fabpot commented 14 years ago

This has been fixed now. This was a bug. The filter returned value should never be escaped. Now, Twig inserts the escape filter as the first one in the filter chain. Also, all arguments of the filter are escaped by default.

Read the changeset (r182) documentation for more information, but the new behavior is safer and more sensible.

fabpot commented 14 years ago

I'm not sure whether it's a good solution to pre-escape data before sending it to custom filters.

It seems to me that r182 broke things. (Thus, marking as a major defect now.) For example:

... converts 1<2 to 1%26lt%3B2, which seems horribly incorrect.

Second, pre-escaping makes it difficult to add more similar filters, eg. 'jsencode' that would prepare javscript-safe strings:

alert('{{ msg|jsencode }}');

I'm still proposing my earlier solution. A similar solution has worked for me for ages. :) And it's backwards compatible.

fabpot commented 14 years ago

I have just committed a fix for that. In your examples, urlencode and jsencode act as escapers. In such cases, you can now set the is_escaper option to true to mark them as escapers and Twig won't apply the automatic escaping.

But in your first example, custom_br is not safe as it does not enforce escaping. The new behavior seems more correct to me, we first apply escaping and then call the filter.

I think that we now have a solid and configurable mechanism.

More information in the commit documentation (r186).

Tell me if it solves your problem.

fabpot commented 14 years ago

Thanks for the quick reply. Looks better now.

However, I still find it a bit counter-intuitive and not quite handy that we first escape and then apply some other filters. It's easy to think of other non-escaping filters that would be easier to write if they could work with the original string and not an escaped one.

For example, a 'trunc' filter would return only n first characters of the string.

With foo='1<2' and auto-escape mode, {{ foo|trunc(2) }} should return '1<', not '1&'.

The 'upper' filter should not produce '1<2'. (Thus, still marking this ticket a defect.)

A 'backwards' filter should give '2<1', not '2;lt&1'. And so on.

This all is, of course, in html mode. I think that (non-escaping) filters shouldn't have to be aware what the ultimate escaping scheme is - so that the 'trunc' filter would work with other modes as well.

Are there some very good reasons for pre-escaping versus post-escaping?

fabpot commented 14 years ago

I understand your concerns. The reasoning for the current implementation is that what needs to be escaped is the data coming from an untrusted source (mainly the variables from the context), but not the filter themselves.

Let's take a simple example:

{% set foo as "hello\n" %}

Here, the nl2br filter converts \n to
.

With pre-escaping (current), the output is <alert>hello</alert>
, which is what is expected (the foo variable is escaped, but the application of the nl2br filter is safe).

With post-escaping (previous), the output would have been hello>/alert>
or <alert>hello</alert> , depending on whether the nl2br has is_escaper set to true or false. In both cases, the result is wrong.

Another possibility is to pass the auto-escaping setting to the nl2br filter, so that it applies the automatic output escaping itself. But this puts a lot of additional work on the filter developer.

fabpot commented 14 years ago

Properly handling filters that output html is a bit complicated. Some characteristics of such filters:

* They must be aware of the escaping scheme (html)
* They have to take care of escaping themselves OR they must be fed with pre-escaped data.

I think you missed a bit of information from the original ticket text: my 'custom_br' filter does escape the data itself. (It can use the predefined twig escaper function to do that.) After that, it converts newlines to
tags. Everything ok, except the cosmetic thing of needing to use |safe although I know that custom_br always produces safe output.

If a developer marks a filter escape-safe, he/she has the responsibility to actually perform the escaping. I don't think it is a problem that the developer has to explicitly call an escaper in these cases. Thus, I'm still proposing this solution:

* if a filter is marked escape-safe, perform no automatic escaping
* otherwise, post-escape by default.

This should be ok for every case where auto-escape is enabled. But if it's not? As you said, this is a bit complicated, but could be cured by passing the auto-escape flag to the function.

Do you have any other solutions that would fix the 'trunc' or 'backwards' problem?

(One solution would be to feed 'safe' filters with escaped data and others with unescaped, but I think this isn't practical due to filter chaining.)

fabpot commented 14 years ago

Hello,

I see this one is still open, but has already been implemented.

I hope you could still reconsider the choice to escape the input before it goes to the filter.

The current decision is disadvantageous in that it makes writing escape-agnostic filters impossible: to me it seems that now there is no way of writing a generic |backwards filter, or even a |trunc filter!

Questions:

* what are the reasons to choose pre-escaping? 

* do you see problems with the solution I suggested in the ticket above?

fabpot commented 14 years ago

Actually there is a little problem with the original code: my version of nl2br (that does htmlspecialchars()) is safe in html escaping context, but not necessarily in other contexts, such as javascript.

So, a revised suggestion:

return array( 'custom_br' => array( 'function' => 'my_function_name' # required 'safe' => array('html'), # optional 'passEnv' => false, # optional ) ) );

literal commented 14 years ago

This issue is marked as a show-stopper for the 1.0-release together with the closely related issue #11.

The root cause for both issues seems to be the general difficulty of finding a consistent escaping mechanism.

Thus, I feel a more general evaluation of escaping and why it is so hard to do properly might be necessary. I can't offer a good solution yet, but maybe my thoughts on the topic might spark off a useful discussion.

The problem reminds me of the "magic quotes" horror in PHP itself.

[To those not familiar: PHP used to automatically MySQL-escape (not even standard SQL compatible) all recieved GET-, POST-, and COOKIE-data by default. So you could build SQL statements like "SELECT ... WHERE name='$name'" with $name coming from some form input without the risk of SQL injection or SQL syntax trouble through ' characters in $name itself. Of course this meant you had to unescape everything if you wanted to use it for anything else but SQL statements. E.g. echo 'You entered: ' . stripslashes($name); or else you might end up with output such as You entered: O\'Connor. Even worse: as "magic quotes" could be (and still can be) enabled/disabled in the PHP configuration, you would even have to check whether or not they were active. They fortunately don't default to "on" anymore.]

Now, magic quotes were a questionable attempt to protect the inexperienced or sloppy developer from SQL injection and unexpected SQL syntax errors. Just like auto-escaping in a template engine attempts to protect the template designer from HTML injection and invalid HTML.

Why have magic quotes been dropped (or now default to "off", rather)? Because nowadays we mostly use prepared statements, DB abstractions or ORMs which will take care of escaping literal values where necessary.

Back to (HTML-)escaping in Twig:

The equivalent of using prepared SQL statements might be to not have templates where HTML-tags and literal text are freely mixed. Instead one might consider a system where view rendering is done via a DOM tree object to which element nodes and text nodes can be added. When the (X)HTML-source would eventually be created from the DOM tree, it would automatically take care of escaping the text node and element attribute contents.

Now of course this would be a completely different approach and doesn't have much to do with Twig. Also it only applies to XML-data while Twig is also useful for plaintext emails, JavaScript or whatever.

But it highlights the real cause of the escaping trouble: The arbitrary mix of text content and tags in HTML templates.

An nl2br filter needs pre-escaped data because it creates tags which would be destroyed by post-escaping.
A truncate filter must run before escaping or else it might destroy HTML entities and cut off strings too early, because single characters are represented by multi-character entities.
An nl2br filter is HTML-specific while a truncate filter is output-format-agnostic, they just don't fit together very well.

If an HTML-specific filter had the possibility of marking out which parts of it's result are text and which are tags (  or   for nl2br), a general post-escaping scheme would be no problem. Only the text-parts would be escaped. But then of course, this would introduce a lot of complexity (filter results would be lists of different kinds of objects instead of plain strings)...

Let me sketch up a trickier example of HTML-related filter problems, which isn't even caused by escaping:

We want to display messages in a bulletin board system.
If the viewer did a search, we want to highlight all occurrences of the search term inside the message body (e.g. everything that contains foo is turned into foo).
Any URLs inside the message bodies shall be "linkified" (e.g. everything that starts with http:// is turned into <a href="URL">URL</a>).

Both tasks could be implemented as simple Twig-filters. Plain one-liners with regex-based string replacement. But they will never work together in peace.

[edit: github's markdown parser messed up the following example by an auto-link/escape combination inside the inline code sections, once again proving, how difficult escaping is... :-) I have now replaced "http" with "httx" and "www" with "wwx"]

Imagine this message: "Hey guys, you must check out httx://wwx.foobar.com/ !". Now if a viewer gets this message displayed as result of a search for "foobar":

Step 1, highlight search phrase: Hey guys, you must check out httx://wwx.foobar.com/ !

Step 2, linkify: Hey guys, you must check out <a href="httx://wwx.foobar.com/">httx://wwx.foobar.com/</a> !

The link won't work. Same result if the filters are applied in the reverse order.

Can you see how this is again related to the second filter not knowing what is text and what is tag?

Would it be an option to parse an input value for a filter and split it into tags, attribute-values and text if necessary? I'd say no, because a filter can't possibly know whether a tag was created by an earlier filter (and thus is an intended tag) or was part of the original value (and should usually be treated as literal text).

What I mean: In the example above, if there was message like "I've just learned about the <a href=""> HTML tag" the <a href=""> should be treated as literal text, be escaped to <a href=""> in the output and be marked as a search hit if someone searches for "<a href="">" (all provided there is no markup/markdown allowed in the messages).

Sorry for the length of this comment, but this really is a tricky topic...

fabpot commented 14 years ago

Thanks for your input. This is much appreciated as it adds useful information to the discussion... more food for thought.

literal commented 14 years ago

I have been using Twig some more in the meantime and I have also done some more thinking on the escaping topic.

So far I have resorted to not using autoescaping as I find it more or less useless in the present form. Instead my HTML templates are littered with countless |es.

As I pointed out above, the most basic problem is that general filters (e.g. string truncate) might choke on escaped input while HTML-specific filters (e.g. nl2br) will fail if their output is escaped.

So if you want to do {{ value | upper | nl2br }} escaping has to be done between the two filters: {{ value | upper | e | nl2br }}. You dont want & to become &AMP;, do you?

One could call htmlspecialchars() in the nl2br filter implementation and declare it an escaper to make this work with auto-escaping. This is more or less what the original poster suggested. But it is only a very poor work-around. It will not permit chaining of multiple HTML-specific filters and it will still not allow for a simple {{ value | upper }} as long as data is pre-escaped.

I propose to switch from pre-autoescaping to post-autoescaping. It will be easier to implement and understand as there can be many inputs to an expression but only one output. This would also mean to drop the distinction between literals and variable contents when it comes to escaping. The docs: "Automatic escaping is applied to filter arguments, except for literals". (And except results of other functions/filters I suppose...?) In the current form it is sometimes hard to understand what will be escaped where. So let's just escape the final expression result like a trailing |e would. Stuff like URL construction, translation, substring search/replace and all the current built-in filters would work with no worries in any escaping scheme.

So, where does that leave nl2br and the like? They would have to be declared special post-escaping filters. And that seems just logical to me. They are special, as they only work in an HTML context.

Any filter declared "post-escape" should be called after the implicit |e added by autoescaping.

Now what if someone wrote {{ value | nl2br | upper }} and autoescaping is on?

Twig could silently reorder the filters, but that might be unexpected and hard to debug if the template's author really meant to uppercase the escaped data.
Twig could escape before the first "post-escape" filter in the chain and thus pass the escaped and nl2br'ed string to upper. That would honor the authors intent but also make the problem hard to debug if the filter order was a mistake.
Twig could throw an exception. That's my favorite option. Not 'nice' but most straightforward. If anyone really wanted to uppercase the HTML-escaped string, they would have to disable autoescape and use an explicit |e.

PS: My suggestions do not solve the multiple HTML-specific filter conflict sketched above (linkify vs. mark search-hits). I have (almost) solved that by splitting the search-hit highlighting into a pre and post part. But I'm far from happy:

{{ messageBody | markSearchHits(searchPhrase) | e | linkify | nl2br | applyHitMarkers }}

markSearchHits puts {and } around the hits and escapes any literal curly braces and backslashes with a backslash. So it's not HTML-specific.

All {, }, \{, \} and \\ will survive HTML escaping.

applyHitMarkers removes any { or } between < and > (which can, after escaping has been done, only be link tags added by linkify), replaces the remaining {and } with HTML tags and unescapes the curly braces and backslashes prefixed by a backslash.

The remaining problems are:

A search for e. g. "http" will stop the linkifier from detecting a link as it will get something like ... {http}://....
A search for "visit httx://wwx" will result in invalid HTML tag nesting: visit <a href="httx://wwx.foo.bar">httx://wwx.foo.bar</a>.

arnaud-lb commented 14 years ago

I agree with you.

Now what if someone wrote {{ value | nl2br | upper }} and autoescaping is on?

An other possibility might be that Twig could do what the user asks for, i.e. pass value unescaped to nl2br, then pass the raw output of nl2br to upper. upper is not an escaping filter, so auto-post-escaping should be applied before printing. This is safe in all cases. The user should take care of using its filters in the correct order, {{ value | upper | nl2br }}.

Here are some related issues:

{ { "%s"|format(variable) } } outputs variable unescaped

The problem here is pre-escaping of the primary filter argument, without post-escaping. I proposed mostly the same escaping behavior than you :

don't do pre-escaping
use 'safe' / 'is_escaper' / etc to mark filters 'safe'
always apply post-escaping, except if
- the last filter of the chain is not an escaping filter
- or the value is a Node_Expression_Constant

Here are some examples : http://github.com/arnaud-lb/Twig/blob/436e946ad3b3ee73a589c03eb47aa98c6301a0a7/test/Twig/Tests/Fixtures/tags/autoescape/with_filters.test

fabpot commented 14 years ago

see https://github.com/fabpot/Twig/pull/158

twigphp / Twig

Marking filters as escape-safe #4