If a validator prohibits some byte, e.g. space (0x20), in a URI, then a request can bypass it using simple hex encoding: `GET /foo%20bar HTTP/1.1`.
One dangerous real-life example is a response splitting attack:

```
/redir_lang.jsp?lang=foobar%0d%0aContent-Length:%200%0d%0a%0d%0aHTTP/1.1%20200%20OK%0d%0aContent-Type:%20text/html%0d%0aContent-Length:%2019%0d%0a%0d%0a<html>Shazam</html>
```
Allowed characters (bytes) must be taken from the same configuration options as for #628.
The encodings must be validated, see for example `validate_url_encoding()` from ModSecurity/apache2/re_operators.c.
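For illustration, a minimal sketch of such a check in the spirit of ModSecurity's validator (hypothetical helper names, contiguous buffer assumed; the real parser has to work over skb fragments):

```c
#include <stdbool.h>
#include <stddef.h>

static bool
is_hex(unsigned char c)
{
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
	       || (c >= 'A' && c <= 'F');
}

/*
 * Return true if every '%' in @s is followed by exactly two hex
 * digits, i.e. the string is a well-formed percent-encoded sequence.
 */
static bool
url_encoding_valid(const unsigned char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (s[i] != '%')
			continue;
		if (i + 2 >= len || !is_hex(s[i + 1]) || !is_hex(s[i + 2]))
			return false;
		i += 2; /* skip the validated hex digits */
	}
	return true;
}
```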
Traffic normalization for intrusion detection is well studied; see for example *Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics* for L3-L4 NIDS.
The Huffman decoder and encoders should be reviewed: at the moment we use a 1-character decoding table, which shows better performance than the nghttp2 and Nginx decoders (https://github.com/tempesta-tech/tempesta/pull/1176#discussion_r257840173). However, LiteSpeed uses large tables and batching to speed up Huffman encoding and decoding. Probably the allowed characters (in the sense of #628), already decoded (in the sense of this issue), can be encoded in the large table. Also see #1207.
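Purely illustrative, one possible shape of such a combined table entry (not the actual Tempesta or LiteSpeed layout):

```c
#include <stdint.h>

/*
 * Hypothetical entry of a combined Huffman decoding table: besides
 * the decoded symbol and the next DFA state, it carries the allowed
 * character-class bits (in the sense of #628), so a decoded byte can
 * be validated with a single AND against the location's class mask
 * instead of a second table lookup after decoding.
 */
typedef struct {
	uint8_t		next;	/* next decoder state */
	uint8_t		flags;	/* EMIT / ACCEPT / FAIL */
	uint8_t		sym;	/* decoded character */
	uint8_t		cclass;	/* character-class bits for validation */
} hf_entry_t;
```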
To avoid hurting performance in cases which don't require strong security, the feature should be configurable per-vhost and per-location in the same sense as #688.
The transformation logic (as described in RFC 7230 5.7.2) for cookies and URI must be controlled by a configuration option (see also #902):

```
http_norm <uri|cookie>
content_security_mode <strict|transform|log>
```

e.g.

```
http_norm uri cookie;
content_security_mode strict;
```
The following checks and transformations must be done:

- `url` - decode percent-encoded strings (double percent hex and first/second/double nibble hex aren't allowed). Messages with wrong hex strings (e.g. `http://foo.com/?%product=select 1,2,3`, where `%` isn't followed by 2 hex digits) must be blocked. Spaces may be represented in many ways, e.g. with `+` or `%20` (see the HTML URL Encoding Reference) - we don't need to do anything with this. RFC 3986 allows percent encoding in all parts of a URI, but it's unclear how to deal e.g. with a UTF-8 hostname, so we decode the URI abs_path only.
- `utf8` - validate the UTF-8 encoding: decode the percent-encoded bytes and validate them as UTF-8.
- `path` - remove path traversals like `/../` or `//` (see `ngx_http_parse_complex_uri()`) and translate `\` to `/`; see the sketch after this list.
- `pollution` (subject for #1276) - take the 1st polluted HTTP parameter for URI or POST in `content_security_mode=transform` mode. In validation mode (w/o the `content_security_mode=log` attribute) just build a map of the parameters and ensure that there is no pollution. In `content_security_mode=transform` mode rewrite the URI (available for URIs only); in `content_security_mode=strict` drop the request and write a warning. HTTP parameter fragmentation, e.g. `http://test.com/url?a=1+select&b=1+from&c=base`, is left for an application-side WAF.
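A simplified sketch of the `path` transformation (hypothetical helper, contiguous buffer, percent-decoding already applied; the real code must run over skb fragments):

```c
#include <stddef.h>
#include <string.h>

/*
 * Normalize @p of length @len in place: translate '\' to '/',
 * collapse "//" and resolve "." and ".." segments. Returns the new
 * length, or -1 if ".." tries to climb above the root.
 */
static long
path_normalize(char *p, size_t len)
{
	size_t r = 0, w = 0;

	while (r < len) {
		size_t seg, n;

		/* Skip separators ('\' is treated as '/'). */
		while (r < len && (p[r] == '/' || p[r] == '\\'))
			r++;
		seg = r;
		while (r < len && p[r] != '/' && p[r] != '\\')
			r++;
		n = r - seg;

		if (!n || (n == 1 && p[seg] == '.'))
			continue; /* empty or "." segment */
		if (n == 2 && p[seg] == '.' && p[seg + 1] == '.') {
			if (!w)
				return -1; /* escapes the root */
			/* Pop the previously written segment. */
			while (w && p[w - 1] != '/')
				w--;
			if (w)
				w--;
			continue;
		}
		p[w++] = '/';
		memmove(p + w, p + seg, n);
		w += n;
	}
	if (!w)
		p[w++] = '/';
	return w;
}
```

E.g. `/a/b/abba/../abba` normalizes to `/a/b/abba`, and `/..` is rejected with -1.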
Additional alphabets must be introduced to validate the strings after all the decodings. These alphabets may prevent double percent encodings (e.g. `%2541`, which is essentially `%41` after the first hex decoding and `A` after the second) by prohibiting `%`.
`path` must be executed after string decoding, e.g. the path `/a/b/abba/%2e%2e/abba` must be decoded and only then the `..` removed. Also, the allowed alphabets must be verified after the decodings to block messages with CR, LF, or a zero byte.
If none of the normalization options is specified, then the HTTP parser must not perform detailed processing and just validate the allowed alphabet as now, i.e. there must be zero performance overhead if normalization isn't required by the configuration.
All the decoders, in the `log` and `strict` modes, must copy the observed string to some buffer, because we need to forward the percent-encoded URI as-is. Since all the encodings are larger than the original data, the `content_security_mode=transform` mode must recode the decoded string in place, rewriting the original string. skb fragmentation should be used to handle the data gap between the shortened URI and the `HTTP/` part. The fragmentation must be done only once, when all the decoders finish. Fall back to full data copying if the number of fragments per buffer (#498) grows beyond a compile-time constant.
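Since a `%XX` triple always decodes to a single byte, in-place decoding is safe: the write pointer trails the read pointer. A minimal sketch under the same contiguous-buffer assumption (hypothetical helper, not existing Tempesta code):

```c
#include <stddef.h>

static int
hex_val(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if ((c | 0x20) >= 'a' && (c | 0x20) <= 'f')
		return (c | 0x20) - 'a' + 10;
	return -1;
}

/*
 * Percent-decode @s of length @len in place. The decoded form is
 * never longer than the encoded one, so the write pointer safely
 * trails the read pointer. Returns the new length, or -1 on a
 * malformed sequence ('%' not followed by two hex digits).
 */
static long
percent_decode_inplace(unsigned char *s, size_t len)
{
	size_t r, w = 0;
	int hi, lo;

	for (r = 0; r < len; r++) {
		if (s[r] != '%') {
			s[w++] = s[r];
			continue;
		}
		if (r + 2 >= len || (hi = hex_val(s[r + 1])) < 0
		    || (lo = hex_val(s[r + 2])) < 0)
			return -1;
		s[w++] = (hi << 4) | lo;
		r += 2;
	}
	return w;
}
```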
The normalization must be done before the cache processing to minimize the number of different URI keys stored in the cache.
Since it's undesirable to grow the current HTTP parser state set, the logic must be implemented in a pluggable (external) FSM entered by a conditional unlikely jump (no need to support the compilation directive any more).
Also, please fix the TODO for the URI abs_path for more accurate filtering of injection attacks, e.g. it'd be good to be able to prohibit `/` in the query string.
There are SIMD implementations of UTF-8 validation and recoding (e.g. to/from UTF-16 or UTF-32). However, it probably makes sense to sacrifice SIMD and do percent-decoding, UTF-8 validation, validation of allowed character sets (in the sense of #628), and the transformations (path or arguments) in a single pass.
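In the scalar single-pass variant, the allowed-set check reduces to a 256-bit bitmap test per decoded byte; a sketch (hypothetical `tfw_cset_t`, not existing Tempesta code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 256-bit bitmap of allowed bytes, in the sense of #628. */
typedef struct { uint64_t w[4]; } tfw_cset_t;

static inline bool
cset_test(const tfw_cset_t *cs, unsigned char c)
{
	return cs->w[c >> 6] & (1ULL << (c & 63));
}

/*
 * Validate decoded bytes against the allowed set. In a real
 * single-pass implementation this check would be fused into the
 * percent-decoding loop, so each byte is touched only once.
 */
static bool
cset_validate(const tfw_cset_t *cs, const unsigned char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (!cset_test(cs, s[i]))
			return false;
	return true;
}
```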
Please update the https://github.com/tempesta-tech/tempesta/wiki/Web-security Wiki page on finishing the task.
An example of a CRLF injection request that the normalization must block:

```
GET /vulnerabilities/xss_d/?default=/%0aSet-Cookie:crlf-injection HTTP/2
```
We leave back-end server personality normalization for further development, if there are any real requests for it. Probably it won't be needed, since we're going to provide full Web server functionality and leave really heavy processing logic to dedicated WAF solutions.
HTTP responses also aren't normalized - we target filtering the initial attacks rather than filtering their consequences.
Also, the decoder set is very restricted, e.g. there is no lower-case conversion, Microsoft %U decoding, or Unicode normalization, so please keep in mind possible further extensions of the decoders.
The issue was wrongly closed.
We need at least minimal HTTP request normalization (like `ngx_http_parse_complex_uri()` from Nginx does). The normalization must be implemented as part of the current HTTP FSM (to avoid double processing as in Nginx) and write the normalized fields to the appropriate parts of `TfwHttpReq`.
Normalization depending on the back-end server personality must also be done; however, this is very customizable and expensive logic, so it should be possible to switch the functionality off. Thus it must be implemented in `http_norm.h` as pluggable HTTP FSM states.
Basically, we shouldn't do the normalization if `Cache-Control: no-transform` is present.

Depends on #902. Linked with #1207.