tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0

HTTP normalization #2

Open krizhanovsky opened 10 years ago

krizhanovsky commented 10 years ago

We need at least minimal HTTP request normalization (like ngx_http_parse_complex_uri() does in Nginx). The normalization must be implemented as part of the current HTTP FSM (to avoid double processing as in Nginx) and write the normalized fields to the appropriate parts of TfwHttpReq.

Normalization depending on the back-end server personality must also be done. However, this is very customizable and expensive logic, so it should be possible to switch the functionality off. Thus it must be implemented in http_norm.h as pluggable HTTP FSM states.

Basically, we shouldn't do the normalization if Cache-Control: no-transform is present.

Depends on #902. Linked with #1207.

krizhanovsky commented 10 years ago

Motivation

#628 implements strict alphabet validation. However, if an alphabet prohibits space (0x20) in the URI, a request can bypass the validator using simple hex encoding: GET /foo%20bar HTTP/1.1. One dangerous real-life example is a response splitting attack:

    /redir_lang.jsp?lang=foobar%0d%0aContent-Length:%200%0d%0a%0d%0a
    HTTP/1.1%20200%20OK%0d%0aContent-Type:%20text/html%0d%0a
    Content-Length:%2019%0d%0a%0d%0a<html>Shazam</html>

Once the back end percent-decodes the parameter, the injected %0d%0a sequences become CRLF pairs which terminate the original response headers and append a complete attacker-controlled response.

Allowed characters (bytes) must be taken from the same configuration options as for #628.

The encodings themselves must be validated; see for example validate_url_encoding() from ModSecurity's apache2/re_operators.c.
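
For illustration, a minimal userspace sketch of such a check (the function name and interface here are illustrative, not the ModSecurity code): every % must be followed by exactly two hexadecimal digits.

    #include <ctype.h>
    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Validate percent encoding only (no decoding): every '%' must be
     * followed by exactly two hexadecimal digits.
     */
    static bool
    url_encoding_valid(const char *s, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++) {
            if (s[i] != '%')
                continue;
            if (i + 2 >= len)
                return false; /* truncated escape, e.g. "%4" */
            if (!isxdigit((unsigned char)s[i + 1])
                || !isxdigit((unsigned char)s[i + 2]))
                return false; /* malformed escape, e.g. "%zz" */
            i += 2;
        }
        return true;
    }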

Traffic normalization for intrusion detection is well studied; see for example Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics for L3-L4 NIDS.

HPACK & QPACK

The Huffman decoder and encoder should be reviewed: at the moment we use a 1-character decoding table which shows better performance than the nghttp2 and Nginx decoders (https://github.com/tempesta-tech/tempesta/pull/1176#discussion_r257840173). However, LiteSpeed uses large tables and batching to speed up Huffman encoding and decoding. Probably the allowed characters (in the sense of #628), already decoded (in the sense of this issue), can be encoded into the large table. Also see #1207.


Modes of operation

To avoid hurting performance in cases which don't require strong security, the feature should be configurable per-vhost and per-location in the same sense as #688.

The transformation logic (as described in RFC 7230 section 5.7.2) for cookies and the URI must be controlled by configuration options (see also #902):

    http_norm <uri|cookie>
    content_security_mode <strict|transform|log>

For example:

    http_norm uri cookie;
    content_security_mode strict;

The following checks and transformations must be done:

Additional alphabets must be introduced to validate the strings after all the decodings. These alphabets may prevent double percent encoding (e.g. %2541, which is essentially %41 after the first hex decoding and A after the second) by prohibiting %.

Path normalization must be executed after string decoding, e.g. the path /a/b/abba/%2e%2e/abba must first be decoded and only then have the .. segment removed. The allowed alphabets must also be verified after the decodings to block messages with CR, LF, or a zero byte (see the sketch below).
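
A minimal userspace sketch of this ordering, assuming the path starts with / and using a hypothetical hard-coded post-decoding alphabet (a real implementation would take the bitmap from the per-vhost #628 configuration): percent escapes are decoded in place, each decoded byte is checked against the post-decoding alphabet, and only then are dot segments removed.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical post-decoding alphabet: prohibit bytes that must
     * never survive decoding; '%' blocks double percent encoding
     * like %2541. */
    static bool
    post_decode_byte_ok(unsigned char c)
    {
        return c != '%' && c != '\r' && c != '\n' && c != '\0';
    }

    static int
    hex_val(unsigned char c)
    {
        if (c >= '0' && c <= '9')
            return c - '0';
        if (c >= 'a' && c <= 'f')
            return c - 'a' + 10;
        if (c >= 'A' && c <= 'F')
            return c - 'A' + 10;
        return -1;
    }

    /*
     * Percent-decode @p in place, validate the decoded bytes, then
     * remove "." and ".." segments (simplified RFC 3986
     * remove_dot_segments; the path is assumed to start with '/').
     * Returns the new length, or -1 on a malformed escape or a
     * prohibited decoded byte. E.g. "/a/b/abba/%2e%2e/abba" decodes
     * to "/a/b/abba/../abba" and normalizes to "/a/b/abba".
     */
    static long
    normalize_path(char *p, size_t len)
    {
        size_t r, w = 0;

        /* Pass 1: decode escapes; the result never grows. */
        for (r = 0; r < len; ) {
            unsigned char c = p[r];

            if (c == '%') {
                int hi, lo;

                if (r + 2 >= len || (hi = hex_val(p[r + 1])) < 0
                    || (lo = hex_val(p[r + 2])) < 0)
                    return -1; /* malformed escape */
                c = (hi << 4) | lo;
                r += 3;
            } else {
                r++;
            }
            if (!post_decode_byte_ok(c))
                return -1; /* e.g. decoded CR, LF or '%' */
            p[w++] = c;
        }
        len = w;

        /* Pass 2: remove dot segments; each iteration consumes one
         * "/<segment>". */
        for (r = 0, w = 0; r < len; ) {
            size_t seg = r + 1, end = seg;

            while (end < len && p[end] != '/')
                end++;
            if (end - seg == 1 && p[seg] == '.') {
                /* "/." segment: drop it. */
            } else if (end - seg == 2 && p[seg] == '.'
                       && p[seg + 1] == '.') {
                /* "/.." segment: pop the previous segment. */
                while (w > 0 && p[w - 1] != '/')
                    w--;
                if (w > 0)
                    w--;
            } else {
                memmove(p + w, p + r, end - r);
                w += end - r;
            }
            r = end;
        }
        if (!w)
            p[w++] = '/';
        return w;
    }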

Implementation requirements

If none of the normalization options is specified, then the HTTP parser must not perform detailed processing and must just validate the allowed alphabet as now, i.e. there must be zero performance overhead if normalization isn't required by the configuration.

In the log and strict modes all the decoders must copy the observed string into a separate buffer, because we need to forward the original percent-encoded URI. Since all the encodings are larger than the original data, the content_security_mode=transform mode must recode the decoded string in place, rewriting the original string. skb fragmentation should be used to handle the data gap between the shortened URI and the HTTP/ part. The fragmentation must be done only once, when all the decoders finish. There must be a fallback to full data copying if the number of fragments per buffer (#498) grows above a compile-time constant.

The normalization must be done before the cache processing to minimize the number of different URI keys stored in the cache.

Since it's undesirable to grow the current HTTP parser's set of states, the logic must be implemented in a pluggable (external) FSM entered by a conditional unlikely jump (there is no need to support the compilation directive any more).
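
A minimal sketch of such a hook, with hypothetical names (struct http_parser, http_norm_run(), parse_uri_chunk() are illustrative, not the Tempesta code): the normalization FSM sits behind a single unlikely branch, so the hot path pays only one predictable untaken jump when normalization is off.

    #include <stdbool.h>

    #define unlikely(x)	__builtin_expect(!!(x), 0)

    enum norm_res { NORM_OK, NORM_BLOCK };

    /* Hypothetical parser context; the flag is set from the
     * per-vhost/per-location configuration at parser setup. */
    struct http_parser {
        bool norm_enabled;
        /* ... the usual parser state ... */
    };

    /* Stub for the external normalization FSM (http_norm.h in the
     * real design); it would run the decoders described above. */
    static enum norm_res
    http_norm_run(struct http_parser *p, const char *data,
                  unsigned long len)
    {
        (void)p; (void)data; (void)len;
        return NORM_OK;
    }

    static inline enum norm_res
    parse_uri_chunk(struct http_parser *p, const char *data,
                    unsigned long len)
    {
        /* Hot path: plain allowed-alphabet validation as today. */
        /* ... */

        /* Cold path: enter the pluggable FSM only if configured. */
        if (unlikely(p->norm_enabled))
            return http_norm_run(p, data, len);

        return NORM_OK;
    }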

Also, please fix the TODO for the URI abs_path for more accurate filtering of injection attacks, e.g. it'd be good to be able to prohibit / in the query string.

SIMD

There are SIMD implementations of UTF-8 validation and recoding (e.g. to/from UTF-16 or UTF-32). See for example:

  1. https://github.com/lemire/fastvalidate-utf-8 and the paper https://arxiv.org/pdf/2010.03090.pdf
  2. https://r-libre.teluq.ca/2400/3/Transcoding%20Billions%20of%20Unicode%20Characters%20per%20Second%20with%20SIMD%20Instructions.pdf
  3. https://nullprogram.com/blog/2017/10/06/
  4. Adventures in SIMD-Thinking (part 2 of 2) - Bob Steagall - CppCon 2020, UTF-8 to UTF-32 Conversion Algorithm

However, it probably makes sense to sacrifice SIMD in order to do percent-decoding, UTF-8 validation, validation of the allowed character sets (in the sense of #628), and the transformations (path or arguments) in a single pass.
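
If SIMD is kept for the plain allowed-alphabet check, the classic pshufb nibble-bitmap technique used by matchers of this kind validates 16 bytes per iteration. A minimal userspace sketch, assuming an ASCII-only allowed set supplied as a 16-byte bitmap (a real implementation would build the bitmap from the #628 configuration and, in kernel context, guard the FPU state); compile with -mssse3:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <tmmintrin.h>

    /* bitmap[lo] has bit h set iff byte (h << 4) | lo is allowed,
     * h < 8; bytes >= 0x80 are always rejected. */
    static bool
    bytes_allowed(const uint8_t bitmap[16], const uint8_t *s, size_t len)
    {
        const __m128i bm = _mm_loadu_si128((const __m128i *)bitmap);
        /* 1 << h for high nibbles 0..7, zero for 8..15 so that
         * non-ASCII bytes never match. */
        const __m128i bits = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                           0, 0, 0, 0, 0, 0, 0, 0);
        const __m128i lo_mask = _mm_set1_epi8(0x0f);
        size_t i = 0;

        for ( ; i + 16 <= len; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            __m128i lo = _mm_and_si128(v, lo_mask);
            __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), lo_mask);
            /* rows[j] = bitmap[low nibble of s[i+j]],
             * cols[j] = 1 << (high nibble of s[i+j]). */
            __m128i rows = _mm_shuffle_epi8(bm, lo);
            __m128i cols = _mm_shuffle_epi8(bits, hi);
            __m128i hit = _mm_and_si128(rows, cols);
            /* A zero byte in @hit means a prohibited input byte. */
            __m128i bad = _mm_cmpeq_epi8(hit, _mm_setzero_si128());
            if (_mm_movemask_epi8(bad))
                return false;
        }
        /* Scalar tail for the last len % 16 bytes. */
        for ( ; i < len; i++) {
            uint8_t c = s[i];
            if (c >= 0x80 || !(bitmap[c & 0x0f] & (1 << (c >> 4))))
                return false;
        }
        return true;
    }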

Tests and docs

Please update the https://github.com/tempesta-tech/tempesta/wiki/Web-security wiki page on finishing the task.

Further possible extensions

We leave back-end server personality normalization for further development, if there are any real requests for it. Probably it won't be needed, since we're going to provide full web server functionality and leave the really heavy processing logic to dedicated WAF solutions.

HTTP responses also aren't normalized: we target filtering of the initial attacks instead of filtering their consequences.

Also, the set of decoders is very restricted, e.g. there is no lowercase conversion, Microsoft %U decoding, or Unicode normalization, so please keep in mind possible further extensions of the decoder.

krizhanovsky commented 1 year ago

The issue was wrongly closed