grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0
25.71k stars 1.26k forks source link

Fix error_code tag values #1104

Open na-- opened 5 years ago

na-- commented 5 years ago

Quite a lot of errors now result in an error_code metric tag value of 1000, i.e. the most generic catch-all error code we have. In a recent sample of production errors, here are the aggregated counts of errors with 1000 for error_code:

Error string Count
connection error: PROTOCOL_ERROR 4
EOF 112
~http: ContentLength=??? with Body length~ ??? 2
~http: Request.ContentLength=??? with nil~ Body 586
http2: server sent GOAWAY and closed the connection; LastStreamID=???, ErrCode=NO_ERROR, debug="" 116
~http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error~ 3
net/http: HTTP/1.x transport connection broken: malformed HTTP response "???" 1
net/http: HTTP/1.x transport connection broken: malformed HTTP status code "to" 1
net/http: request canceled 475
net/http: request canceled while waiting for connection 169
stream error: stream ID ???; CANCEL 3
stream error: stream ID ???; PROTOCOL_ERROR 43
stream error: stream ID ???; PROTOCOL_ERROR; invalid header field value "???" 29
tls: first record does not look like a TLS handshake 1
unexpected EOF 70
x509: cannot validate certificate for ??? because it doesn't contain any IP SANs 2
x509: certificate has expired or is not yet valid 3
x509: certificate is valid for ???, not ??? 13
x509: certificate signed by unknown authority 61

The http: Request.ContentLength=??? with nil Body error is better described in https://github.com/loadimpact/k6/issues/1094, while the GetBody-mentioning error is already fixed in master by https://github.com/loadimpact/k6/pull/1093.

Also, it might be a good idea to rename the current errCode type and the constants, as discussed in https://github.com/loadimpact/k6/pull/1102#discussion_r310889060

na-- commented 5 years ago

Slightly connected to this, but we should see if we properly detect the too many open files error. It deserves its own dedicated error code, and a message pointing users to instructions how to increase their OS limits.

na-- commented 5 years ago

Something else strange is that the connection reset by peer error message includes the local and remote IPs and port numbers, something that should be overwritten by this code, but for some reason isn't: https://github.com/loadimpact/k6/blob/a2077dd22a86662ef3cd699bc57b0048136e51ba/lib/netext/httpext/error_codes.go#L137-L138

na-- commented 4 years ago

We don't handle error messages and error codes for request timeouts properly. Consider the following script:

import http from "k6/http";

export default function () {
    let resp = http.get("https://httpbin.org/delay/5", { timeout: 3000 });
    console.log(resp.error);
    console.log(resp.error_code);
}

with k6 v0.26.1, we'd get something like this:

WARN[0003] Request Failed                                error="Get https://httpbin.org/delay/5: context deadline exceeded"
INFO[0003] context deadline exceeded                    
INFO[0003] 1000   

and with k6 v0.26.0:

WARN[0003] Request Failed                                error="Get https://httpbin.org/delay/5: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
INFO[0003] net/http: request canceled                   
INFO[0003] 1000 

The difference in error message is probably caused by https://github.com/loadimpact/k6/pull/1261, but in both cases we don't handle the error code properly.

vishalkuo commented 3 years ago

I've opened this to take a stab at timeouts - arguably it seems like this issue references a more holistic refactor of error codes but timeouts happen frequently enough that I think they're worth fixing in the short term