Optimization for masked payload calculation

Hello @agentzh, a flame graph showed me a serious performance problem in the masked payload calculation. The protocol.lua source code has a TODO about optimizing this with string.buffer, but I could not find a string.buffer implementation in the LuaJIT API.

As a workaround, I used an FFI buffer together with ffi.string, and wrote the benchmark code below:

local bit = require("bit")
local ffi = require("ffi")
local str_char = string.char
local concat = table.concat
local byte = string.byte
local bxor = bit.bxor
local ffi_new = ffi.new
local ffi_string = ffi.string

local ok, new_tab = pcall(require, "table.new")
if not ok then
    new_tab = function (narr, nrec) return {} end
end

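-- the masking key must be a 4-byte string: string.byte() on a number would
-- read the digits of its decimal representation instead of the key bytes
local masking_key = str_char(0x0f, 0x3e, 0xca, 0x1d)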
local payload_len = 3200
local f = assert(io.open("./111.wav", "rb"))  -- fail early if the sample file is missing
local payload = f:read(payload_len)
f:close()

local count  = 100000

local function implement1()
    local bytes = new_tab(payload_len, 0)
    for i = 1, payload_len do
        bytes[i] = str_char(bxor(byte(payload, i),
                                byte(masking_key, (i - 1) % 4 + 1)))
    end
    local p = concat(bytes)
    return p
end

local function implement2()
    local buffer = ffi_new("char[?]", payload_len)
    for i = 1, payload_len do
        buffer[i-1] = bxor(byte(payload, i),
                                byte(masking_key, (i - 1) % 4 + 1))
    end
    local p = ffi_string(buffer, payload_len)
    return p
end

local function benchmark1()
    ngx.update_time()  -- refresh the cached time so the start timestamp is accurate
    local start_time = ngx.now()
    for i = 1, count do
        implement1()
    end
    ngx.update_time()
    ngx.say("=========benchmark1 cost:", (ngx.now()-start_time) * 1000, " ms.")
end

local function benchmark2()
    ngx.update_time()  -- refresh the cached time so the start timestamp is accurate
    local start_time = ngx.now()
    for i = 1, count do
        implement2()
    end
    ngx.update_time()
    ngx.say("=========benchmark2 cost:", (ngx.now()-start_time) * 1000, " ms.")
end

benchmark1()
benchmark2()

Running the benchmark code gives the result below:

$ /usr/local/openresty/bin/resty test.lua
=========benchmark1 cost:6322.9999542236 ms.
=========benchmark2 cost:720.00002861023 ms.

Over the 100000 iterations in the benchmark code, the FFI implementation is about 8 to 10 times faster (roughly 6323 ms versus 720 ms in the run above).
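
To check that the faster path really produces the same bytes, here is a minimal standalone sketch that compares the two approaches. It is an illustration under assumptions, not the library's code: the names mask_concat and mask_ffi are hypothetical, a generated 3200-byte string stands in for the .wav sample, and the FFI variant additionally reads the input through a uint8_t pointer (via ffi.cast) instead of calling string.byte, which may or may not help further. It needs LuaJIT and measures nothing, it only verifies equivalence.

```lua
-- Sketch only: mask_concat and mask_ffi are hypothetical names, not part of
-- lua-resty-websocket. Requires LuaJIT (for the ffi and bit modules).
local ffi = require("ffi")
local bit = require("bit")

local bxor = bit.bxor
local byte = string.byte
local str_char = string.char
local concat = table.concat
local ffi_new = ffi.new
local ffi_cast = ffi.cast
local ffi_string = ffi.string

-- Baseline: one Lua string per byte, joined with table.concat.
local function mask_concat(payload, masking_key)
    local bytes = {}
    for i = 1, #payload do
        bytes[i] = str_char(bxor(byte(payload, i),
                                 byte(masking_key, (i - 1) % 4 + 1)))
    end
    return concat(bytes)
end

-- FFI variant: read the payload through a uint8_t pointer and write into an
-- unsigned byte buffer, so only the final result becomes a Lua string.
local function mask_ffi(payload, masking_key)
    local len = #payload
    local src = ffi_cast("const uint8_t *", payload)
    local key = ffi_cast("const uint8_t *", masking_key)
    local dst = ffi_new("uint8_t[?]", len)
    for i = 0, len - 1 do
        dst[i] = bxor(src[i], key[i % 4])
    end
    return ffi_string(dst, len)
end

local masking_key = str_char(0x0f, 0x3e, 0xca, 0x1d)  -- 4-byte key string
local payload = string.rep("hello websocket ", 200)   -- 3200 bytes of sample data

local masked = mask_ffi(payload, masking_key)
assert(masked == mask_concat(payload, masking_key))  -- same bytes as the baseline
assert(mask_ffi(masked, masking_key) == payload)     -- masking twice restores the input
print("ok")
```

The second assertion relies on XOR masking being its own inverse, so the same function can serve for both masking and unmasking.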

openresty/lua-resty-websocket, issue #49