msva / lua-htmlparser

An HTML parser for Lua.

String allocation exponential mistake #61

Closed. GitSparTV closed this issue 1 year ago.

GitSparTV commented 1 year ago

https://github.com/msva/lua-htmlparser/blob/2ce09f82a4244c243d9fd0abe6c38e20411912f7/src/htmlparser/ElementNode.lua#L70

This line is wrong. On the first iteration it stores "{" + empty separator + the first element of the set into s. On the next iteration it builds a brand-new string: "{" + first element + ", " + second element. The intermediate string grows on every iteration, so the total bytes copied grow quadratically with the element count, and on big files this just eats the whole RAM.
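
For context, the pattern on that line is the classic accumulate-into-a-string shape (the same shape as the __tostring_old function in the benchmark further down), roughly:

local s = "{"
local sep = ""
for k in pairs(set) do
    s = s .. sep .. tostring(k) -- allocates a brand-new string of the full current length on every pass
    sep = ", "
end
return s .. "}"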

Please rewrite it to something like this:

Set.mt.__tostring = function (set)
    local list = {}

    -- collect every element first...
    for k in pairs(set) do
        list[#list + 1] = tostring(k)
    end

    -- ...then concatenate once: linear in the output size
    return "{" .. table.concat(list, ", ") .. "}"
end
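
A minimal usage sketch, assuming Set.mt is the metatable the library attaches to its sets (element order depends on pairs):

local s = setmetatable({ a = true, b = true }, Set.mt)
print(tostring(s)) -- e.g. "{a, b}"; pairs() order is not guaranteed
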
msva commented 1 year ago

Can you please share some test files that lead to "eating the whole RAM", for testing purposes? (I'd like to test which of the concatenation algorithms I have in mind would be best.)

GitSparTV commented 1 year ago

I apologise: it's for sure exponential, but it probably doesn't eat the RAM entirely by itself; the culprit was probably my pretty-printing linter, which doesn't handle cyclic references in your library.

Your commit is definitely better than what was there before, and the way you removed the trailing comma by overwriting it with "}" is brilliant (sketched below). My version still allocates less memory, though.
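
For readers following along, the trick being praised is visible in the __tostring benchmark function further down: append ", " after every element, then overwrite the final separator slot in the table with the closing brace:

list[#list + 1] = tostring(k)
list[#list + 1] = ", "
-- ...after the loop, the last table entry is a trailing ", ":
list[#list] = "}" -- overwrite it with the closing brace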

Here is the GC benchmark (logarithmic graph, 10000 items, memory in MB):

[plot: lines are collectgarbage("count") in MB; circles mark where the default GC was triggered]

Red: old. Green: new commit. Purple: my suggestion.

Time benchmark (10000 items, 1000 iterations, time in seconds):

Old: 32.12795625
New commit: 0.457729306
My suggestion: 0.272079618

(That is roughly a 70x speedup for the new commit and 118x for my suggestion, relative to the old code.)

Environment: LuaJIT (Luvit engine). Code:

local uv = require("uv")

local memory_benchmark = {}
local actual_memory_benchmark_len = 0

-- preallocation
for i = 1, 10004 do
    memory_benchmark[i] = 0
end

local GCtriggers = {}
local actual_GCtriggers_len = 0

-- preallocation
for i = 1, 100 do
    GCtriggers[i] = 0
end

-- "new commit" variant: append ", " after every element, overwrite the last one
local function __tostring(set)
    local list = {"{"}

    for k in pairs(set) do
        list[#list + 1] = tostring(k)
        list[#list + 1] = ", "

        actual_memory_benchmark_len = actual_memory_benchmark_len + 1
        memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")
    end

    list[#list] = "}"

    return table.concat(list)
end

-- "my suggestion" variant: single table.concat with a separator
local function __tostring_mine(set)
    local list = {}

    for k in pairs(set) do
        list[#list + 1] = tostring(k)
        actual_memory_benchmark_len = actual_memory_benchmark_len + 1
        memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")
    end

    return "{" .. table.concat(list, ", ") .. "}"
end

-- "old" variant: repeated string concatenation (the reported problem)
local function __tostring_old(set)
    local s = "{"
    local sep = ""
    for k in pairs(set) do
        s = s .. sep .. tostring(k)
        sep = ", "

        actual_memory_benchmark_len = actual_memory_benchmark_len + 1
        memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")
    end
    return s .. "}"
end

local example = {}

for i = 1, 10000 do
    example[string.format("%05d", i)] = true
end

do
    local metatable

    local function CreateTrigger()
        -- a userdata given a metatable with __gc lets us observe each GC cycle
        debug.setmetatable(newproxy(), metatable)
    end

    metatable = {
        __gc = function()
            -- record the benchmark step at which this GC cycle fired,
            -- then re-arm by creating a fresh proxy carrying the same __gc
            actual_GCtriggers_len = actual_GCtriggers_len + 1
            GCtriggers[actual_GCtriggers_len] = actual_memory_benchmark_len
            CreateTrigger()
        end
    }

    CreateTrigger()
end

actual_memory_benchmark_len = actual_memory_benchmark_len + 1
memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")
collectgarbage()
collectgarbage()
actual_memory_benchmark_len = actual_memory_benchmark_len + 1
memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")
-- For the time benchmark, uncomment the timer lines and the loop below
-- (swap in __tostring or __tostring_old to compare the variants):
-- local hrtime = uv.hrtime
-- local timer_start = hrtime()
-- for i = 1, 1000 do
    __tostring_mine(example)
-- end
-- local timer_end = hrtime()

-- print("Time:", (timer_end - timer_start) / 1e+9)

actual_memory_benchmark_len = actual_memory_benchmark_len + 1
memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")
collectgarbage()
collectgarbage()
actual_memory_benchmark_len = actual_memory_benchmark_len + 1
memory_benchmark[actual_memory_benchmark_len] = collectgarbage("count")

for k = 1, actual_memory_benchmark_len do
    print(memory_benchmark[k] / 1024)
end

print("GCtriggers")

local inverse_map = {}

for k = 1, actual_GCtriggers_len do
    inverse_map[GCtriggers[k]] = (inverse_map[GCtriggers[k]] or 0) + 1
end

for k = 1, actual_memory_benchmark_len do
    if inverse_map[k] then print(inverse_map[k]) else print() end
end
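
For anyone reproducing this: the script first prints one memory sample per line (in MB), then prints, per sample index, how many GC cycles fired at that step, so the two columns line up when pasted into a plotter.
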
msva commented 1 year ago

Interesting moment: I've tested the different implementations, and here is the conclusion:

1) On my laptop almost all implementations (except "old") produce pretty similar results; even on the plotter the differences are only visible under heavy magnification.
2) With magnification it is noticeable that your suggested implementation has slightly lower values during the run, but (visible even without magnification, and even on your plot above) it has the highest upward peak and the smallest downward dip at steps 10002-10004.
3) The current implementation has a lower upward peak and the deepest downward dip at the same steps. :shrug:

[plot of the same comparison on my machine]

So:

1) The difference over the whole run is barely noticeable (at least here :shrug:); it is literally measured in kilobytes.
2) We have a tradeoff between a small difference in consumption during the run and the size of the peaks at the end of the run. :shrug:

As for time consumption: it is strange, but I'm getting different results:

[screenshot of timing results; the last two runs re-test the current algorithm :shrug:]

Although, that may be because I'm using OpenResty's optimized LuaJIT-2.1 fork. :shrug:

So I'm not really sure about the best choice. If you insist, I can adopt your variant (the single table.concat), but my tests show that a double table.concat (concatenating the results of inner table.concat calls; see the sketch below) can be an even better choice.
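
A sketch of what such a chunked double table.concat could look like; this is a reconstruction under assumptions (the chunk size of 64 is arbitrary), not necessarily the exact variant that was benchmarked:

local function __tostring_double(set)
    local chunks = {}
    local chunk, n = {}, 0
    for k in pairs(set) do
        n = n + 1
        chunk[n] = tostring(k)
        if n == 64 then
            -- flush a full chunk into one intermediate string
            chunks[#chunks + 1] = table.concat(chunk, ", ", 1, n)
            chunk, n = {}, 0
        end
    end
    if n > 0 then
        -- flush the final, possibly partial chunk
        chunks[#chunks + 1] = table.concat(chunk, ", ", 1, n)
    end
    -- join the chunk strings with the same separator
    return "{" .. table.concat(chunks, ", ") .. "}"
end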

===========

As evidence on memory consumption, can you please plot these files with the plotter you used for the picture above?

double_tbl_concat.csv old.csv sf.csv single_tbl_concat.csv suggested.csv

(The csv extension is just to make the GH uploader happy; each file is the output of your script with a different function plugged in.)

msva commented 1 year ago

(sf is string.format, for the record)
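
A plausible shape for that string.format variant, assuming it formats the joined list in a single call (again a sketch, not the benchmarked code):

local function __tostring_sf(set)
    local list, n = {}, 0
    for k in pairs(set) do
        n = n + 1
        list[n] = tostring(k)
    end
    -- one format call wraps the joined elements in braces
    return string.format("{%s}", table.concat(list, ", "))
end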