I have written a few scripts to preprocess a text corpus of about 6 MB. In order to preserve the text formatting I need to iterate over each line and do some text manipulation on it. This in turn produces PANIC: unprotected error in call to Lua API (not enough memory). I decided to try tds.Hash to hold my corpus table.
Here is the code I am using:
local tds = require 'tds'

text_arr = tokenize(text)      -- text_arr is a {idx: {tokens arr}}
text_arr = tds.Hash(text_arr)

-- replace rare tokens with <unk>
for l = 1, #text_arr do        -- iterating lines
    for t = 1, #text_arr[l] do -- iterating tokens in the line
        -- rare is an array of rare words
        for r = 1, #rare do
            if text_arr[l][t] == rare[r] then
                text_arr[l][t] = "<unk>"
            end
        end
    end
end
text_arr is a table of 2900 lines, and this triple nested loop becomes really slow when using tds.Hash. I am by no means a Lua expert, but am I doing something wrong?
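For comparison, one variant I could try is to turn rare into a set, so that the membership test is a single hash lookup instead of a scan over the whole rare array. A minimal sketch of what I mean, assuming rare holds plain strings:

-- build the lookup table once: rare_set[word] is true for every rare word
local rare_set = {}
for r = 1, #rare do
    rare_set[rare[r]] = true
end

-- one hash lookup per token instead of scanning all of rare
for l = 1, #text_arr do
    for t = 1, #text_arr[l] do
        if rare_set[text_arr[l][t]] then
            text_arr[l][t] = "<unk>"
        end
    end
end

Would that fix the slowness, or is the real cost the element-wise indexing into tds.Hash itself?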