Closed mpeterv closed 4 years ago
I think the problem is that utf8.offset(s, char_index) is implemented via utf8.charpos(s, 1, char_index), but charpos treats the offset argument as a 0-based offset and not 1-based offset when there are three arguments.
Quick test of compatibility (run with Lua 5.3):
local lua_utf8 = require "lua-utf8"
local s = "a\u{256}bc"
local total = 0
local okay = 0
local function test(n, i)
local ok, ret_lua = pcall(utf8.offset, s, n, i)
ret_lua = ok and ret_lua or "error"
local ret_lib = lua_utf8.offset(s, n, i)
print(n, i, ret_lua, ret_lib)
if ok then
total = total + 1
if ret_lua == ret_lib then okay = okay + 1 end
end
end
for n = -4, 4 do
test(n)
for i = 1, 3 do
test(n, i)
end
end
print("Total ignoring errors on continuation bytes:", total)
print("Matching results:", okay)
Even ignoring errors initial position is a continuation byte
that Lua version throws, out of 28 remaining cases only in 4 lua-utf version produces the same result as standard Lua. This seems quite far from advertised Some routines are the compatible for Lua 5.3's basic UTF-8 support library:
Ignoring the errors standard Lua gives when byte offset points to a continuation byte, simply not swapping the arguments (lua_insert in Lutf8_offset) when there are only two arguments fixes that case, then correcting the third argument by one fixes the three argument case for positing offsets. Not sure what to do with negative offsets yet.
good catch, I will look into this issue, could you make a pull request for it? :-)
I have not looked into figuring out the fix for negative offsets so I can't promise a fast PR. For my usecase I just switched to charpos
function.
The compatibility test above (adjusted to pcall luaut8 version of offset, too) shows that there are still many discrepancies when the first argument is negative.
It seems all test are pass, so close the issue.
Testcase:
Output:
Lua version:
Lua 5.3.5 Copyright (C) 1994-2018 Lua.org, PUC-Rio
luautf8 version:luautf8 0.1.1-1 - A UTF-8 support module for Lua