starwing / luautf8

a utf-8 support module for Lua and LuaJIT.
MIT License
406 stars 67 forks source link

utf8.offset behavior doesn't match standard Lua 5.3 library #21

Closed mpeterv closed 4 years ago

mpeterv commented 6 years ago

Testcase:

local lua_utf8 = require "lua-utf8"
print(utf8.offset("foo", 1))
print(lua_utf8.offset("foo", 1))

Output:

1
2   111

Lua version: Lua 5.3.5 Copyright (C) 1994-2018 Lua.org, PUC-Rio luautf8 version: luautf8 0.1.1-1 - A UTF-8 support module for Lua

mpeterv commented 6 years ago

I think the problem is that utf8.offset(s, char_index) is implemented via utf8.charpos(s, 1, char_index), but charpos treats the offset argument as a 0-based offset and not 1-based offset when there are three arguments.

mpeterv commented 6 years ago

Quick test of compatibility (run with Lua 5.3):

local lua_utf8 = require "lua-utf8"
local s = "a\u{256}bc"
local total = 0
local okay = 0

local function test(n, i)
   local ok, ret_lua = pcall(utf8.offset, s, n, i)
   ret_lua = ok and ret_lua or "error"
   local ret_lib = lua_utf8.offset(s, n, i)
   print(n, i, ret_lua, ret_lib)
   if ok then
      total = total + 1
      if ret_lua == ret_lib then okay = okay + 1 end
   end
end

for n = -4, 4 do
   test(n)

   for i = 1, 3 do
      test(n, i)
   end
end

print("Total ignoring errors on continuation bytes:", total)
print("Matching results:", okay)

Even ignoring errors initial position is a continuation byte that Lua version throws, out of 28 remaining cases only in 4 lua-utf version produces the same result as standard Lua. This seems quite far from advertised Some routines are the compatible for Lua 5.3's basic UTF-8 support library:

mpeterv commented 6 years ago

Ignoring the errors standard Lua gives when byte offset points to a continuation byte, simply not swapping the arguments (lua_insert in Lutf8_offset) when there are only two arguments fixes that case, then correcting the third argument by one fixes the three argument case for positing offsets. Not sure what to do with negative offsets yet.

starwing commented 6 years ago

good catch, I will look into this issue, could you make a pull request for it? :-)

mpeterv commented 6 years ago

I have not looked into figuring out the fix for negative offsets so I can't promise a fast PR. For my usecase I just switched to charpos function.

mpeterv commented 5 years ago

The compatibility test above (adjusted to pcall luaut8 version of offset, too) shows that there are still many discrepancies when the first argument is negative.

starwing commented 4 years ago

It seems all test are pass, so close the issue.