openresty / lua-nginx-module

Embed the Power of Lua into NGINX HTTP servers
https://openresty.org/
11.3k stars 2.03k forks source link

Lightuserdata on ARM64 #1152

Closed agentzh closed 5 years ago

agentzh commented 7 years ago

The original discussion started here:

https://github.com/openresty/lua-nginx-module/issues/757#issuecomment-328519741

But it really does not belong there.

agentzh commented 7 years ago

We could strip off the 48-bit value for non-pointer lightuserdata values used solely for hash table keys (in the registry hash table, for example), as implemented in #1151, but we cannot do that for real C pointers like cycle. Maybe the ultimate solution here is to avoid using lightuserdata for such C pointer values but use proper cdata objects which do not have this limitation in LuaJIT's GC64 mode.

dndx commented 7 years ago

Add to that, another place that is definitely problematic is

https://github.com/openresty/lua-nginx-module/blob/master/src/ngx_http_lua_util.h#L354

Now we can hack around all places with the mask and restores the original address when reading, but I fear that is gonna get ugly very soon and may break again once the memory layout changes (mostly due to new architecture). It seems to me there is no portable way of fixing this unless LuaJIT stops using memory addresses to piggyback information.

Background reading:

  1. https://github.com/LuaJIT/LuaJIT/issues/49
  2. https://github.com/LuaJIT/LuaJIT/issues/156
agentzh commented 7 years ago

@dndx No, you cannot disable GC64 on ARM64 since GC64 is required on that architecture.

The only hole here seems like the lightuserdata thing. We can replace it with FFI cdata, at least in theory, I think.

agentzh commented 7 years ago

@thehowl Is it possible to configure your Linux kernel to use the "memory layout with 64KB pages + 2 levels" as mentioned in LuaJIT/LuaJIT#49?

In the meantime, we'll try phasing out the use of lightuserdata for real C pointer data by migrating them to LuaJIT's FFI cdata objects to remove this limitation for good.

thehowl commented 7 years ago

Sorry about not having answered to your requests yet, I'm actively working on trying to change the bloody kernel, but it seems to be a crusade-level challenge to do that. Built a debian package for the custom kernel I compiled, rebooted aaand it was still using the old kernel (I'm guessing this has to do with the way Scaleway servers boot up, and I'm looking into the matter). Hopefully with some more research I should be able to try out what you asked me to do, though it's quite the nightmare it seems.

thehowl commented 7 years ago

Sorry for the double update, but I've finally tracked down the fact that I can't change the kernel. (sigh)

I'm using Scaleway, and as I found out, the only way to use a custom kernel is through kexec. And from what is said on this post ARMv8 servers (the ones I'm using) DO NOT have kexec and this has been requested here: https://github.com/scaleway/kernel-tools/issues/348 (no answer from the Scaleway/online.net team yet). So, I'm terribly sorry, but I really can't change the kernel to use 64kb pages + 2 levels :(

jonpspri commented 6 years ago

Checking in on this issue, because it's impacting an effort for multi-architecture OpenWhisk. I'm happy to do some lifting and build PR's if you can point me toward where the work needs to be done.

agentzh commented 6 years ago

@jonpspri If all the lightuserdata pointers have a fixed offset, then removing and re-adding that offset for all lightuserdata values would work. I'm not familiar with that architecture though. Help needed here.

jonpspri commented 6 years ago

Yeah, it doesn't look quite that simple, but it is close. There are two use cases that I see:

1) Pointers being used as keys to the registry -- no big deal; mask them to 47 bits and they'll still work without collision because of how the linker manages them. I think I see a branch where that work has already been done, but if not it's a simple macro/inline function to make it happen.

2) Pointers to actual data structures (i.e. ___ngx_req). More problematic because they need to remain intact. The best likely path is to convert them to use userdata. That shouldn't require a pile of effort because the instances are fairly well contained, but I will move slowly.

I'm setting up an Nginx::Test environment on a VM right now so I can make the changes on an x86 environment and ensure they don't break core capabilities before verifying on an ARM64.

agentzh commented 6 years ago

@jonpspri Using userdata for those pointer values will be much more expensive since they become GC objects now. A lightuserdata value can be fit into a TValue object directly without the need of GC management. I wonder if it's possible to re-patch the truncated lightuserdata value everytime we read from the Lua VM. This is still way cheaper than using GC objects.

agentzh commented 6 years ago

@jonpspri After thinking on this a bit more, or maybe we can make the GC object persist in the global env table, this way we can avoid allocating and deallocating new GC objects upon every invocation of filter handler invocation.

debayang commented 6 years ago

Hi @agentzh, @jonpspri Curious to know if any progress made on this issue .

We would like to have this issue resolved and would be willing to help out to get it to a resolution faster Thanks.

vielmetti commented 6 years ago

@agentzh - it looks like the neovim folks tackled this exact problem successfully in their PR https://github.com/neovim/neovim/pull/7880 "Remove all places where lightuserdata is used". I am looking through Github to find other similar issues and hopefully solutions to what looks like a common problem.

Another instance of a successful fix is at https://github.com/PowerDNS/pdns/pull/6512 from PowerDNS.

agentzh commented 6 years ago

@vielmetti I looked into the PRs you referenced and they look more like workarounds than real fixes. Please note that lua_pushstring and lua_pushlstring are WAY more expensive than lua_pushlightuserdata since the former requires creating new GC object and also the relatively expensive lj_str_new() call.

agentzh commented 6 years ago

Another thing we need to find a solution to is to port the opitmized lj_str_new() using Intel's fast SSE 4.2 instruction set's CRC instruction over to ARM64 equivalents otherwise the hash collisions on ARM64 can be really ugly:

https://github.com/openresty/luajit2/commit/7923c6343

Habbie commented 6 years ago

Please note that lua_pushstring and lua_pushlstring are WAY more expensive than lua_pushlightuserdata since the former requires creating new GC object and also the relatively expensive lj_str_new() call.

The PowerDNS link has benchmark results, for what it's worth.

Habbie commented 6 years ago

Pointers being used as keys to the registry -- no big deal; mask them to 47 bits and they'll still work without collision because of how the linker manages them.

Do you have pointers to documentation that makes this promise? As you can see from the PowerDNS results, the effective pain is just a few percent, but avoiding it would be even better.

dndx commented 6 years ago

@Habbie I am not sure if lua_pushstring will solve the problem here. We have to be able to store pointer addresses to data structures like ngx_request_t or ngx_cycle_t and just being able to store their type is not good enough. Unless you meant to store the value of the address as strings, but that seems to be very hacky compared to the cdata approach.

agentzh commented 6 years ago

@Habbie One common problem with those micro-benchmarks is that they may not be really testing things or cases that they are supposed to cover (like hash collisions are happening or irrelevant GC allocations or over overwhelming operations in other parts of the code shadowing the real overhead).

vielmetti commented 6 years ago

I'll open up the opportunity to get test cycles (or benchmark cycles) on a hosted arm64 machine at Packet. I run the Works on Arm project and we have a cluster (which @Habbie can speak to directly from having used it) for arm64 tests. This effort is funded by Arm and has resources both for short-term test and development and long-term CI. See http://github.com/worksonarm/cluster for details.

bobmon01 commented 6 years ago

@vielmetti Thanks Ed, for helping to revisit this issue. @agentzh Thanks for the points you raise about the performance issues and workarounds. We discovered this whole issue with lightuserdata while building ONAP components for an initial demo, where OpenResty is leveraged. We solved our immediate problem by leveraging one of the workarounds employed by others, but certainly the longer term solution must still be found and optimized performance established for Arm platforms.

Habbie commented 6 years ago

@Habbie I am not sure if lua_pushstring will solve the problem here. We have to be able to store pointer addresses to data structures like ngx_request_t or ngx_cycle_t and just being able to store their type is not good enough. Unless you meant to store the value of the address as strings, but that seems to be very hacky compared to the cdata approach.

Where addresses (the address of a type_info) were used as indexes, we replaced those with (semi-human readable) strings (the mangled type names) - https://github.com/ahupowerdns/luawrapper/pull/46

Where addresses needed to be stored for later retrieval, we boxed them in a full userdata - https://github.com/ahupowerdns/luawrapper/pull/45

Where we were using addresses as indexes but really didn't need to, we replaced that with luaL_ref - https://github.com/ahupowerdns/luawrapper/pull/44

We did not go for cdata as we'd like our code to stay portable to non-LuaJIT. If real world experience turns out much worse than the micro-benchmark, we'll look into making the change above a compile time choice.

Habbie commented 6 years ago

I'm setting up an Nginx::Test environment on a VM right now so I can make the changes on an x86 environment and ensure they don't break core capabilities before verifying on an ARM64.

Highly recommend this approach - I did the same for PowerDNS.

debayang commented 6 years ago

Another thing we need to find a solution to is to port the opitmized lj_str_new() using Intel's fast SSE 4.2 instruction set's CRC instruction over to ARM64 equivalents otherwise the hash collisions on ARM64 can be really ugly:

openresty/luajit2@7923c63

Thanks for pointing this out. I am trying to port this one to ARM64.

vielmetti commented 6 years ago

Noted in the above referenced issue that OpenWhisk for arm64 is blocked on this.

dknecht commented 6 years ago

@agentzh Any updates?

agentzh commented 6 years ago

We've finally prepared all the patches for all the OpenResty components. All our tests have successfully passed in a real arm64 box, including the valgrind memcheck tests. We're currently doing the final round of code review.

One good news is that the simplest Lua handler on OpenResty is also overall 10% faster with wrk benchmark on x86_64 (and we should see comparable results on all the other architectures we already supported). We generally didn't follow the suggestions from the LuaJIT community since they tend to make things slower rather than faster :)

luohuichang commented 5 years ago

@agentzh Hi, I have the same question, cound you tell me where to get the patches or the version?

nginx: [error] init_by_lua error: fail to require cjson.safe module,err:bad light userdata pointer stack traceback: [builtin#20]: at 0xffff88d18920 [C]: in function 'require' /usr/local/NSP/etc/router_tmp/router_config.lua:18: in main chunk

agentzh commented 5 years ago

@luohuichang Just try this for now:

https://openresty.org/download/openresty-1.15.8.1rc0.tar.gz

See below for building instructions:

https://openresty.org/en/installation.html

This is already resolved in the git master branch and this rc0 version. Closing this.

timusketeers commented 5 years ago

@agentzh , Hi, I have the same question with @luohuichang , and I try the version openresty-1.15.8.1rc0.tar.gz. It's really work. But when I integrate it with Kong. we get the following error:

[root@kwephicprc03693 bin]# ./kong 2019/03/04 17:39:56 [warn] 26412#0: 2 [lua] _G write guard:12: __newindex(): writing a global lua variable ('_NETTLE_LIB_PATH') which may lead to race conditions between concurrent requests, so prefer the use of 'local' variables stack traceback: ./kong:5: in function 'file_gen' init_worker_by_lua:45: in function [C]: in function 'xpcall' init_worker_by_lua:52: in function , context: ngx.timer 2019/03/04 17:39:56 [warn] 26412#0: 2 [lua] _G write guard:12: __newindex(): writing a global lua variable ('_KEY_COMPONENTS_PATH') which may lead to race conditions between concurrent requests, so prefer the use of 'local' variables stack traceback: ./kong:6: in function 'file_gen' init_worker_by_lua:45: in function [C]: in function 'xpcall' init_worker_by_lua:52: in function , context: ngx.timer

Below is the script file content named kong:

[root@kwephicprc03693 bin]# cat kong

!/usr/bin/env /opt/cloud/openresty/bin/resty

_NETTLE_LIB_PATH = "/opt/cloud/openresty/lualib/nettle-3.4/" _KEY_COMPONENTS_PATH = "/opt/cloud/openresty/nginx/conf/"

package.path = "./?.lua;./?/init.lua;" .. package.path package.cpath = "./?.so;./?/init.so;/opt/cloud/openresty/luajit/share/lua/5.1/?.so;/opt/cloud/openresty/luajit/share/lua/5.1/?/init.so;" .. package.cpath

timusketeers commented 5 years ago

the problem mentioned above occur on ARM platform.

agentzh commented 5 years ago

@zllovezj The new version of OpenResty just exposes misuse of globals in Kong (or its plugins). You should contact the Kong team to address them or support you to address them. Not an issue in OpenResty itself.

timusketeers commented 5 years ago

@agentzh thanks

timusketeers commented 5 years ago

@agentzh Hi, I find the problem "bad light userdata pointer stack traceback" in the lib lpeg.so on ARM64. and I find the source file change you solved the problem on openresty like below.

agentzh commented 5 years ago

@timusketeers You should contact the maintainers of that lpeg library instead. We are not maintaining that module. Maybe it actually uses the light userdata as real C pointers and dereferences them.

For Lua modules maintained by us, like lua-cjson, you can try it out by just using the latest OpenResty 1.15.8.1 RC1. And we'd like to hear about any issues with OpenResty's Lua C modules.

timusketeers commented 5 years ago

@agentzh I will contract the maintainers of lpeg, thanks