heroiclabs / nakama-cpp

Generic C/C++ client for Nakama server.
https://heroiclabs.com/docs/cpp-client-guide
Apache License 2.0
73 stars 25 forks source link

Crash in ppltasks Library #41

Closed MarkVabulas closed 1 year ago

MarkVabulas commented 3 years ago

Background: Compiled the Nakama c++ client library, including CppRestSDK, gRPC, SSL, RapidJson, and zlib, for use in ARM64 on the Hololens 2. Using C++/WinRT but not using the /ZW switch so it's still conformal C++17. Everything compiles correctly and the program runs properly on the Hololens when deployed. (This isn't the emulator version, it's the actual ARM64 hardware.)

I copy/pasted the default example class from nakama-cmake-client-example.cpp into our code, as a way to bootstrap the development. Dropped the class into our App class, and the default initializer starts as expected. Inside _client->authenticateDevice, the software hits an exception inside ppltasks at line 1910 when the "_LockHolder" mutex goes out of scope and UNLOCKS the mutex. It does sucessfully lock it. Also, from what I can tell, this is also NOT the first time through this code in the program, because when I put a breakpoint on it, it doesn't crash the first time through.

The callstack looks as such:

.exe!std::_Mutex_base::unlock() Line 67
.exe!std::lock_guard<std::mutex>::~lock_guard<std::mutex>() Line 444
.exe!Concurrency::details::_Task_impl_base::_ScheduleContinuation(Concurrency::details::_ContinuationTaskHandleBase * _PTaskHandle) Line 1915
.exe!Concurrency::task<web::http::http_response>::_ThenImpl<web::http::http_response,std::function<void __cdecl(Concurrency::task<web::http::http_response>)>>(const std::function<void __cdecl(Concurrency::task<web::http::http_response>)> & _Func, Concurrency::details::_ThenImplOptions & _Options) Line 3927
.exe!Concurrency::task<web::http::http_response>::then<void <lambda>(Concurrency::task<web::http::http_response>)>(const Nakama::NHttpClientCppRest::request::__l2::void <lambda>(Concurrency::task<web::http::http_response>) & _Func) Line 3188
.exe!Nakama::NHttpClientCppRest::request(const Nakama::NHttpRequest & req, const std::function<void __cdecl(std::shared_ptr<Nakama::NHttpResponse>)> & callback) Line 151
.exe!Nakama::RestClient::sendReq(Nakama::RestReqContext * ctx, Nakama::NHttpReqMethod method, std::string && path, std::string && body, std::multimap<std::string,std::string,std::less<std::string>,std::allocator<std::pair<std::string const ,std::string>>> && args) Line 160
.exe!Nakama::RestClient::authenticateDevice(const std::string & id, const std::optional<std::string> & username, const std::optional<bool> & create, const std::map<std::string,std::string,std::less<std::string>,std::allocator<std::pair<std::string const ,std::string>>> & vars, std::function<void __cdecl(std::shared_ptr<Nakama::NSessionInterface>)> successCallback, std::function<void __cdecl(Nakama::NError const &)> errorCallback) Line 340
.exe!NakamaSessionManager::start(const std::string & deviceId) Line 116
.exe!winrt::App::Scene::{ctor}::__l2::<lambda>() Line 380

Let me know if you need more information, I would love to resolve this.

Side note: When I run it using the gRPC client, I get no crash but it's not seeing my Nakama server, so that's why I'm using createDefaultClient instead of createGRPCClient.

mofirouz commented 3 years ago

I'll let @lugehorsam pipe in for the C++ issue - but on:

Side note: When I run it using the gRPC client, I get no crash but it's not seeing my Nakama server, so that's why I'm using createDefaultClient instead of createGRPCClient.

What port do you try to connect the gRPC client on? 7349?

MarkVabulas commented 3 years ago

I tried to connect on 7349 and -1 (the code comment says that's the auto-detect port, so I assume each protocol will know which port). I followed the recommended port listed here: https://heroiclabs.com/docs/install-configuration/

The Nakama console is working properly, on my local machine on the same LAN as the device, using the default docker settings from the yaml file.

MarkVabulas commented 3 years ago

Here's my forked repository and the changes I needed to make the to build system in order for it to work. Basically, I removed the use of the third_party directory entirely, by bootstrapping in vcpkg. I fixed the conflicts with the protobuf versions as well.

https://github.com/MarkVabulas/nakama-cpp/commit/af4fe518f6346a26f74e46d7b97b9ed5b04a363c

You can see that the changes made to the system are very minor, and we're not using C++/CW or any nonsense like that. The compiler settings are included in the CMakeSettings.json file.

Additionally, here's the vcpkg list output of the installed libraries+versions it's pulling from, if you feel that matters. I'm trying to do this in the most consistent way possible by removing sources of possible configuration errors etc. It was necessary to do these changes because even in x86 or x64 mode with a basic windows build of the client, it completely crashes and burns with a modern VC++ compiler, especially the test suite. ``` abseil:arm64-windows 2021-03-24#1 an open-source collection designed to augment th... boost-*:arm64-windows 1.76.0 Boost * module boost:arm64-windows 1.76.0 Peer-reviewed portable C++ source libraries brotli:arm64-windows 1.0.9#1 a generic-purpose lossless compression algorithm... bzip2:arm64-windows 1.0.8#2 bzip2 is a freely available, patent free, high-q... c-ares:arm64-windows 1.17.1#1 A C library for asynchronous DNS requests cpprestsdk:arm64-windows 2.10.18 C++11 JSON, REST, and OAuth library cpprestsdk[brotli]:arm64-windows Brotli compression support cpprestsdk[compression]:arm64-windows HTTP Compression support cpprestsdk[default-features]:arm64-windows Features installed by default cpprestsdk[websockets]:arm64-windows Websockets support grpc:arm64-windows 1.37.0#2 An RPC library and framework liblzma:arm64-windows 5.2.5#2 Compression library with an API similar to that ... openssl:arm64-windows 1.1.1k#6 OpenSSL is an open source project that provides ... protobuf:arm64-windows 3.15.8#3 Protocol Buffers - Google's data interchange format rapidjson:arm64-windows 2020-09-14#1 A fast JSON parser/generator for C++ with both S... re2:arm64-windows 2020-10-01 RE2 is a fast, safe, thread-friendly alternative... upb:arm64-windows 2020-12-19#1 μpb (often written 'upb') is a small protobuf i... websocketpp:arm64-windows 0.8.2#1 Library that implements RFC6455 The WebSocket Pr... websocketpp[recommended]:arm64-windows Use recommended dependencies zlib:arm64-windows 1.2.11#10 A compression library zstd:arm64-windows 1.4.9 Zstandard - Fast real-time compression algorithm ``` The next step is just drop the Nakama-cpp sources into my program, and stop using Nakama-cpp as an external library, instead co-opting Nuget for external references and just force Nakama-cpp to live as a first-class citizen in our codebase. It's really not a lot of code. I can't find again the source (from 2016!), but I did manage to read over the weekend that ppltasks is being completely phased out of compatibility going forward for development, in favor of C++20 provided coroutines. If my assumptions are true about the theoretically forward-looking development of Microsoft relative to their C++/WinRT implementations on the Arm64 and WindowsARM, I think there might be conflicts between ppltasks and Winrt coroutines. If this is the case, I'll end up refactoring the internals of Nakama-cpp using the C++/WinRT coroutine abstractions. Interestingly, this will actually be portable to other systems through people boostrapping the C++/WinRT abstractions for Linux, for example.

After the wall of text, I'm still unable to get the clients to connect, neither due to the crash in ppltasks's mutex unlocking, nor through gRPC's native client code. Is there any insight you might be able to provide as to a better direction to go or maybe an error in my reasoning for how I re-implemented Nakama-cpp's library and build system?

novabyte commented 3 years ago

@MarkVabulas Thanks for the comprehensive response its really useful to help deconstruct the problem and improve the open source code to make it more flexible.

Is there any insight you might be able to provide as to a better direction to go or maybe an error in my reasoning for how I re-implemented Nakama-cpp's library and build system?

I think the next place to look is for us to know more about how the server environment is set up. Is Nakama behind a load balancer? If so what type of load balancer is it and does it handle HTTP2 traffic which is required for GRPC?

MarkVabulas commented 3 years ago

The Nakama server is running on my local development machine. I installed it following the instructions available at https://heroiclabs.com/docs/nakama/getting-started/docker-quickstart/

docker-compose.txt

As for local network configuration, I'm on a private NAT'd LAN working remotely at home, on the 192.168.2.* range. The only machines on this network are my work machines, so I comfortably disabled my development machine's firewall, so that's not an issue. I've also uploaded the config.yaml file exported from the web interface, to show the current config post-installation for the Nakama server, if that helps.

config.yaml.txt

Lastly, I am able to access the Nakama console from the headset, as well as other machines on the network, so it's clearly not a firewall issue. I hope this helps, it's literally a vanilla installation with no extra bells and whistles. Implementation of more advanced server strategies can wait until after I have the C++ client code up and connecting, and will be handled by our other server engineers.

novabyte commented 3 years ago

Yep that helps. If it's a local set up you won't need to worry about how GRPC traffic is handled. Can you share the code you use to set up the client object?

MarkVabulas commented 3 years ago

Here's the code I've copied and modified from the example:

`struct NakamaSessionManager { NakamaSessionManager() noexcept { NClientParameters parameters;

    // set server end point
    parameters.serverKey = "defaultkey";
    parameters.host = "192.168.2.14";
    parameters.port = 7350;

    _client = createDefaultClient(parameters);
}

void start(const string& deviceId)
{
    // to do: read session token from your storage
    string sessionToken;

    if (!sessionToken.empty())
    {
        // Lets check if we can restore a cached session.
        auto session = restoreSession(sessionToken);

        if (!session->isExpired())
        {
            // Session was valid and is restored now.
            _session = session;
            LogNormal(L"restored session with token: " + to_wstring(session->getAuthToken()));
            onAuthenticated();
            return;
        }
    }

    const auto successCallback = [this](NSessionPtr session)
    {
        // to do: save session token in your storage
        _session = session;
        LogNormal(L"session token: " + to_wstring(session->getAuthToken()));
        onAuthenticated();
    };

    const auto errorCallback = [](const NError& error)
    {
        LogNormal("RT Error (start): " + toString(error));
    };

    if (_client)
        _client->authenticateDevice(deviceId, opt::nullopt, true, {}, successCallback, errorCallback);
}

void tick()
{
    if (_client)
        _client->tick();
    if (_rtClient)
        _rtClient->tick();
}

void onAuthenticated()
{
    const auto successCallback = [this](const NAccount& account)
    {
        _account = account;

        LogNormal(L"account user id: " + to_wstring(account.user.id));
        LogNormal(L"account user created at: " + to_wstring(account.user.createdAt));
    };

    const auto errorCallback = [](const NError& error)
    {
        LogNormal("RT Error (onAuthenticated): " + toString(error));
    };

    if (!_client)
        return;

    if (!_rtClient)
        return;

    _client->getAccount(_session, successCallback, errorCallback);

    // create real-time client (will use websocket transport)
    _rtClient = _client->createRtClient();

    _listener.setConnectCallback([this]()
    {
        LogNormal(L"Socket connected");

        _rtClient->updateStatus("Enjoying music", []()
            {
                LogNormal(L"Status updated");
            });
    });

    _rtClient->setListener(&_listener);

    _rtClient->connect(_session, true);
}

NClientPtr _client;
NSessionPtr _session;
NRtClientPtr _rtClient;
NRtDefaultClientListener _listener;
std::optional<NAccount> _account;
std::optional<NMatch> _currentRoom;

};`

To instantiate it, I've got an instance of the class in my main app's class:

NakamaSessionManager ServerSession;

To use it, inside my app's default constructor which is called after the main app starts (not pre-WinMain, but post WinMain through heap allocation), I have the call:

ServerSession.start("SampleHardwareID");

During my main loop, which executes not long after, at the start of the loop, I added the code:

ServerSession.tick();

The problem I have is that the code crashes during the ServerSession.start() when I'm using createDefaultClient or createRestClient, or it fails with a log message when using createGrpcClient then continues to march away through normal program operations. I do update the port number in the config parameters.port = 7350; when I try the different implementations.

I truly am trying to just copy/paste the default code out of the nakama-cpp codebase+examples, with no unnecessary modifications. Once I have at least a connection going, I'm going to go crazy with full implementation, but that's a pretty simple task once we have liftoff.

MarkVabulas commented 3 years ago

Additionally, when the ppltasks mutex goes out of scope and unlocks then crashes, I've already checked and the other memory around it isn't corrupted and it's not trying to dereference a nullptr or something trivial like that. It appears to be a valid block of memory with properly initialized variables.

MarkVabulas commented 3 years ago

Interestingly, if I change the offending code from

auto task = _client->request(request);
(void) task.then([context_wptr, reqId, hasCallback](pplx::task<http_response> previousTask)
{
  try
    {
      http_response response = previousTask.get();

to

http_response response = _client->request(request).get();
{
  try
    {

I instead get a DIFFERENT crash in ppltasks, but again with unlocking a mutex, however a different one.

The callstack now looks like:

.exe!std::_Mutex_base::unlock() Line 67 .exe!std::unique_lock::~unique_lock() Line 189 .exe!Concurrency::details::_TaskCollectionBaseImpl::WaitUntilStateChangedTo(Concurrency::details::_TaskCollectionBaseImpl::_TaskCollectionState _State) Line 191 .exe!Concurrency::details::_TaskCollectionBaseImpl::_Wait() Line 232 .exe!Concurrency::details::_TaskCollectionBaseImpl::_RunAndWait() Line 227 .exe!Concurrency::details::_Task_impl_base::_Wait() Line 1580 .exe!Concurrency::task::get() Line 3319 .exe!Nakama::NHttpClientCppRest::request(const Nakama::NHttpRequest & req, const std::function<void cdecl(std::shared_ptr)> & callback) Line 150 .exe!Nakama::RestClient::sendReq(Nakama::RestReqContext * ctx, Nakama::NHttpReqMethod method, std::string && path, std::string && body, std::multimap<std::string,std::string,std::less,std::allocator<std::pair<std::string const ,std::string>>> && args) Line 161 .exe!Nakama::RestClient::authenticateDevice(const std::string & id, const std::optional & username, const std::optional & create, const std::map<std::string,std::string,std::less,std::allocator<std::pair<std::string const ,std::string>>> & vars, std::function<void __cdecl(std::shared_ptr)> successCallback, std::function<void cdecl(Nakama::NError const &)> errorCallback) Line 341 .exe!NakamaSessionManager::start(const std::string & deviceId) Line 61

Based on this, I would say that the issue is directly related to ppltasks. Whether we're using the async continuation as normal, or calling an immediate .get() on it, we're getting crashes when unlocking mutexes in ppltasks.

It's also worth mentioning that this is all after I have dropped Namaka-cpp's source code into my project as planned, so it's no longer an issue with potential compiler/library conflicts. The behavior is as before, including the exact same crash as originally described.

Do you guys have any idea why ppltasks might have issues with mutex unlocking? This is extremely strange. From auditing the nakama-cpp example code myself, I'm not seeing anything in the implementations of the coroutines that suggest that it's being used improperly; however, maybe I'm missing some sort of library setup on my end that could fix it.

Side note: I did try to use the #define CPPREST_FORCE_PPLX to force using the internal pplx library included with cpprest, but that ended up with linker errors that I couldn't figure out how to remedy, since cpprest is coming in through vcpkg and I can't easily control the compiler flags like that without editing portfiles.

MarkVabulas commented 3 years ago

So after significant amounts of research and digging, I ran across a very unhelpful article which I think the developers at Nakama should read.

Pitfalls mixing PPL+await This article also forwards on to Creating an apartment-aware PPL task from nothing

I tried implementing their "fixes" through modifying NHttpClientCppRest.cpp so that it would start with a task in the correct apartment, but that quickly became impossible. Due to Microsoft's bone-headed mixture of using UWP | WinRT | C++/CX almost interchangably in the past, and none of them actually having anything to do with C++/WinRT, it quickly became necessary to try to implement the coroutine tasks in a forward operating manner through IAsyncOperation. But chaining together IAsyncOperation in parallel with Concurrency::task doesn't actually work.

This brings me to the transitioning documents for how to port from ppl/Concurrency and/or C++/CX to modern C++17/C++20 using C++/WinRT, Move to C++/WinRT from C++/CX. Since we're not using the /ZW switch (C++/CX), we fall into the note on the page; the operative quote is:

If you have a Windows Runtime component project, then porting in one pass is your only option. A Windows Runtime component project written in C++ must contain either all C++/CX source code, or all C++/WinRT source code. They can't coexist in this project type.

In effect, using CppRestSDK is a no-go, since it relies heavily on ppltasks, which are fundamentally incompatible with the future evolution of projects on future Microsoft operating systems, such as Xbox and Hololens. Of course, ppl will continue to be used for desktop usage, but in walled-garden ecosystems, Microsoft is demanding a certain level of conformity. C++/CX, CppRestSDK, and ppltasks have all been deprecated, officially.

So, in the end, I'm going to have to reimplement the Nakama-cpp library using WinRT libraries as a drop-in replacement of the CppRestSDK/Websockets. At least it's almost a 1-to-1 replacement.

Dimon4eg commented 3 years ago

Instead of CppRestSDK try to use https://github.com/machinezone/IXWebSocket It has websocket & HTTP support

lugehorsam commented 3 years ago

Thank you @MarkVabulas for the investigation on Hololens support. If you do wind up writing your own adapter, feel free to PR it and we can review it for bringing into the cpp client repository.

MarkVabulas commented 3 years ago

The adapter has been written in C++20 using native C++/WinRT libraries distributed with Visual Studio 2019. It's almost a direct copy/paste of the CppRestSDK code, because the WinRT implementation is just a migrated and cleanly implemented port of the CppRestSDK as-is. When I get a moment to commit it to my fork of nakama-cpp, I'll submit the PR.

For the record, after rewriting it, both the RtClient and the normal Rest Client are functioning properly, first try, happily.

lugehorsam commented 2 years ago

Hey @MarkVabulas we've refactored the build system and dependencies that Nakama uses and have removed the need for CPPRestSDK. Let us know how this affects your Hololens work.

lugehorsam commented 1 year ago

Closing due to being resolved by author and codebase changing significantly since it last occurred.