hrydgard / ppsspp

A PSP emulator for Android, Windows, Mac and Linux, written in C++. Want to contribute? Join us on Discord at https://discord.gg/5NJB6dD or just send pull requests / issues. For discussion use the forums at forums.ppsspp.org.
https://www.ppsspp.org

Audio stutters when networking has latency (needs to be implemented using non-blocking networking) #10832

Open AkiraJkr opened 6 years ago

AkiraJkr commented 6 years ago

The issue is simple: once you go online (no matter whether it's local, Hamachi, OpenVPN, or anything else) and start WLAN (by doing something like entering multiplayer mode or selecting the guild hall menu), the game's audio suddenly stops following the frameskip and completely ignores it, while the video keeps frameskipping just fine as if nothing happened.

As a result, the audio crackles while the video does not; otherwise the game runs fine.

To reproduce this, stress PPSSPP enough that you can't run the game at full speed without auto frameskip; you'll notice the audio and video frameskip properly, as they should, at 100% speed. Then go to multiplayer mode in any game, and the audio will start acting weird: crackling and out of sync.

The expected behavior is the audio working normally, as it would if the game were running at 100% speed.

Funny fact about this issue: the audio plays properly for a bit when you Alt-Tab.

AkiraJkr commented 6 years ago

Upon further testing, it seems this problem only happens when there are players AND latency.

unknownbrackets commented 4 years ago

I think the issue here is that the networking code generally uses blocking calls. I don't think anything can work unless all networking is done with non-blocking operations.

In PPSSPP, when we hit a blocking operation (like reading from a file), we continue emulation - just on a different emulated PSP thread. To achieve this, we have to read files on a separate host OS thread from the main emulator host OS thread, which then checks to see when the operation is complete.

If we block on the main emulator host OS thread, then we cannot run any other PSP threads (including ones that generate audio.) This will cause audio to run poorly and may even cause the game to crash or misbehave.
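As a rough illustration of that pattern, here is a minimal sketch; sceNetExampleRecv and netWorker are hypothetical names, not existing PPSSPP code, and WAITTYPE_NETWORK is the new wait type discussed further down this thread:

```c++
// Sketch only: hand the blocking work to a host worker thread, then put just
// the calling PSP thread into a wait, so audio and other PSP threads keep running.
static int sceNetExampleRecv(int sock, u32 bufPtr, int len) {
	netWorker.Queue(sock, bufPtr, len, __KernelGetCurThread());
	__KernelWaitCurThread(WAITTYPE_NETWORK, sock, 0, 0, false, "net recv");
	return 0;  // The real result is delivered when the thread is resumed later.
}
```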

-[Unknown]

adenovan commented 4 years ago

@unknownbrackets can you point me to the main emulator host OS thread? All of the #9882 implementation uses non-blocking operations, but I still can't figure out where to init the thread so it isn't inside the main emulator thread and can simulate timeouts properly.

Many of the HLE implementations in sceNetAdhoc block the main thread; where in the source code should I separate this out and create another host OS thread just for networking? Currently I'm creating another thread inside the __NetAdhocInit function.

I need to fix this issue first before continuing to fix #12874, as Digimon World Re:Digitize uses two threads: one for networking and the user main thread. When the game starts its networking module, it heavily blocks the main thread, because blocking inside the HLE call makes the game super slow even when the network is connected through a local router.

I can't figure out how PPSSPP handles threads and does polling properly to avoid blocking inside the SceNetAdhoc HLE module when it's not necessary.

There are already sound and file IO implementations that can continue emulation when they're supposed to block, but I can't figure out from the source code how that works. Maybe I can mimic that thread implementation to implement this properly.

unknownbrackets commented 4 years ago

I see a send() after a changeBlockingMode(socket->id, 0); so I think there's still some blocking network there.

So, here's what I'd suggest:

  1. Create a new class, something like NetplayManager (or whatever name.) This is the class that will handle the ACTUAL networking calls, similar to AsyncIOManager.

  2. On NetInit and NetShutdown, tell the NetplayManager to start and stop, kinda like in sceIo.cpp.

  3. The trickiest part will be the HLE calls that need to do multiple network operations in sequence, but we don't have to move everything at once. I'd start with one of the simpler calls that just does a single network operation - this will be easier to offload.

  4. Instead of send(sock, data, sizeof(data), 0); call something like:

    if (!netManager.Send(sock, data, sizeof(data), __KernelGetCurThread())) {
        __KernelWaitCurThread(WAITTYPE_NETWORK, sock, 0, 0, false, "net send");
        // Or just constantly have an event like cwcheat?
        CoreTiming::ScheduleEvent(usToCycles(100), checkNetworkEvent, __KernelGetCurThread());
    }
  5. After that, you're pretty much done with the HLE call. Inside the netManager, you'll need to put the operation into a queue so the runner can pick it up. There are many examples of this, including GLQueueRunner, AsyncIOManager, etc.

  6. If it's a non-blocking mode socket, it can immediately call send(sock, data, sizeof(data), 0); and if this returns sizeof(data) then you are done, no queuing needed. In my above code example, this would return false so the thread would not be made to wait. If it sends partially, this would need to be told to the queue so it can send only the rest.

  7. When the send is complete (on the other thread) it would flag it, but NOT directly call HLEKernel::ResumeFromWait(). Only the emulator thread can call this.

  8. Instead, this is why we need to schedule a CoreTiming event. That runs on the emulator thread. It would ask netManager if the operation was done, and if so, it would call HLEKernel::ResumeFromWait(threadID, WAITTYPE_NETWORK, sock, result);. If not done, it would schedule another event to check again later. Careful: schedule too often and you'll slow down emulation; schedule too infrequently and there may be network lag. (A sketch of this check event follows below.)

  9. At this point, simple send/recv funcs should work. The next step is "layering." Instead of calling ResumeFromWait always, the event will need to get smarter. I recommend organizing this like IoAsyncFinish. If another step in the operation is needed, you'll just keep the HLE thread waiting and tell the manager to do more.

  10. I haven't noted it above, but most likely you'll need some std::map or similar of state per thread. In theory, there should only be one blocking network operation per thread, so you can store the current state (like if you're doing multi-stage network requests) per thread and just clear it at the same time as waking the HLE thread.

If you skip 6 above, it would in theory even be fine to keep the requests all blocking - but it might be better to use select(). I don't know if games make multiple network requests concurrently from multiple HLE threads.
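To make step 8 a bit more concrete, here is a minimal sketch of that check event; checkNetworkEvent, netManager.PollResult(), and WAITTYPE_NETWORK are assumptions for illustration, not existing PPSSPP code:

```c++
// Hypothetical CoreTiming event handler.  It runs on the emulator thread,
// which is the only thread allowed to resume waiting PSP threads.
static void __NetCheckEvent(u64 userdata, int cyclesLate) {
	SceUID threadID = (SceUID)userdata;

	int sock;
	u64 result;
	if (netManager.PollResult(threadID, &sock, &result)) {
		// The worker finished this thread's operation - wake it with the result.
		HLEKernel::ResumeFromWait(threadID, WAITTYPE_NETWORK, sock, result);
	} else {
		// Not done yet, so poll again shortly.  Too short wastes emulation time,
		// too long adds extra latency on top of the real network latency.
		CoreTiming::ScheduleEvent(usToCycles(500), checkNetworkEvent, userdata);
	}
}
```

Here, checkNetworkEvent would be the event type registered once (for example at NetInit) with CoreTiming::RegisterEvent().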

-[Unknown]

adenovan commented 4 years ago

Thanks for the fast response and the advice; it looks like I'm getting a general overview of how it works.

On my amultios-osx branch, the implementation in the underlying library is using select, and I have all the flags to mark a send as finished; I just need to sync it into the main HLE thread.

We have many threads to handle here in the NetplayManager. From my experience there are many blocking calls in Adhoc, like making connections, sending data, updating the adhoc control, and updating the matching context; many of them use blocking calls that would each need their own host thread.

In my implementation, simulating them fast enough uses 4 host threads; a single port wouldn't do it, as the socket buffer overflows easily and can't keep up with the stream.

It looks like we need many managers to simulate the entire adhoc library here. I guess I need to separate the modules like on the real firmware, to simulate everything in its place. Sony's implementation looks like it calls the underlying library frequently, based on the error codes given.

WDYT, would it be good if I implement something like this?

SceNetModule (NetLibrary): needs 1 thread

SceNetAdhocModule (AdhocLibrary): needs 2 threads

sceNetAdhocctlModule (control library): needs 1 thread, or probably 2

sceNetAdhocMatching (matching library): needs 1 thread as well

unknownbrackets commented 4 years ago

Right so first, I would inventory the possible calls. For example:

enum class NetOperationType {
    CONNECT,
    DISCONNECT,
    SENDTO,
    RECVFROM,
    ACCEPT,
    SELECT,
};

And then you would also define a structure for the data associated with each queue operation:

struct NetOperation {
    NetOperationType type;
    SceUID threadID;
    uint64_t socket;
    std::vector<uint8_t> data;
    // In game time - CoreTiming::GetGlobalTimeUsScaled().
    uint64_t timeoutUs;
    union {
        struct {
            sockaddr_in src;
        } recvfrom;
        ...
    };
};

So, then, when the queue hits that operation it would do something like:

    ssize_t bytes;
    bool done = false;

    switch (task.type) {
    case NetOperationType::RECVFROM: {
        socklen_t srcLen = sizeof(task.recvfrom.src);
        bytes = recvfrom(task.socket, task.data.data(), task.data.size(), 0, (sockaddr *)&task.recvfrom.src, &srcLen);
        if (bytes == (ssize_t)task.data.size()) {
            // We got everything we asked for.
            done = true;
        } else if (bytes > 0) {
            // Only ask for the remainder next try.
            task.data = std::vector<uint8_t>(task.data.begin() + bytes, task.data.end());
        }
        break;
    }

    case ...
    }

    if (done) {
        std::lock_guard<std::mutex> guard(threadsToWakeLock_);
        threadsToWake_.push_back(task.threadID);
    }

And somewhere in this manager, on the queue thread, it'd probably do something like:

while (!stopped) {
    fd_set reads;
    fd_set writes;
    fd_set excepts;
    // Assume this also returns the highest fd, which select() needs.
    int maxFd = GetSocketsFromQueue(&reads, &writes, &excepts);

    // We won't notice new queued operations during the timeout.
    timeval timeout;
    timeout.tv_sec = 0;
    timeout.tv_usec = 10000;
    select(maxFd + 1, &reads, &writes, &excepts, &timeout);

    ProcessTasks();
    CheckTimeouts();
}

I'm not sure, but I wouldn't necessarily try to put all the different full operations on netManager - but I haven't looked into it deeply. Generally, I think it's better to keep validating arguments, dealing with PSP structs, etc. inside the HLE funcs.

I don't know if we need a separate thread for each host socket or socket type. I would think, using select(), we could simply multiplex.
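For the HLE-facing half, here is a minimal sketch under the same assumptions (queueLock_ and queue_ are hypothetical members, error handling is omitted, and the socket is assumed to already be non-blocking) of how the Send() from step 4 might try the socket first and only queue the remainder:

```c++
// Hypothetical: returns true if the non-blocking send finished immediately,
// false if the rest was queued and the HLE thread should be made to wait.
bool NetplayManager::Send(int sock, const void *data, size_t size, SceUID threadID) {
	int sent = (int)send(sock, (const char *)data, (int)size, 0);
	if (sent == (int)size)
		return true;  // Done right away, no need to block the emulated thread.

	NetOperation op{};
	op.type = NetOperationType::SENDTO;
	op.threadID = threadID;
	op.socket = sock;
	// Queue only what's left (or everything, if nothing went out yet).
	const uint8_t *p = (const uint8_t *)data;
	op.data.assign(p + (sent > 0 ? sent : 0), p + size);

	std::lock_guard<std::mutex> guard(queueLock_);
	queue_.push_back(std::move(op));
	return false;
}
```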

-[Unknown]

adenovan commented 4 years ago

From my experience, multiplexing onto one socket will absolutely break the emulation timing. I tried that before and got many mini-freezes waiting for packets; at the least we should have one socket to multiplex the full operation of one protocol. If we're in tunnel mode, this is enough.

The 4 main sockets would be

Those 4 main sockets can be connected in the background ahead of time, and we just emulate the timing properly, either by connecting them to the adhoc host or by doing pure peer-to-peer at game boot.

What I can't figure out, without a NetSocketManager higher-layer abstraction, is how to handle the proAdhoc protocol that does full P2P without multiplexing, and how to sync the timing in a background thread. We will have many sockets to handle here: syncing timeouts, making sure each socket is still connected, can be flushed, and so on. @ANR2ME might know better, as he is working on this to make blocking calls better. But without one thread per opened socket we'll just end up with mini-freezes again, which I guess is what causes the audio stutter.

To keep up PSP compatibility we need to write that abstraction layer first, before feeding it into the protocol underneath. Some kind of PacketManager would also be good: if we're travelling over the internet, compressing the packets with snappy helps a lot. I already tested it on Amultios and it does a great job, but we need to make sure someone can sync the background threads of that many opened sockets without multiplexing (kind of a lot of tasks and threads running in the background).

One game that opens many adhoc sockets is Phantasy Star Portable 2; it opens 20 or more in 4-player play. This game tends to break a lot because of failures to connect, or waiting for something that isn't synced properly due to packet loss; even locally this game can still break often.

This ticket can also easily be reproduced with Digimon World Re:Digitize: the game will just freeze, with the audio stopping, if we're not supplying any packets on the blocking call. Even if we supply them, the resource on the main thread is already locked by the game and it just keeps freezing.

unknownbrackets commented 4 years ago

Well, nginx famously multiplexes - it doesn't have to be slow. As long as you do it with non-blocking operations, the idea is you're just moving everything along one thing at a time.

For example, this would be slow:

select(...);
// fd is an fd ready for read, but BLOCKING.

// Must be ready now.
read(fd, data, size);

If only 1 byte is available, that will block and slow everything else down. Instead, you want to "keep the line moving." Read what you can, and then keep going:

select(...);
// fd is an fd ready for read, but NON blocking.

// Read what we can.
ssize_t got = read(fd, data + task.totalRead, size - task.totalRead);
if (got > 0)
    task.totalRead += got;
if (task.totalRead < size) {
    // We must select() this again and go next time.
}

Non blocking operations are fast, because they read from the OS buffer. select() can get slow with lots of fds, but at PSP scales (i.e. not hundreds of open sockets), it's probably fine.

You want the multiplexer to just have ONE JOB. Its job is to talk to the network and just get stuff done. It shouldn't be making decisions, looping through arrays, writing structs to PSP RAM, deciding error codes, etc. That's a distraction from its one true purpose: quickly multiplexing on the data.

Instead, all those distractions can be handled outside while that thread is probably busy reading or writing more data to sockets.

Using a thread per socket is just the "blocking socket operations" model of thinking.

It definitely makes sense to use multiple sockets, though.

If a game makes a blocking call, the socket should just be non-blocking still. We'll keep the HLE thread asleep, and the netManager will keep select() and multiplexing that socket each loop. When it finally reads all its data, we will wake the thread.

That way, we never once blocked, but from the PSP game's perspective, it was the same. The HLE thread was blocked until the read was complete. It doesn't need to know the difference.

-[Unknown]

adenovan commented 4 years ago

Aha, got it, that makes sense. We can implement it then by just implementing that multiplexer, not caring how many fds are open underneath as long as they're still connected, and taking all the other distractions outside.

One thing to handle is flushing the data outside and keeping track of which queued packets belong to which fd, and discarding them if that happens. This can be tricky, though I think I can make it work.

A packet queue manager is absolutely a must here to handle that distraction later. I wonder if this could be implemented on real PSP hardware too, but it looks like a good approach to tackle this problem.

anr2me commented 4 years ago

I see a send() after a changeBlockingMode(socket->id, 0); so I think there's still some blocking network there.

So, here's what I'd suggest:

1. Create a new class, something like NetplayManager (or whatever name.)  This is the class that will handle the ACTUAL networking calls, similar to AsyncIOManager.

2. On NetInit and NetShutdown, tell the NetplayManager to start and stop, kinda like in sceIo.cpp.

3. The trickiest part will be HLE that need to do multiple network operations in sequence, but we don't have to move everything at once.  I'd start with one of the simpler calls that just does a single network operation - this will be easier to offload.

4. Instead of `send(sock, data, sizeof(data), 0);` call something like:
   ```c++
   if (!netManager.Send(sock, data, sizeof(data), __KernelGetCurThread())) {
       __KernelWaitCurThread(WAITTYPE_NETWORK, sock, 0, 0, false, "net send");
       // Or just constantly have an event like cwcheat?
       CoreTiming::ScheduleEvent(usToCycles(100), checkNetworkEvent, __KernelGetCurThread());
   }
   ```

5. After that, you're pretty much done with the HLE call.  Inside the netManager, you'll need to put the operation into a queue so the runner can pick it up.  There are many examples of this, including GLQueueRunner, AsyncIOManager, etc.

6. If it's a non-blocking mode socket, it can immediately call `send(sock, data, sizeof(data), 0);` and if this returns `sizeof(data)` then you are done, no queuing needed.  In my above code example, this would return false so the thread would not be made to wait.  If it sends partially, this would need to be told to the queue so it can send only the rest.

7. When the send is complete (on the other thread) it would flag it, but NOT directly call `HLEKernel::ResumeFromWait()`.  Only the emulator thread can call this.

8. Instead, this is why we need to schedule a CoreTiming event.  That runs on the emulator thread.  It would ask `netManager` if the operation was done, and if so, it would call `HLEKernel::ResumeFromWait(threadID, WAITTYPE_NETWORK, sock, result);`.  If not done, it would schedule another event to check again later.  Careful: schedule too often, and you'll slow down emulation, schedule too inoften and there may be network lag.

9. At this point, simple send/recv funcs should work.  The next step is "layering."  Instead of calling ResumeFromWait always, the event will need to get smarter.  I recommend organizing this like `IoAsyncFinish`.  If another step in the operation is needed, you'll just keep the HLE thread waiting and tell the manager to do more.

10. I haven't noted it above, but most likely you'll need some `std::map` or similar of state per thread.  In theory, there should only be one blocking network operation per thread, so you can store the current state (like if you're doing multi-stage network requests) per thread and just clear it at the same time as waking the HLE thread.

If you skip 6 above, it would in theory even be fine to keep the requests all blocking - but it might be better to use select(). I don't know if games make multiple network requests concurrently from multiple HLE threads.

-[Unknown]

Thanks, I was trying to understand how sceIo reading works (which usually has blocking behaviour, just like a socket's recv function); I guess this is the correct procedure.

And it seems Jpcsp also simulates blocking behavior using an internal wait type (JPCSP_WAIT_NET):

// Wait types
    public static final int PSP_WAIT_NONE = 0x00;
    public static final int PSP_WAIT_SLEEP = 0x01; // Wait on sleep thread.
    public static final int PSP_WAIT_DELAY = 0x02; // Wait on delay thread.
    public static final int PSP_WAIT_SEMA = 0x03; // Wait on sema.
    public static final int PSP_WAIT_EVENTFLAG = 0x04; // Wait on event flag.
    public static final int PSP_WAIT_MBX = 0x05; // Wait on mbx.
    public static final int PSP_WAIT_VPL = 0x06; // Wait on vpl.
    public static final int PSP_WAIT_FPL = 0x07; // Wait on fpl.
    public static final int PSP_WAIT_MSGPIPE = 0x08; // Wait on msg pipe (send and receive).
    public static final int PSP_WAIT_THREAD_END = 0x09; // Wait on thread end.
    public static final int PSP_WAIT_EVENTHANDLER = 0x0a; // Wait on event handler release.
    public static final int PSP_WAIT_CALLBACK_DELETE = 0x0b; // Wait on callback delete.
    public static final int PSP_WAIT_MUTEX = 0x0c; // Wait on mutex.
    public static final int PSP_WAIT_LWMUTEX = 0x0d; // Wait on lwmutex.
    // These wait types are only used internally in Jpcsp and are not real PSP wait types.
    public static final int JPCSP_FIRST_INTERNAL_WAIT_TYPE = 0x100;
    public static final int JPCSP_WAIT_IO = JPCSP_FIRST_INTERNAL_WAIT_TYPE; // Wait on IO.
    public static final int JPCSP_WAIT_UMD = JPCSP_WAIT_IO + 1; // Wait on UMD.
    public static final int JPCSP_WAIT_GE_LIST = JPCSP_WAIT_UMD + 1; // Wait on GE list.
    public static final int JPCSP_WAIT_NET = JPCSP_WAIT_GE_LIST + 1; // Wait on Network.
    public static final int JPCSP_WAIT_AUDIO = JPCSP_WAIT_NET + 1; // Wait on Audio.
    public static final int JPCSP_WAIT_DISPLAY_VBLANK = JPCSP_WAIT_AUDIO + 1; // Wait on Display Vblank.
    public static final int JPCSP_WAIT_CTRL = JPCSP_WAIT_DISPLAY_VBLANK + 1; // Wait on Control
    public static final int JPCSP_WAIT_USB = JPCSP_WAIT_CTRL + 1; // Wait on USB
    public static final int JPCSP_WAIT_VIDEO_DECODER = JPCSP_WAIT_USB + 1; // Wait for sceMpeg video decoder
// Polling period (micro seconds) for blocking operations
    protected static final int BLOCKED_OPERATION_POLLING_MICROS = 10000;
...
public void blockCurrentThread() {
    long schedule = Emulator.getClock().microTime() + BLOCKED_OPERATION_POLLING_MICROS;
    Emulator.getScheduler().addAction(schedule, this);
    Modules.ThreadManForUserModule.hleBlockCurrentThread(SceKernelThreadInfo.JPCSP_WAIT_NET);
}
anr2me commented 4 years ago

Aha, got it, that makes sense. We can implement it then by just implementing that multiplexer, not caring how many fds are open underneath as long as they're still connected, and taking all the other distractions outside.

One thing to handle is flushing the data outside and keeping track of which queued packets belong to which fd, and discarding them if that happens. This can be tricky, though I think I can make it work.

A packet queue manager is absolutely a must here to handle that distraction later. I wonder if this could be implemented on real PSP hardware too, but it looks like a good approach to tackle this problem.

Also remember, when reading/receiving data from UDP you'll need to read the full available data from the socket's buffer at once, because the OS will discard any leftover data; you can use the MSG_PEEK flag to get the exact size available to read/recv.
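As a minimal sketch of that peek, assuming POSIX-style sockets (on Winsock the buffer pointer and length types differ slightly, and this helper is hypothetical):

```c++
#include <sys/socket.h>

// Peek at the next UDP datagram without removing it from the socket buffer.
// 64K of scratch space is enough because a UDP datagram can never exceed 64K.
static ssize_t PeekNextDatagramSize(int fd) {
	static char scratch[65536];
	// Returns the datagram's size, or -1 (EWOULDBLOCK on a non-blocking
	// socket when nothing is pending).
	return recv(fd, scratch, sizeof(scratch), MSG_PEEK);
}
```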

adenovan commented 4 years ago

Yep, I'm aware of that and handle it on Amultios on a per-packet basis; PTP also does the same thing. This is where the compression kicks in to reduce the latency even more on internet play, and boom, we got Tekken working fast, almost like on a local network, at 30ms latency.

Well, there's a lot to do here, but we can refactor it into much cleaner code and start to fix more calls in HLE.

Anyway, if you need something tested on the PSP side, just drop me a message so we can dig into it more with auto tests. I need to write some remote-logger homebrew to submit all the PSP client requests at the same time when launched together, and start inspecting the behavior on many PSPs at once.

Doing IO over USB will crash the PSP test program if the USB is disconnected, so I need a remote PC that acts as a logging server at every checkpoint, substituting the log calls over the network instead of USB on the host, so we can inspect the adhoc APIs better.

unknownbrackets commented 4 years ago

Hm, I was under the impression that with UDP, you always got one (full) message per recv(). It could be a separate type and always read into a 64K buffer to avoid needing to peek? I assume the PSP's variant can't send packets larger than that anyway, same as UDP.

-[Unknown]

anr2me commented 4 years ago

Hm, I was under the impression that with UDP, you always got one (full) message per recv(). It could be a separate type and always read into a 64K buffer to avoid needing to peek? I assume the PSP's variant can't send packets larger than that anyway, same as UDP.

-[Unknown]

Since you need to tell recv/recvfrom() the size you want to receive, a peek will be needed to prevent unexpected issues. If the size is smaller than the available data (the available data can be smaller than the socket buffer size, depending on the sender), the leftover data will be discarded; if the size is bigger than the available data, recv/recvfrom will return an EAGAIN/EWOULDBLOCK error (non-blocking) or wait for more data (blocking) before returning. However, this losing-leftover-data behavior only affects UDP; TCP doesn't have it, so we can take any amount of data from a TCP socket's buffer without worrying about losing data, just like reading from a file.

unknownbrackets commented 4 years ago

Really? https://pubs.opengroup.org/onlinepubs/009695399/functions/recvfrom.html

I would expect passing a 64K buffer to recvfrom would just always succeed. What error code does it give?

If you're trying to reduce latency/audio impact/etc. you probably want to reduce total syscalls, if possible.

-[Unknown]

adenovan commented 4 years ago

If we're on a local network, passing a 64K buffer and recving into it is no problem. It depends on the interface, I think, because it often fails if the buffer size is larger than the MTU on internet play, in the case of sending.

But for receiving, that can work; I'm just not sure about the performance penalty if we pass a large buffer in each call to recv.

As for how PPSSPP handles the syscall, I'm not familiar with that code yet; I can probably do that too after understanding how it works.

unknownbrackets commented 4 years ago

Well, when I say syscall I'm talking about OS (Linux, Windows, Android) overheads. It'd have to be benchmarked, but using a fixed 64K buffer and memcpy()ing out of it may be faster than making two syscalls.

https://stackoverflow.com/questions/5103282/are-repeated-recv-calls-expensive
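A sketch of that single-syscall idea, with a hypothetical helper (real code would also need to handle the PSP-side pointer and Winsock differences):

```c++
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>

// One recvfrom() into a fixed scratch buffer instead of a peek plus a second recv.
// Anything beyond what the caller asked for is dropped, matching UDP semantics.
static ssize_t RecvDatagramInto(int fd, void *dest, size_t destSize, sockaddr_in *src) {
	static uint8_t scratch[65536];
	socklen_t srcLen = sizeof(*src);
	ssize_t got = recvfrom(fd, (char *)scratch, sizeof(scratch), 0, (sockaddr *)src, &srcLen);
	if (got > 0)
		memcpy(dest, scratch, std::min((size_t)got, destSize));
	return got;  // Full datagram size, or -1 if nothing was pending / an error occurred.
}
```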

-[Unknown]

unknownbrackets commented 2 years ago

I know some work was done toward using non-blocking networking calls. Has this improved?

-[Unknown]