libretro / RetroArch

Cross-platform, sophisticated frontend for the libretro API. Licensed GPLv3.
http://www.libretro.com
GNU General Public License v3.0
10.17k stars 1.82k forks source link

Netplay now desynchronizes easily [bisected] #6309

Closed furiadeoso closed 6 years ago

furiadeoso commented 6 years ago

First and foremost consider this:

Description

BACKGROUND: Retroarch NETPLAY initially (at least until version 1.7.0) is ready to work without needing to check savestates during the gameplay. Check frames was an added feature because many cores or drivers could have a bug that caused desynchrony. However, each recovery through the check frames can cause a bad user experience due to unexpected changes in the state or due to involuntary inputs. This can be seen in real gameplay situations on internet and, above all, with more than 2 players and / or when it is necessary to add a bit of latency to avoid stuttering.

STATUS IN STABLE VERSIONS Some cores like fbalpha, in most of their drivers no longer have bugs, and allow to netplay without needing to be continuosly recovering savestates (check frames). The netplay was very solid and allowed the check frames feature to work for what was actually created: only as safe in case something unexpected happens badly.

STATUS FROM THE LAST CHANGES IN THE RA NIGHTLY CODE Now, even using the same core (FBAlpha) and roms (several different ones have been tried) that has been used in 1.6.9 and 1.7.0, in nightly versions, after netplay code changes, it is very easily be de-synchronize in situations of real netplay on the internet. The check-frames feature fixes the desynchrony, but as it happens by default every 10 seconds, big changes of state are seen in the game. However, reducing check-frames would probably only make the user experience worse.

Expected behavior

Simply the expected behavior is that the possibilities of desynchronizing are equal to those of the stable versions. Thanks to the state of netplay in the stable versions, It was already somewhat dependent on the cores mostly.

Actual behavior

The RA code itself for some unknown reason makes it easy for desynchronization to occur. And it is independent of the core, since the same core has been tested in the stable versions.

Steps to reproduce the bug

  1. Meet 3 or 4 friends on a real netplay session over internet and use a 3 or 4 players game on FBAlpha core.
  2. Client sides must compare their experience between RA last nightly version and last stable versions. In most cases, every 10 seconds, the second / third / fourth player will check a big change of state every 10 seconds, which makes the netplay totally impracticable. In any case, much worse to the situation of the previous stable versions.

Bisect Results

Las Good commit: https://github.com/libretro/RetroArch/commit/924007eda2d021ebb107f6051c26ea50db43b640

The issue is introduced in: https://github.com/libretro/RetroArch/commit/4a6a97be60449f97d84631e029f0496b00afbb23

Version/Commit

You can find this information under Information/System Information

Tested in: STABLE versions: 1.6.9 (OK) 1.7.0 (OK) 1.7.1 (BAD) commits: 924007eda2d021ebb107f6051c26ea50db43b640 (OK) 4a6a97be60449f97d84631e029f0496b00afbb23 (BAD)

Environment information

inactive123 commented 6 years ago

@GregorR Could you look into this?

GregorR commented 6 years ago

I just did a quick test of fbalpha with kod.zip in both 1.6.9 and master. In neither is fbalpha sufficiently deterministic to stay in sync. I get desyncs easily in both.

Perhaps the previous default check_frames value, 30, was sufficiently snappy to present the illusion of the game remaining in sync. If you preferred it, feel free to set it back. Many have complained about that value, so I backed it off, and now there's a new value to complain about.

furiadeoso commented 6 years ago

@GregorR It is strange that this happens to you with the KOD.zip . Because in my community we usually play using the value 0 of check frames because we had problems with the low values of check frames. And a lot of fba games work without desynchronization. You can see the games in our channel where we even broadcast the games at the same time we play online with 1.6.9: https://www.twitch.tv/emupartidas/videos/all

What I wanted to say in this issue is that we have detected a greater facility for desynchronization to occur from the new code. But independently of the value of check frames.

Put it (check frames default value) back under, if RA desynchronizes more easily, I do not think I'll solve it. I do not know if I explain myself well in English :)

GregorR commented 6 years ago

While it is of course possible that new bugs have been introduced, the change to netplay was to support a wider variety of input devices and to better support the mapping of input devices to clients. As such, the kind of bugs I expect are "everything breaks, the input doesn't go to the right place, wtf is happening", not desyncs.

Further, the fact that you had issues with low check_frames values indicates that desync was happening. check_frames does nothing if there is no desynchronization. It's harmless, although pointless, to set it to 1 with a core that does not desync, such as snes9x. Perhaps the nature of the desync is itself a bit unpredictable (well, certainly it is), and if it is unpredictable, the fact that you're experiencing more visible circumstances of desync is just poor luck. Perhaps you were desyncing and simply didn't notice before. Regardless the problem remains the problem: This core is not deterministic, and no amount of netplay cleverness (or bugginess) will change that.

furiadeoso commented 6 years ago

Check frames = 0 is not equivalent to TURN OFF check frames feature? Because we are more or less 20 different people playing and using normally ZERO as check frame value.

GregorR commented 6 years ago

check_frames = 0 turns off the feature, yes.

Relevant question: Are you playing over local network, over a fairly close distance, or are you worldwide? The only other relevant change is that the initial connection handshake is now larger (the initial connection takes slightly longer) as it involves device negotiation, so it's a bit less likely that close connections will be in lock-step just due to coincidence, but this is in the strange vagaries of network latency rather than anything to do with RA.

furiadeoso commented 6 years ago

The live real test have been Worldwide. 3 players. I acted as a host first. Ping to me aprox: player 2 and 3: between 25-35 ms . They said that was having big "jumps" each 10 seconds. I turned off check_frames and players 2 and 3 maintain the same state (they said have the same points in the game) but I (host) have different.

After that the second one acted as the host. I expeimented the same "jumps" each 10 seconds.

We tried the same (as I do every day in my community) in 1.6.9 to check it was happening someting strage with our connections in that moment. But It was perfect. KOD and DINO.zip

I don't undestant at all your last phrase, sorry :)

furiadeoso commented 6 years ago

Ah. I think we put some latency. At least range. Maybe 0 (min) - 2 (range) or 1 (min) - 3 (range). For playing with 3 is always necessary.

GregorR commented 6 years ago

Here's the problem:

(1) fbalpha is nondeterministic. It can and will cause desyncs, however,

(2) when and why nondeterministic cores desync is unpredictable, as it's a result of minute differences in timing. If the host happens to be ahead by a frame in one session and behind by a frame in another, entirely different things will happen. It's not at all unusual on a stable connection to get the same timing over and over again by dumb luck.

(3) It is certainly the case that the new changes affect timing, since everything affects timing, however,

(4) it's not possible to hunt down every minor timing change, since we're chasing accidental behavior, and what accidentally works for you won't accidentally work for everyone.

Input latency certainly increases the chance of accidentally not seeing desync as well; if the connection goes quickly and smoothly enough, nobody will ever load any state, so there's nothing to desync.

If you want to play with a nondeterminstic core and not desync, you MUST use stateless mode. With input latency it's possible in theory for stateless mode to be playable over a distance.

furiadeoso commented 6 years ago

What is the difference between activating the stateless mode (I never tried it) and setting check frames to 0?

GregorR commented 6 years ago

Setting check_frames to 0 just never checks if you're in sync. It's the wild west; if the core desyncs, too bad.

(If you've set check_frames to 30 and things seemed to go awry, that's because you were going out of sync, whether you could actually see it or not.)

stateless_mode is wholly unrelated. In stateless mode, it never progresses unless it has input from everyone, so there's never a need to rewind and load a state with new input. Since nobody ever loads a previous state, most cores won't desync even if they do desync in normal operation. Most :)

My speculation is that with 30ms latency and 3 input latency frames (48ms), you weren't desyncing because you were never actually doing any rewind-based netplay at all. But without forcing it so, that's a coincidence; update one driver, break one line in the network, play while somebody in the other room is watching HD porn, and bam, the extra 18ms latency pushes you over and desyncs occur. If it looks like it happens reliably with the new version, it's only because the slight changes to the connection seem to push you over that (very narrow) edge.

furiadeoso commented 6 years ago

So, if I understood well. Activating stateless mode as a host I force a "kaillera-like" mode, where I wait all the inputs and all the players await for them to progress, isn't it? So if all the players have enought latency the game can look smooth too.

GregorR commented 6 years ago

Yes. Stateless mode is exactly kaillera or mednafen mode.

furiadeoso commented 6 years ago

Ok. Thanks. I'm going to close the issue and we'll continue testing. The only thing that has left me perplexed is to read that fba behaves indeterministically, because we really have been playing and speaking by discord for a couple of months and we are aware of it. and only desynchronize in concrete games like metal slug. some have been very long gameplays.

furiadeoso commented 6 years ago

@GregoR Only one thing more, Can you check https://github.com/libretro/RetroArch/issues/6298? I tried cli commands and I think check_frames may not been having priority over the menu setting. Thanks

furiadeoso commented 6 years ago

@GregorR I have bisected the issue (See description) and have done more tests. In my opinion, there is a big difference after that commit and I think it is necessary that the netplay become a being as stable as before. The problem could be masked by diminishing the check frames default value, but I suspect that this would not be a satisfactory solution for most users. For my part, as an end user, for the time being it is not a viable option to use 1.7.1 for netplay. I encourage you to go deeper into the matter and, in that case, I am at your disposal to perform more tests as a user.

I think it would be useful if also more users show their experiences here, of course.

GregorR commented 6 years ago

After a bit of squirrely testing, I think I have a theory. First I want to reiterate that the bug is in the core: It is nondeterministic.

It seems that fbalpha behaves somewhat-almost-deterministically if your connection is OK and you reload states more often than is necessary. i.e., while its behavior isn't deterministic, overeagerly loading states brings it kinda inline.

One behaviorally minor (albeit performance critical) bug was fixed in that patch, which had previously caused netplay to over-reload and be unnecessarily expensive. The fix is the block starting on line 756 of netplay_sync.c:

      if (cont)
      {
         while (netplay->replay_frame_count < netplay->run_frame_count)
         {
            if (netplay_resolve_input(netplay, netplay->replay_ptr, true))
               break;
            netplay->replay_ptr = NEXT_PTR(netplay->replay_ptr);
            netplay->replay_frame_count++;
         }
      }

Without that block in place, it's noticeably harder to make fbalpha desync (I have to put my garbage network simulator settings up higher).

Please try commenting out that block and seeing if it helps things.

However: I WILL NOT remove that block just to make fbalpha kinda-sorta-almost work sometimes. Doing so makes netplay markedly less efficient for no good reason. Clients were resimulating frames they had already simulated with the exact same input, which makes no sense. Worse yet, when clients fell behind—say, due to simulating too many frames—the bug caused them to needlessly resimulate even more frames, which could be a cascading failure on poor networks. This fix is far too important for me to remove it just to make a buggy core pretend not to be buggy at the expense of netplay on all other cores.

If you have a suggestion, I'm all ears.

furiadeoso commented 6 years ago

Thanks, Gregor. If I understood correctly, I also agree to advance the efficiency of netplay. And it can not be sacrificed for one or two cores. I'm going to try what you say to check your theory and I'll comment on my results and some suggestions. I would also like Retroarch to remain the alternative with ggpo code to play more than two players in arcade games. Therefore, I will try to be constructive, so that no one is left behind.

Is there any kind of suggestion that I could also make to the fba core developers to improve the netplay?

And really, thanks again for your work. We are having a good time thanks to that.

GregorR commented 6 years ago

Well, fbalpha needs to be made (more) deterministic. Unfortunately, that's a hugely tall order.

It's further convoluted, incidentally, by the fact that check_frames is slightly screwier on fbalpha than it deserves to be: fbalpha's savestate code is nondeterministic independently of the actual behavior of the core. That is, you can save the exact same frame twice and result in two different savestates. As a result, the desynchronization detection also doesn't work. I thought I had code to detect this case and disable check_frames so as to avoid getting swamped reloading, but evidently that code isn't there or doesn't work. I'll have to investigate it.

Given that fbalpha seems perhaps not deeply nondeterministic, if they at least made the savestate code more reliable, check_frames could be sufficient to make it usable.

andres-asm commented 6 years ago

Hmmm That is, you can save the exact same frame twice and result in two different savestates. They were including pointer in the savestates, that could be the root of this problem I guess?

But I thought that was all fixed by now.

GregorR commented 6 years ago

I haven't looked deeply into it, but no, certainly not pointers. The states are transferable over a network just fine, so that's obviously not a problem :)

Probably a timestamp or some global counter not relevant to execution or something like that is getting saved. Or, maybe the states are compressed in a nondeterministic way or something like that. The states work fine, just I can call core_serialize twice and get different ones.

furiadeoso commented 6 years ago

It is possible that I'm not now understanding everything you say 😅.

In my community we turned check_frames to 0 because in many cases there was a false identification of desynchronization and savestates were loaded continuously, causing small stoppages. However, since we use check_frames 0 and ra 1.6.9 there are hardly any problems. Seconds, third and fourth players leave and enter at any time and have a correct state. It's also good between 32 and 64 bits recently. We play 99% with fba core. I only needed check_frames in a particular game, like metal slug. I hope this info can help.

GregorR commented 6 years ago

(1) The point of check_frames is that if a core is nondeterministic but only lightly nondeterministic (i.e., it desyncs occasionally), checking frames can bring it back into a workable state. fceumm is the canonical core with this issue; if you play for a couple of hours with a friend on fceumm, it is likely to desync once or twice.

(2) Some cores, regardless of whether the core's actual behavior is nondeterministic, save states in a nondeterministic way, so that it's impossible to check whether the state of one client is equivalent to the state of the other, which is what check_frames does. fbalpha is one such core.

(3) There is code to detect (2), but it appears not to work for fbalpha, causing reload storms if check_frames is enabled, due not to fbalpha's actual nondeterminism, but its states' nondeterminism.

(4) As well as having nondeterministic savestates, fbalpha is also nondeterministic. These two facts are entirely independent of one another.

(5) As it happens, the flavor of nondeterminism in fbalpha, at least for this particular class of machines (as it's a multi-emulator), is that while states will diverge with the same input, the states will usually not diverge if you unnecessarily reload them before giving said input.

(6) A bug in netplay accidentally invoked that strange behavior in fbalpha, making it appear more reliable than it is. The netplay bug has now been fixed, as regardless of kinda helping fbalpha, it made netplay markedly slower.

(7) Making fbalpha actually deterministic would be a tall order, but its nondeterminisic savestates are entirely independent of its nondeterminism. If that was fixed, this discussion would have ended at point (1).

furiadeoso commented 6 years ago

Do you mean that, given an identical situation in two or more netplayers, could it be generating two different savestate files?

I'm not sure if this is the case with fba or some concrete fba drivers. Maybe it's something that @fr500 or @barbudreadmon know.

JordiRos commented 6 years ago

(7) Making fbalpha actually deterministic would be a tall order, but its nondeterminisic savestates are entirely independent of its nondeterminism. If that was fixed, this discussion would have ended at point (1).

I made a comment in https://github.com/libretro/fbalpha/issues/166 related to this, as furiadeoso pointed me out. I think those fixes would be enough for solution (1), as it's what GGPO/FC are doing for its rollbacks.

furiadeoso commented 6 years ago

I can confirm your theory @GregorR . When these lines are discussed, netplay with FBA becomes as stable as in 1.7.0:

      if (cont)
      {
         while (netplay->replay_frame_count < netplay->run_frame_count)
         {
            if (netplay_resolve_input(netplay, netplay->replay_ptr, true))
               break;
            netplay->replay_ptr = NEXT_PTR(netplay->replay_ptr);
            netplay->replay_frame_count++;
         }
      }

As @JordiRos said, we are going to do some tests on the fbalpha core side. Is there any method to know in local tests if they are continuously loading savestates despite being unnecessary?

GregorR commented 6 years ago

Unfortunately, testing for this kind of thing is exceedingly difficult (hence why I'd written off fbalpha as nondeterministic). Your best bet is just to put those lines back in and try netplay. If possible I'd recommend simulating bad network conditions in some way, e.g. https://github.com/tylertreat/comcast

furiadeoso commented 6 years ago

Hi, @GregoR. I am doing a lot of tests to continue advancing with this issue. Could you tell me a game of 3 or 4 players and a core that you are sure are deterministic, to compare?

It is important because perhaps the problem is occurring when there is more than two players.

furiadeoso commented 6 years ago

BTW @GregorR Could you take a look at https://github.com/libretro/RetroArch/issues/6298 ? The last time I try to use chech-frames CLI command was in 4a6a97b and it did not work. I've been looking at the recent commits and I think it has not been fixed yet. Thanks!

GregorR commented 6 years ago

Yeah, I'm aware of the CLI issue. Suffice it to say that the way CLI arguments are handled in RA is sufficiently obtuse that I haven't fixed it yet :)

Regarding "best case" code: Make sure you're on the latest version, as I have made a few patches to netplay, and the best "it does everything right" core is SNES9x. Any ≥3-player game for SNES will do.

furiadeoso commented 6 years ago

Thank you very much, @GregorR GregoR. If it is very difficult to change the CLI system, it could also be a viable option to have more accessible for the user the modification of check frames, for example, changing to a type of input text field and removing it from advanced options. Currently it's a nightmare to change for example from 600 to 0.

furiadeoso commented 6 years ago

After doing some tests, It seems to be fixed in https://github.com/libretro/RetroArch/commit/a39bff6e039e987b31d9cc39656d990149d6c963 Thanks, GregoR.