Open abstrakt8 opened 3 years ago
Given the following, I'm not surprised that lazer is giving a hit window of 60ms:
Not sure where the 59.5ms figure cited above is coming from.
Not sure where the 59.5ms figure cited above is coming from.
Took the value from https://osu.ppy.sh/wiki/en/Beatmap_Editor/Song_Setup#overall-difficulty . When hovering over the beatmap difficulty stats in stable you can also see the hit window for 100
being +-59.5ms
. Of course it could be implemented differently internally.
this may be an intentional fix, for what it's worth. will need further discussion.
The .5
figure is coming from the fact that osu!stable uses a strict <
instead of lazer's <=
. That is to say, in osu-stable terms, <60
is equivalent to saying <= 59.5
(or a timing window of 59.5ms) because the resolution is 1ms. Thus it's also song select's way of showing that this is the case as well.
Of note: In osu-stable, osu!mania uniformly uses <=
, and I believe osu!taiko also uses <=
for the normal hit (strongs are still <
).
This could be fixed by subtracting 1 from these values, however I propose we keep them as-is and touch upon this in the future. I've labelled this issue as such.
The compatibility is one thing, but the replay playback errors are another. LegacyBeatmapEncoder
applies a rounding operation to replay frames:
This means that, for instance, with a timing window of 60ms, if you hit a note at +59.6ms, it will be a 300 in the original play but then will change to +60ms in the replay and thus become a 100. In that case maybe frame times should be truncated/floored when being written out to replay?
Probably not floored, because it works both ways - floor(-59.6) == -60
. But truncation sounds reasonable.
Probably not floored, because it works both ways - floor(-59.6) == -60
. But truncation sounds reasonable.
Edit: I guess it doesn't matter since we're not dealing with relative time values... But that means I don't think just flooring is enough to avoid replay playback errors. You'd want to ceil
early hits.
relevant wiki page https://github.com/ppy/osu/wiki/Gameplay-differences-from-osu!stable#hit-window-edge-calculations-dont-match-stable (based on a quick glance of the discussion here, it might be missing some details for specific hit windows?)
In the above wiki pr I've done some investigation about how stable does hit window comparisons. In summary:
Ruleset | Comparison (stable) | Comparison (lazer) | Comparison (lazer replay) |
---|---|---|---|
osu! | abs(round(hit error)) < floor(hit window) |
abs(hit error) <= hit window |
abs(round(hit error)) <= hit window |
osu!taiko | abs(round(hit error)) < floor(hit window) , except for the miss window which uses <= |
abs(hit error) <= hit window |
abs(round(hit error)) <= hit window |
osu!mania | abs(round(hit error)) <= floor(hit window) |
abs(hit error) <= hit window |
abs(round(hit error)) <= hit window |
In short:
<=
is used, but larger by 1 ms compared to stable wherever <
is used. (Assuming the hit windows have the same value in stable and lazer, of course.) The max hit error can be up to 0.5 ms smaller or largerI haven't been able to verify this in-game though as I don't have the tooling or time to do so. Maybe there's a relevant test scene in lazer, or if not maybe one of those could be cooked up (comparing against stable rounding/truncation)?
LegacyBeatmapEncoder
applies a rounding operation to replay frames:
Rounding the hit time for replays should be correct, but I'm not entirely sure what that implies. Are these rounded hit times used for replay playback of plays set on lazer/stable or only for .osr exports?
I believe osu!taiko also uses
<=
for the normal hit (strongs are still<
).
I haven't been able to find anything of the sort in the stable reference code.
In stable, taiko misses are given here, the comparison for which (<=) can be seen here and here (variable name for the "miss window" is a bit misleading). Greats and Ok's are compared with <. There's no difference between regular and large taiko notes from what I can tell.
You'd want to ceil early hits.
Both stable and lazer use absolute hit error, so it's always comparing two positive values. There's no special case to consider here, especially since hit times are rounded in stable anyway.
Where timeOffset
is the hit error (positive or negative, but made positive above). This is the same in stable (example).
Rounding the hit time for replays should be correct, but I'm not entirely sure what that implies
Both stable and lazer use absolute hit error, so it's always comparing two positive values. There's no special case to consider here, especially since hit times are rounded in stable anyway.
There are two issues presented here: one is the hit error as you're playing the game, and the other is the timing error of replays. Legacy replays only store integer time deltas whereas lazer deals with floating-point time, so lazer rounds the time values to write into the replay.
Suppose you have a hitobject at 500ms. The user hits a circle at -59.6ms delta lazer-time (i.e. 440.4ms). Should this value be rounded up or down for storage in the replay. Conversely, suppose the user hits at +59.6ms, should this value be rounded up or down? My proposal is to ceil the former and floor the latter, that way even though there's still up to 1ms discrepancy it will always be the correct judgement.
There's no difference between regular and large taiko notes from what I can tell.
I believe your assessment here, and for the other hit results, is correct.
Suppose you have a hitobject at 500ms. The user hits a circle at -59.6ms delta lazer-time (i.e. 440.4ms). Should this value be rounded up or down for storage in the replay. Conversely, suppose the user hits at +59.6ms, should this value be rounded up or down?
Rounding the value emulates the rounded-integer temporal granularity of stable. With that, the only difference for replays in lazer (with integer time values) is that they're still compared against doubles, non-floored hit windows (as well as the <
vs <=
thing). Using ceil and floor (instead of round) would only cause yet another difference between stable and lazer.
This affects the lazer part of the examples I presented I suppose. I've updated them with lazer replay timings and also made it all into a table.
even though there's still up to 1ms discrepancy it will always be the correct judgement.
Well I thought through the implications of this... (see updated examples section) but I can't convince myself that this would solve anything.
In Stable: +59ms => Ok, +60ms => Meh In Lazer: +59ms => Ok, +60ms => Ok which is incorrect.
Reading through this again, I see two options:
This one is probably too technical for community discussion so we'll probably have to come to a conclusion internally. @ppy/team-client anyone wants to sound off on this, or is this going to be a matter of "try both and see which works best"?
I haven't been able to verify this in-game though as I don't have the tooling or time to do so. Maybe there's a relevant test scene in lazer, or if not maybe one of those could be cooked up (comparing against stable rounding/truncation)?
is there a test scene for this? it would be invaluable for both outlined options
I'm not sure how such a test scene would look because it would require... including stable logic? Or I guess capturing some ground-truth stable replays and seeing how lazer handles them.
For now I mostly tagged to try and get some feelers out on the technical direction, to see if there are any preferences. Chances are that development on this task is gonna entail writing that test scene, then both variants, then choosing one.
I prefer making alterations to HitWindows
if possible. Reasons:
The question I'd then have is: will this break backwards compatibility (playing lazer replays on stable) and is that something we even care about?
I'm also not against the other proposal if it turns out this is too difficulty to implement in a nice way.
will this break backwards compatibility (playing lazer replays on stable) and is that something we even care about?
I can't say if this specific proposed change will break anything, and that's probably to be tested at the implementation phase, but our backwards compatibility of replays is already pretty jank if I'm honest. Like, for one thing, we're still using the legacy format, but also allowing scores with lazer-only mods to be exported to that old format, so if a lazer mod is anything more then visual, then it can play in a completely broken manner on stable. It only sorta kinda works because we're appending some special lazer secret sauce that stable can't read.
As to whether the above should be fixed (not by porting mods over, obviously, more like something like stable reading the lazer secret sauce and loudly refusing to do anything with the replay), no idea, and out of scope here.
I think with the general feelers out of the way there's not much to discuss here further until someone actually takes this on, tries to implement either option, and sees what happens.
Changing ResultFrom
in HitWindows
to virtual and overriding this method for the specific ruleset, according to the table sent by @Walavouchey, fixed the discrepancy in the replay.
In particular for OsuHitWindows
the change was
public override HitResult ResultFor(double timeOffset)
{
timeOffset = Math.Abs(Math.Round(timeOffset));
for (var result = HitResult.Perfect; result >= HitResult.Miss; --result)
{
if (IsHitResultAllowed(result) && timeOffset < Math.Floor(WindowFor(result)))
return result;
}
return HitResult.None;
}
Is this valid solution for this or am I missing something.
I also send the branch I'm working on, or should I make PR or draft PR. Obviously it needs more testing for osu and other rulesets.
@Wleter if you ask me then any attempt at this should likely include test cases in code and a comprehensive cross check against stable, which you currently don't have.
https://github.com/ppy/osu/pull/24636 also materially affects this (also touches mania hit windows). The two need to work together correctly.
We were also thinking that possibly this hit window interval clusivity thing should be a ctor argument instead (see relevant discord conversation).
I'd say that unless you're willing to put the effort into testing multiple approaches and testing the results as already discussed above it's probably best left to one of us given that we have access to stable source.
@bdach sure I'll leave it to you since you have stable source.
Thinking about this from a practical perspective, I do wonder about floating point precision on time values. We are already capping the game (input) to 1000 hz, and I do not foresee that changing. So the real question is how much value there is in keeping double precision in the first place.
After rounding, the maximum error with int
based timing is 0.5 ms, which is arguably well within expectations of precision (the tightest timing in the game is ±11.5 ms for an osu!mania MAX with HR applied, making the precision around 2% of this judgement).
I may want to rescind my opinion above and go with the simple approach here..
I may want to rescind my opinion above and go with the simple approach here..
I'm not sure I can judge which approach is the simple one without trying both if I'm honest with you.
Simple in terms of "matching stable means we don't have to track these discrepancies across multiple implementations". But I guess we are using double
everywhere for time already so I can see what you mean.
I don't think this is something that is a blocker for the path-to-ranked project since it mostly affects (very few) replays and is otherwise unnoticeable in gameplay. We can fix this at any point later on, with likewise minimal effect to gameplay.
Adding another sample, just in case (same settings as https://github.com/ppy/osu/issues/25973)
Beatmap
Gameplay score
Replay score
Replay file
since it mostly affects (very few) replays
As high-level osu!mania gameplay is inherently reliant on tight timing windows, I think this is probably bound to happen more often than not for osu!mania replays. I don't think it's a blocker for ranked play though, as long as the gameplay score (not the replay one) makes sense.
This does affect both gameplay and replays FWIW
Bumping this to p0 until i or someone else investigates.
As noted, currently lazer hit windows are too lenient for standard and taiko, and too strict for mania, compared to stable hit windows. For OD 10 in standard, the OD hit window formula gives 20 ms for the 300 hit window. In reality, true hit error in stable, as measured by the game's clock, can be in the interval (-19.5 ms, 19.5 ms) to get a 300, as described by the stable comparison formula. In lazer, a new, more intuitive formula is used, which results in the interval of [-20 ms, 20 ms], a 0.5 ms difference, a non-trivial reduction in difficulty. The effect is even more pronounced for fractional ODs, as stable floors the hit window, while lazer doesn't, resulting in up to 1.3-1.5 ms of difference. The disparity disadvantages existing stable scores in standard and taiko, and advantages existing stable scores in mania. This means that if an obsoleting of stable scores isn't desired, then these stable's hit windows must be matched in lazer, and thus the intuitive formula sadly cannot be followed as it is.
This also means that 20 ms that is obtained from the OD hit window formula is no longer the "true" hit window - it's actually 19.5 ms. If the player asks for the actual hit window (eg. by hovering over the OD value in song select), it would therefore be best to return 19.5 ms, and not 20 ms. As a side note, it is the "true" hit window value that gets divided by speed rate, not the result of the OD hit window formula (except for mania, of course).
There is a disadvantage to rounding player inputs to integers - it introduces a small amount of additional variance not due to the player's input variance. This play, set on stable, has the UR of 28.84. Had it been played on lazer, it would be expected to have about 28.55 UR. This effect is less pronounced the higher the UR. Thus, it is preferable to use double
for hit error.
This also means that downgrading double
to int
almost always changes the UR of the replay by a slight amount. This can be observed in Paturages's comment above. The UR value changing between gameplay result screen and replay result screen, while only changing slightly, could be confusing. The gameplay UR should be the one considered correct, not the replay one. That's an issue for another thread though.
An idea to resolve the issue would be to find a formula to use as the judgement formula which is equivalent to stable's, and works the same regardless of whether input is rounded (or transformed in another way, as we can modify LegacyScoreEncoder
) or not, so that outputting legacy replays works. This is possible, and without rounding hiterror
. The formula is abs(hit_error) < floor(hit_window) - 0.5
for standard and taiko (exclusive hit windows), and abs(hit_error) < floor(hit_window + 1) - 0.5
for mania (inclusive hit windows). They are equivalent to stable's for any arguments, except hit_error half-integer. This formula also has a nice property of just having the "true" hit window formula on the right hand side and just hit error on the left hand side.
What about the hit_error half-integer case? Stable uses Math.Round(), which uses a to-even tiebreaker method. Of course, the half-integers that aren't close to the "true" hit window edge work as expected for both stable's and the proposed formula. However, in stable, half-integers at the edge can be either judged as in or out, depending on the OD. The proposed formula simply considers them all out, regardless of OD. It's not impossible to support to-even half-integer rounding in the judgement check method, but it's pointlessly complex just to not make hitting accurately harder by less than a nanosecond, and the new formula is more consistent. I would say half-integer hit errors not resulting in the same hit result in gameplay as in stable is a non-issue. It's only meaningful in the context of legacy score encoding (and even then, it's a very rare issue).
Not sure if I'm understanding legacy score encoding correctly, but from looking at the linked code it seems to me that it's currently not working properly. For example, on osu!mania with OD 10, a hit that is 34.4 ms late would be a Good on live lazer, but rounded, it would be a hit error of 34 ms, which, when replayed, would be a Great. This would be fixed with the new proposed formula (by rounding, and with away-from-zero (in terms of hit error) tiebreaker to handle the half-integer case).
Unfortunately, from what I can see, LegacyScoreEncoder
operates on replay frame time values, not hit errors. We would need to know the time of the object being hit (then, with the new proposed formula, we could find the correct integer to downgrade to). As it is, a 34.4 ms early hit would be x.6 in beatmap time, and thus change the rounding direction compared to a 34.4 ms late hit. Because of this, it doesn't seem to be possible to round in the proper direction with just the replay frame time value. Because of this, we would need to know the time of the object being hit to correctly round half-integer hit errors (again, it's a very rare case, but once in a blue moon this might create a mismatch between gameplay and legacy replay).
After this issue is resolved, previously set lazer scores might have higher accuracy than they should, if there were hits at the edge (for example, hits in the [19.5 ms, 20 ms] range on OD 10 in standard). Unfortunately, it's either them or all the historical scores from stable that will be inaccurate, so the choice seems clear. This is why I would say this is actually an important issue to get right. Correcting ruleset rules after scores had been set on the leaderboards and pp awarded could be difficult.
I've submitted a PR that aims to resolve the issue using this approach here.
Stable uses Math.Round()
Out of curiosity and concern, how are you checking against stable when the code is not published? Usually we'd have people ask for the relevant extract of code before making statements like this.
I claimed that based on Walavouchey's investigation with stable reference code, the fact that Math.Round()
is the method one would normally use to round in C#, and that indeed that's what was used in LegacyScoreEncoder
.
Your comment made me think more about the rounding, and its relation to legacy score encoding. The part of the formula in Walavouchey's table responsible for the hit error reads simply: round(hit_error)
. But this, in fact, hides one more operation, since hit error = hit time - object time
. Therefore, in stable, rounding could actually be applied in either of two places (or even both, but for stable, that would be equivalent to (II)):
hit_error = round(hit_time - object_time)
, orhit_error = round(hit_time) - object_time
.Reading Walavouchey's table, I assumed (I) was the used formula, and my remarks about legacy score encoding in the previous comment were based on that assumption. But in fact, the used formula could also be (II), and if that's the case, then the legacy score encoding can be rounded in the correct direction with just the replay frame value, as from that formula it can be seen that rounding the time value is correct to downgrade to int in legacy score encoding, as it basically repeats the same operation.
If object_time
is an integer, and osu!stable hit object times fortunately are integers, these are equivalent (proof omitted). I was wrong in the "Unfortunately, from what I can see {...}" paragraph. The rounding direction does change, but that's the correct behavior to downgrade a fractional value to a symmetrical interval of integers.
However, this is the correct way to downgrade only if the judgement formula is equivalent to stable's, and my '34.4 ms late hit' point still stands. For example, in mania, a 16.4 ms off (either way) hit would be Perfect in stable, and in lazer with my PR, but would be a Great in live lazer (if Perfect hit window was 16 ms), while being rounded to 16 ms during legacy score encoding, which would be a Perfect.
Moreover, in lazer, the object time is a double. The two formulas are not equivalent for fractional object times. For double
object times, the legacy score encoding will work correctly only if stable followed the (II) formula. Although I would not expect stable to be able to replay lazer's double
object time maps anyway.
I'm a bit confused how it's possible that a replayed osu!mania lazer play doesn't have the exact same result counts? https://github.com/ppy/osu/issues/25973 was closed with a comment pointing to LegacyScoreEncoder
code, and I don't quite see how that's relevant. Does immediately replaying an osu!mania lazer play somehow use legacy encoded replays?
I'm a bit confused how it's possible that a replayed osu!mania lazer play doesn't have the exact same result counts? #25973 was closed with a comment pointing to
LegacyScoreEncoder
code, and I don't quite see how that's relevant. Does immediately replaying an osu!mania lazer play somehow use legacy encoded replays?
Reading through Player.cs
it seems that legacy rulesets indeed use legacy replays in lazer. That's surprising to me, I thought lazer would have its own new, more accurate replay format.
If anyone was wondering, here's how the results are counted differently in Paturages's mania replays:
Maboroshi: the map is OD 8.3. live lazer's Perfect hit window is 15.77 ms, and Great hit window is 39.1 ms. In stable, they're 16.5 ms and 39.5 ms. In gameplay, hits that are off by x in [15.5, 15.77] ms are Perfect, but in the replay they get rounded to 16 ms and are thus Great. This is why the replay's Perfect count is lower. On stable, the Perfect count would be even higher than in lazer gameplay, and so would the sum of Perfect and Great counts. Neither lazer gameplay's nor replay's result counts are correct from stable's point of view. Counts in lazer with my PR are 2117/418/23/1/0/2, which looks legit.
The Weekend: the map is OD 8; live lazer's Perfect hit window is 16.1 ms, stable's is 16.5 ms. The rest of the hit windows are all 0.5 ms wider in stable because of rounding. In gameplay, hits that are off by x in (16.1, 16.5] are Great, but in the replay they get rounded to 16 ms and are thus Perfect. Similarly, all the other results have 0.5 ms more lenient hit windows in replay and stable, and this is why the replay has higher accuracy than the gameplay across the board. The hit result counts in the replay are actually the correct result counts from stable's point of view, as the judgement formulas are equivalent, and legacy score encoding rounds the same way as stable. I get the same hit results counts in lazer with my PR.
I'm deprioritising this from p0. https://github.com/ppy/osu/pull/26452 exists but it hasn't been reviewed by anyone but me despite being open for months and judging from references to it from elsewhere we might not be interested in doing anything about it at all. Unless I'm wrong @ppy/team-client?
Describe the bug:
For the
OsuRuleset
(did not research/test for other rulesets) I have found a bug as described in the title.Example:
Given OD=10, which means the hit-window for
Meh
(50) is supposed to be +-59.5ms:In Stable:
+59ms
=>Ok
,+60ms
=>Meh
In Lazer:+59ms
=>Ok
,+60ms
=>Ok
which is incorrect.This is demonstrated in one score of mine:
Side by side comparison: https://www.youtube.com/watch?v=IQGsfzOEyJY
I used gosumemory to calculate the hit offsets of this replay and at the suspicious note (=6th last note of the map) I hit the note with exactly +60ms offset. (https://gist.github.com/abstrakt-osu/46997fa9d203564ed1f1b1eee09a18ba#file-gosu-json-L894).
Screenshots or videos showing encountered issue:
Side by side comparison: https://www.youtube.com/watch?v=IQGsfzOEyJY
Beatmap: https://osu.ppy.sh/beatmapsets/1010865#osu/2117132 Score (+Replay): https://osu.ppy.sh/scores/osu/3321687146
Stable replay: https://youtu.be/FDhGuZh3P_U Lazer replay: https://youtu.be/Ositc3vu_bg
osu!lazer version: 2020.1225.0
Logs: