ppy / osu

rhythm is just a *click* away!
https://osu.ppy.sh
MIT License

Unexpected frame stutters at regular intervals (GC related) #11800

Closed: pinembour closed this issue 1 year ago

pinembour commented 3 years ago

Describe the bug: A few times per map I encounter a stutter that makes the frame time spike for a single frame. The screenshot shows the issue in mania, but I encounter it in every game mode. From what I saw, nothing in the logs appears to happen at the same time as the stutter.

I can reproduce this issue with a stock or custom kernel, on 5.10 or 5.11, with an AMD or NVIDIA GPU, and whether built from source or using the AppImage.

Screenshots or videos showing encountered issue: Screenshot from 2021-02-17 09-38-32

osu!lazer version: 2021.212.0.r106.86faa7f465-1, reproducible with 2021.109.0-1 and anything in between; did not try anything earlier.

Logs:

logs.zip

peppy commented 3 years ago

There's nothing to work with in this report, nor is this reproducible for us. We cannot solve per-user performance issues like this, sorry.

If you happen to come up with some kind of profiling or further information please feel free to attach it to this issue thread and we will re-visit it.

Micha-ohne-el commented 3 years ago

I am experiencing the exact same issue, and I have a few things to add to hopefully help solve this:

@pinembour, what DE are you using? I'm using KDE Plasma; maybe that could be a factor (though I doubt it)? One more thing to note is that we are both using high-refresh-rate monitors. Mine runs at 120 Hz, yours at 144 Hz. I tried lowering my refresh rate to 60 Hz but that didn't seem to improve things.

peppy commented 3 years ago

If you want to rule out garbage collection, you can try clicking "clear all caches" at the bottom of settings. This will force the harshest garbage collection level (which will almost never run). You can also see garbage collections as they show up as dots on the frame graph, but keep in mind they may not necessarily be the cause but the result (the runtime sees that nothing is happening so it runs a GC).
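
In .NET terms, the "harshest" level corresponds to a forced, blocking, compacting collection of every generation. The sketch below is only an approximation of what that button might trigger (the actual osu! code also clears game-side caches first, so the exact call is an assumption here):

```csharp
using System;
using System.Runtime;

// Hypothetical sketch of the "harshest" collection the "clear all caches"
// button would trigger (assumption; the real button also frees game caches).
class ForcedGcSketch
{
    static void Main()
    {
        // Ask the next collection to also compact the large object heap.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;

        // Blocking, compacting collection of every generation.
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
        GC.WaitForPendingFinalizers();

        Console.WriteLine($"Gen2 collections so far: {GC.CollectionCount(2)}");
    }
}
```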

The spike will show on both graphs because draw is dependent on update. If there's no update run, there is nothing to draw on the screen so it will wait on the update to arrive. If it shows as "work" on both then it's guaranteed to be a draw-side issue (update blocking draw will show as sleep time on the draw thread). This 99.9999999999% rules out garbage collection, and points to a GPU, driver, or GL code issue.

We can't really work further without you performing profiling on your system to find what is causing the gpu pipeline blocking. What I can say is that we don't experience this on Windows or macOS, so it is likely driver or hardware specific.

Micha-ohne-el commented 3 years ago

Thank you for that crazy fast reply!

> If you want to rule out garbage collection, you can try clicking "clear all caches" at the bottom of settings. This will force the harshest garbage collection level (which will almost never run). You can also see garbage collections as they show up as dots on the frame graph, but keep in mind they may not necessarily be the cause but the result (the runtime sees that nothing is happening so it runs a GC).

I did that, and just like you said, there are dots appearing on each of those spikes (red, yellow, and green ones if that matters).

> The spike will show on both graphs because draw is dependent on update. If there's no update run, there is nothing to draw on the screen so it will wait on the update to arrive. If it shows as "work" on both then it's guaranteed to be a draw-side issue (update blocking draw will show as sleep time on the draw thread). This 99.9999999999% rules out garbage collection, and points to a GPU, driver, or GL code issue.

> We can't really work further without you performing profiling on your system to find what is causing the gpu pipeline blocking. What I can say is that we don't experience this on Windows or macOS, so it is likely driver or hardware specific.

I used trace-cmd to do some profiling, but I don't know how to interpret the results properly. I used gpuvis to visually look at the resulting traces, and they do show a pattern, but I don't know what it means. This is one stutter, for example: 2021-02-22_14-57-42 And here it is zoomed in further: 2021-02-22_14-58-41 I uploaded all three traces that I created, so hopefully you or anyone else can interpret them better: Profiling Traces.zip

Feel completely free to request any more info, I really want this to get figured out, as it's actually really annoying when playing. And please forgive me if I did something wrong, it's my first time doing something like this.

peppy commented 3 years ago

Are those after manually pressing the button in settings? I may not have been clear: that button will cause a stutter (it does for me as well, for around 10 ms) because it's doing a lot of clean-up that usually would not occur. The trace output you provided may be useful if you can confirm it covers (or provide a second set of traces covering) the actual stutters you are experiencing during gameplay or otherwise.

Micha-ohne-el commented 3 years ago

Ah, sorry. These are actual stutters recorded during normal gameplay. I didn't press that button at all during those traces. Pressing the button causes a much larger stutter; should I record a trace for that as well, so you can compare the two?

pinembour commented 3 years ago

I'm glad I'm not the only one facing this issue; we might be able to figure something out here. Every point you added, @Micha-ohne-el, applies to me as well. I can also add that the time between each stutter seems to be dependent on the difficulty. For example, on one song, in an easy 4k map the stutters may appear every 11s or so, but in a harder 7k map they would appear every 8s. (I can also see a difference, although smaller, between an easy 4k and a harder 4k.)

> @pinembour, what DE are you using? I'm using KDE Plasma; maybe that could be a factor (though I doubt it)? One more thing to note is that we are both using high-refresh-rate monitors. Mine runs at 120 Hz, yours at 144 Hz. I tried lowering my refresh rate to 60 Hz but that didn't seem to improve things.

I am using GNOME; I also tried sway so I could rule out Mutter, with no improvement.

> The spike will show on both graphs [...]

I'm not sure it matters, but the spike is actually visible on all 4 graphs.

Micha-ohne-el commented 3 years ago

> I can also add that the time between each stutter seems to be dependent on the difficulty. For example, on one song, in an easy 4k map the stutters may appear every 11s or so, but in a harder 7k map they would appear every 8s. (I can also see a difference, although smaller, between an easy 4k and a harder 4k.)

I was thinking the same thing, but I thought I was imagining it (since the stutters don't always take the same amount of time, I thought it was just a coincidence that harder maps tended to have more severe stutters). But yes, I can also confirm this: the more objects there are, the more frequent the stutters are (and possibly the longer they take).

> I am using GNOME; I also tried sway so I could rule out Mutter, with no improvement.

Yeah, seems as though the issue is within osu!. As an aside, I also tried running the latest LTS kernel (5.4) but the issue persisted, leading me to believe that the kernel is not the issue.

> I'm not sure it matters, but the spike is actually visible on all 4 graphs.

Yes, for me the Audio and Input graphs show it in blue (meaning Sleep), which makes sense if they're also dependent on Update like Draw is.

peppy commented 3 years ago

Please do not test using mania. This is the only ruleset which has not been optimised for memory usage. Test using the main menu (since you say you can reproduce there) or in another ruleset.

pinembour commented 3 years ago

Okay, I'll keep that in mind, but I can still see that the difficulty correlates with the time between stutters in osu! or taiko.

pinembour commented 3 years ago

> Yeah, seems as though the issue is within osu!. As an aside, I also tried running the latest LTS kernel (5.4) but the issue persisted, leading me to believe that the kernel is not the issue.

I'm not sure the blame falls fully on osu!. I remember the game working fine not that long ago, but even a build from November has the same issue atm. Are you using Arch or a derivative? I'm using Arch, but when I tried a live USB of Fedora I couldn't reproduce the issue.

Micha-ohne-el commented 3 years ago

So I recorded two more stutters under very controlled conditions so that they can be reproduced: I started the game, pressed the osu! cookie once, and went to the top-bar-thingy to pause the music. Then I just sat there on the main menu without doing anything, not even moving the mouse cursor. This resulted in this stutter: 2021-02-22_16-18-27 As you can see, it is around 100 ms long. I had to wait quite a bit for it to happen (let's say 30 seconds). One thing to note is that the Update graph showed a blue spike this time! This is also reflected in the trace by a continuous purple line. No clue why it did that this time; it never did this any other time. After this, I went into the settings and clicked the Clear All Caches button to get this stutter: 2021-02-22_16-17-28 As you can see, it looks very different; for one, it is much longer at nearly 650 ms, but the events happening are also very different from the one above, further suggesting that this has nothing to do with GC. Both traces can be found here: More Stutters.zip

Micha-ohne-el commented 3 years ago

> I'm not sure the blame falls fully on osu!. I remember the game working fine not that long ago, but even a build from November has the same issue atm.

Yeah, I remember that too, and I can also confirm the exact same thing, even older builds produce this behavior for me, despite me being reasonably sure that those builds worked fine for me before.

> Are you using Arch or a derivative? I'm using Arch, but when I tried a live USB of Fedora I couldn't reproduce the issue.

Yes, I'm using Manjaro, which is Arch-based. So it seems like some update in some package that Arch rolled out around a week to two weeks ago is causing this issue. Do you have any idea as to what package this could be?

pinembour commented 3 years ago

It could be, but I have no idea what the specific package would be.

peppy commented 3 years ago

I'm willing to look into this and potentially point you in the right direction if you can record a trace in the standard dotnet format using dotnet-trace collect -p <pid> (you will need to install the dotnet sdk and use dotnet tools to install the trace tool).

pinembour commented 3 years ago

Here is the trace from my system, with the exact same procedure @Micha-ohne-el described earlier. The stutter happened 45 seconds after launching the game and the trace. trace.zip

Micha-ohne-el commented 3 years ago

> I'm willing to look into this and potentially point you in the right direction if you can record a trace in the standard dotnet format using dotnet-trace collect -p <pid> (you will need to install the dotnet sdk and use dotnet tools to install the trace tool).

@peppy, thank you for considering this! I recorded a trace with dotnet-trace. I was just sitting in the main menu after first launching the game (and entering and exiting the settings), so it took a short while for a stutter to occur. I tried to stop the trace as soon as that happened. I don't know how to visualize this sort of trace, but since you asked for it, I'm sure you'll know what to do with it: trace.nettrace.zip

If anything interesting happens related to this, or if I manage to fix it, I'll post my findings in here.

peppy commented 3 years ago

Second trace (from https://github.com/ppy/osu/issues/11800#issuecomment-784430804):

Screenshot: 20210224 152037 (Parallels Desktop app)

This is definitely looking to be a very deep GC execution. Not something we generally see happen, so it's curious as to why it's running. If possible, could you provide a longer trace where you stay at the main menu the whole time, but encounter two of these stutters? I'd like to cross-compare them to make sure it's the same thing each time.

Also, could you please specify your general memory stats (used / free / swap available)?

GC trigger reason:

Event Name Time MSec Process Name Rest
Microsoft-Windows-DotNETRuntime/GC/Start 24,480.796 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Count="15" Reason="AllocSmall" Depth="1" Type="NonConcurrentGC" ClrInstanceID="0" ClientSequenceNumber="0"

GC stats:

Event Name Time MSec Process Name Rest
Microsoft-Windows-DotNETRuntime/GC/Stop 24,539.920 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" DURATION_MSEC="59.124" Count="15" Depth="1" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/HeapStats 24,539.921 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" TotalHeapSize="247,511,584" TotalPromoted="3,958,320" Depth="1" GenerationSize0="21,699,360" TotalPromotedSize0="1,538,960" GenerationSize1="1,491,296" TotalPromotedSize1="2,419,360" GenerationSize2="150,392,352" TotalPromotedSize2="0" GenerationSize3="73,928,552" TotalPromotedSize3="0" FinalizationPromotedSize="237,335" FinalizationPromotedCount="5" PinnedObjectCount="3" SinkBlockCount="151" GCHandleCount="46,199" ClrInstanceID="0" GenerationSize4="24" TotalPromotedSize4="0"

Weirdly it looks to start as a gen1, which shouldn't take this long in the first place, but the full log shows it going deeper? Also note that the GC isn't even under much load...

Event Name Time MSec Process Name Rest
Microsoft-Windows-DotNETRuntime/GC/Start 24,480.796 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Count="15" Reason="AllocSmall" Depth="1" Type="NonConcurrentGC" ClrInstanceID="0" ClientSequenceNumber="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,480.799 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="4" RangeStart="0x7f5597fff000" RangeUsedLength="0x00000018" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,480.801 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="3" RangeStart="0x7f558ffff000" RangeUsedLength="0x04680f68" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,480.802 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="2" RangeStart="0x7f557ffff000" RangeUsedLength="0x08f6ce20" RangeReservedLength="0x08f6ce20" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,480.803 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="1" RangeStart="0x7f547ffff000" RangeUsedLength="0x0137c2e8" RangeReservedLength="0x0137c2e8" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,480.805 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="0" RangeStart="0x7f548137b2e8" RangeUsedLength="0x0517f950" RangeReservedLength="0x0ec82d18" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 24,481.802 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="0" Promoted="280"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 24,481.804 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="1" Promoted="0"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 24,481.846 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="2" Promoted="60,096"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 24,486.546 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="3" Promoted="3,656,106"
Microsoft-Windows-DotNETRuntime/GC/BulkMovedObjectRanges 24,535.396 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Index="0" Count="2,444" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GlobalHeapHistory 24,539.910 Process(22337) (22337) HasStack="True" ThreadID="22,388" ProcessorNumber="0" FinalYoungestDesired="83,886,080" NumHeaps="1" CondemnedGeneration="1" Gen0ReductionCount="0" Reason="AllocSmall" GlobalMechanisms="Compaction, Promotion, Demotion, CardBundles" ClrInstanceID="0" PauseMode="SustainedLowLatency" MemoryPressure="13" CondemnReasons0="0" CondemnReasons1="0"
Microsoft-Windows-DotNETRuntime/GC/PerHeapHistory 24,539.913 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" ClrInstanceID="0" FreeListAllocated="2,419,360" FreeListRejected="5,616" EndOfSegAllocated="0" CondemnedAllocated="0" PinnedAllocated="0" PinnedAllocatedAdvance="0" RunningFreeListEfficiency="94" CondemnReasons0="4" CondemnReasons1="48" CompactMechanisms="compact_low_ephemeral" ExpandMechanisms="not_specified" ConcurrentCompactMechanisms="" HeapIndex="0" ExtraGen0Commit="84,497,280" Count="5" MemoryPressure=""
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,539.918 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="4" RangeStart="0x7f5597fff000" RangeUsedLength="0x00000018" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,539.918 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="3" RangeStart="0x7f558ffff000" RangeUsedLength="0x04680f68" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,539.919 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="2" RangeStart="0x7f557ffff000" RangeUsedLength="0x08f6ce20" RangeReservedLength="0x08f6ce20" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,539.919 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="1" RangeStart="0x7f547ffff000" RangeUsedLength="0x0016c160" RangeReservedLength="0x0016c160" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 24,539.920 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" Generation="0" RangeStart="0x7f548016b160" RangeUsedLength="0x014b1b20" RangeReservedLength="0x0fe92ea0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/Stop 24,539.920 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" DURATION_MSEC="59.124" Count="15" Depth="1" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/HeapStats 24,539.921 Process(22337) (22337) ThreadID="22,388" ProcessorNumber="0" TotalHeapSize="247,511,584" TotalPromoted="3,958,320" Depth="1" GenerationSize0="21,699,360" TotalPromotedSize0="1,538,960" GenerationSize1="1,491,296" TotalPromotedSize1="2,419,360" GenerationSize2="150,392,352" TotalPromotedSize2="0" GenerationSize3="73,928,552" TotalPromotedSize3="0" FinalizationPromotedSize="237,335" FinalizationPromotedCount="5" PinnedObjectCount="3" SinkBlockCount="151" GCHandleCount="46,199" ClrInstanceID="0" GenerationSize4="24" TotalPromotedSize4="0"

peppy commented 3 years ago

First trace (from https://github.com/ppy/osu/issues/11800#issuecomment-783984256)

Screenshot: 20210224 154643 (Parallels Desktop app)

GC trigger reason:

Event Name Time MSec Process Name Rest
Microsoft-Windows-DotNETRuntime/GC/Start 45,088.094 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Count="14" Reason="AllocSmall" Depth="1" Type="NonConcurrentGC" ClrInstanceID="0" ClientSequenceNumber="0"

GC stats:

Event Name Time MSec Process Name Rest
Microsoft-Windows-DotNETRuntime/GC/Stop 45,129.702 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" DURATION_MSEC="41.608" Count="14" Depth="1" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/HeapStats 45,129.703 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" TotalHeapSize="192,293,344" TotalPromoted="9,875,232" Depth="1" GenerationSize0="24" TotalPromotedSize0="440,016" GenerationSize1="442,440" TotalPromotedSize1="9,435,216" GenerationSize2="117,489,816" TotalPromotedSize2="0" GenerationSize3="74,361,040" TotalPromotedSize3="0" FinalizationPromotedSize="243,375" FinalizationPromotedCount="11" PinnedObjectCount="0" SinkBlockCount="121" GCHandleCount="44,053" ClrInstanceID="0" GenerationSize4="24" TotalPromotedSize4="0"

This one is shorter, at 42ms, but that's still longer than we can live with. Very weird indeed.

Event Name Time MSec Process Name Rest
Microsoft-Windows-DotNETRuntime/GC/RestartEEStart 45,086.027 Process(281057) (281057) ThreadID="281,095" ProcessorNumber="0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/RestartEEStop 45,086.032 Process(281057) (281057) ThreadID="281,095" ProcessorNumber="0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/SuspendEEStart 45,087.091 Process(281057) (281057) ThreadID="281,095" ProcessorNumber="0" Reason="SuspendOther" Count="-1" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/SuspendEEStop 45,087.099 Process(281057) (281057) ThreadID="281,095" ProcessorNumber="0" DURATION_MSEC="0.008" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/RestartEEStart 45,087.229 Process(281057) (281057) ThreadID="281,095" ProcessorNumber="0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/RestartEEStop 45,087.235 Process(281057) (281057) ThreadID="281,095" ProcessorNumber="0" DURATION_MSEC="0.144" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/SuspendEEStart 45,088.023 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Reason="SuspendForGC" Count="13" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/SuspendEEStop 45,088.029 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" DURATION_MSEC="0.006" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/Triggered 45,088.048 Process(281057) (281057) HasStack="True" ThreadID="281,139" ProcessorNumber="0" Reason="AllocSmall" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/Start 45,088.094 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Count="14" Reason="AllocSmall" Depth="1" Type="NonConcurrentGC" ClrInstanceID="0" ClientSequenceNumber="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,088.095 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="4" RangeStart="0x7f318ffff000" RangeUsedLength="0x00000018" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,088.095 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="3" RangeStart="0x7f3187fff000" RangeUsedLength="0x046ea8d0" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,088.095 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="2" RangeStart="0x7f3177fff000" RangeUsedLength="0x0700c098" RangeReservedLength="0x0700c098" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,088.096 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="1" RangeStart="0x7f3023fff000" RangeUsedLength="0x01a32748" RangeReservedLength="0x01a32748" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,088.096 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="0" RangeStart="0x7f3025a31748" RangeUsedLength="0x050000b0" RangeReservedLength="0x0e5cc8b8" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 45,088.303 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="0" Promoted="0"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 45,088.303 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="1" Promoted="0"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 45,088.400 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="2" Promoted="122,160"
Microsoft-Windows-DotNETRuntime/GC/MarkWithType 45,095.341 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" HeapNum="0" ClrInstanceID="0" Type="3" Promoted="9,482,610"
Microsoft-Windows-DotNETRuntime/GC/BulkMovedObjectRanges 45,120.223 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Index="0" Count="2,677" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/BulkMovedObjectRanges 45,120.610 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Index="1" Count="2,677" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/BulkMovedObjectRanges 45,120.869 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Index="2" Count="2,677" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/BulkMovedObjectRanges 45,120.984 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Index="3" Count="976" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GlobalHeapHistory 45,129.693 Process(281057) (281057) HasStack="True" ThreadID="281,139" ProcessorNumber="0" FinalYoungestDesired="83,886,080" NumHeaps="1" CondemnedGeneration="1" Gen0ReductionCount="0" Reason="AllocSmall" GlobalMechanisms="Compaction, Promotion, CardBundles" ClrInstanceID="0" PauseMode="SustainedLowLatency" MemoryPressure="25" CondemnReasons0="0" CondemnReasons1="0"
Microsoft-Windows-DotNETRuntime/GC/PerHeapHistory 45,129.696 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" ClrInstanceID="0" FreeListAllocated="9,435,216" FreeListRejected="0" EndOfSegAllocated="0" CondemnedAllocated="0" PinnedAllocated="0" PinnedAllocatedAdvance="0" RunningFreeListEfficiency="99" CondemnReasons0="4" CondemnReasons1="16" CompactMechanisms="compact_low_ephemeral" ExpandMechanisms="not_specified" ConcurrentCompactMechanisms="" HeapIndex="0" ExtraGen0Commit="108,285,856" Count="5" MemoryPressure=""
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,129.700 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="4" RangeStart="0x7f318ffff000" RangeUsedLength="0x00000018" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,129.700 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="3" RangeStart="0x7f3187fff000" RangeUsedLength="0x046ea8d0" RangeReservedLength="0x07fff000" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,129.701 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="2" RangeStart="0x7f3177fff000" RangeUsedLength="0x0700c098" RangeReservedLength="0x0700c098" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,129.701 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="1" RangeStart="0x7f3023fff000" RangeUsedLength="0x0006c048" RangeReservedLength="0x0006c048" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/GenerationRange 45,129.702 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" Generation="0" RangeStart="0x7f302406b048" RangeUsedLength="0x00000018" RangeReservedLength="0x0ff92fb8" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/Stop 45,129.702 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" DURATION_MSEC="41.608" Count="14" Depth="1" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/HeapStats 45,129.703 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" TotalHeapSize="192,293,344" TotalPromoted="9,875,232" Depth="1" GenerationSize0="24" TotalPromotedSize0="440,016" GenerationSize1="442,440" TotalPromotedSize1="9,435,216" GenerationSize2="117,489,816" TotalPromotedSize2="0" GenerationSize3="74,361,040" TotalPromotedSize3="0" FinalizationPromotedSize="243,375" FinalizationPromotedCount="11" PinnedObjectCount="0" SinkBlockCount="121" GCHandleCount="44,053" ClrInstanceID="0" GenerationSize4="24" TotalPromotedSize4="0"
Microsoft-Windows-DotNETRuntime/GC/RestartEEStart 45,129.705 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/RestartEEStop 45,129.729 Process(281057) (281057) ThreadID="281,139" ProcessorNumber="0" DURATION_MSEC="41.707" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/FinalizersStart 45,129.880 Process(281057) (281057) ThreadID="281,096" ProcessorNumber="0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/FinalizersStop 45,129.918 Process(281057) (281057) ThreadID="281,096" ProcessorNumber="0" Count="11" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/FinalizersStart 45,129.920 Process(281057) (281057) ThreadID="281,096" ProcessorNumber="0" ClrInstanceID="0"
Microsoft-Windows-DotNETRuntime/GC/FinalizersStop 45,129.921 Process(281057) (281057) ThreadID="281,096" ProcessorNumber="0" Count="0" ClrInstanceID="0"

peppy commented 3 years ago

Going to reopen to track this as it looks like an actual issue.

The real question here is that you both say it worked fine at one point in time, which would point to a change in the way we are allocating memory or in the underlying runtime. We recently switched to .NET 5, which could definitely have introduced issues like this, but since you have already tried older builds (which should have the .NET runtime bundled) I'm not so sure.

Can you confirm that when you are checking old builds, you are using the AppImage, and not self-compiling or otherwise?

pinembour commented 3 years ago

Thank you for reopening and taking a closer look at it. I can confirm that I tried both compiling and the AppImage, on a recent and an old build.

I'll send a trace with multiple stutters in a few minutes.

peppy commented 3 years ago

The important part is that when you are testing on the old build, you are using the AppImage, or explicitly compiling with the older .NET runtime (3.x should do fine). Conversely, ensure you are building using .NET 5 when building the newer release (although I believe this is a requirement these days, so should be enforced at compile time).

Micha-ohne-el commented 3 years ago

> Can you confirm that when you are checking old builds, you are using the AppImage, and not self-compiling or otherwise?

Yes, I was using the AppImage, both with the most recent build, as well as the one from November last year. This is why I still think the issue is with Arch somehow. I actually reinstalled my entire system yesterday (for a different reason) and the issue still persists. Maybe I should install a non-Arch based distro and see if the issue persists? Definitely very weird.

I'm currently on mobile, but I'll provide more traces and memory stats when I get on my PC.

pinembour commented 3 years ago

Here are two new traces: trace-new with a current version of the AppImage, and trace-old with a version from November (2020.1114.0), still using an AppImage. I definitely did not encounter the issue last November, but I do now.
trace-new.zip: stutters at 40s, 1m09, and 1m40.
trace-old.zip: stutters at 38s, 1m06, and 1m40.

Both traces were recorded with dotnet 5.0.3.sdk103

peppy commented 3 years ago

Just for confirmation, can you show how you are running the AppImage? The .NET runtime should be bundled inside, so I want to confirm you aren't somehow forcing the old version to run on .NET 5 (which is your system-installed runtime/SDK version), as this is feasible and could mean a lot in figuring out where this regressed.

pinembour commented 3 years ago

I chmod +x the AppImage, and either run it via the CLI or by double-clicking on it. Is there a way to check which runtime is used? For the recent one, in order to capture the trace from startup I used dotnet-trace collect -- ./osu.AppImage. When I tried to use the same method on the build from November, it complained about this method not working on .NET 3.1, so I suppose it uses the included runtime.

peppy commented 3 years ago

That only brings more questions. We only started publishing .NET 5 releases in https://osu.ppy.sh/home/changelog/lazer/2021.118.1. All before that should be 3.1.

pinembour commented 3 years ago

I don't see the issue here; the build from November seems to be using dotnet 3.1, since dotnet-trace complains that it can't attach at startup on versions before 5. It works fine on the recent one, as it uses dotnet 5 as it should.

peppy commented 3 years ago

How are you getting the trace then? Attaching post-startup?

pinembour commented 3 years ago

Yes, using dotnet-trace collect -p <PID>

smoogipoo commented 3 years ago

@peppy: If I'm reading those correctly, these are cases where the GC is doing a compaction even though it has done basically no "worthwhile" (for lack of a better word) work to justify doing a compaction. It might be the compaction that's causing the large GC time, but if so, the GC should not be doing compactions so often in the first place.

There are two things that are worrying to me:

  1. That Gen0 is getting so large in the first place. I experience this too and it just feels very wrong (see attached screenshot).
  2. The high value for ExtraGen0Commit. It's hard to find documentation as to what this is, but if it's the total amount of memory allocated to these heaps internally (but not necessarily in use), then maybe this indicates high fragmentation?

Furthermore, after clearing caches after exiting gameplay, I'm frequently seeing something like the attached screenshot: the size never goes down from this after multiple collections. This is in the menu, where no allocations are happening.

I think I'm able to replicate the stutters seen here, but I have not investigated too deeply - I've dived into SOS and seen nothing too abnormal from the get-go, and looked into adding more GC stats for the F2 window (specifically that Commit value, which doesn't exist in ETW events).
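
For reference, that Commit value can also be read in-process: .NET 5 exposes committed bytes, fragmentation, and per-generation sizes through GC.GetGCMemoryInfo(). A minimal, non-osu!-specific sketch of polling it (an F2-overlay version would sample these once per frame instead):

```csharp
using System;

// Minimal sketch: read heap size, committed bytes and per-generation sizes
// via GC.GetGCMemoryInfo() (the extra properties are available from .NET 5).
class GcStatsSketch
{
    static void Main()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        Console.WriteLine($"Heap size:  {info.HeapSizeBytes / 1048576.0:F1} MB");
        Console.WriteLine($"Committed:  {info.TotalCommittedBytes / 1048576.0:F1} MB");
        Console.WriteLine($"Fragmented: {info.FragmentedBytes / 1048576.0:F1} MB");

        // Indices 0-2 are gen0-gen2; higher indices cover the LOH/POH.
        for (int gen = 0; gen < info.GenerationInfo.Length; gen++)
            Console.WriteLine($"Gen{gen} size after last GC: {info.GenerationInfo[gen].SizeAfterBytes:N0} B");
    }
}
```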

peppy commented 3 years ago

Those gen0 sizes look way too large. Are you able to try commenting out the GC mode specification we have and see if it helps/changes things at all? https://github.com/ppy/osu-framework/blob/23ca45ece4b559d3fd255f987a16a6f43111e547/osu.Framework/Platform/GameHost.cs#L531
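
For context, the linked line sets the runtime's GC latency mode, which is consistent with the PauseMode="SustainedLowLatency" values visible in the traces above. A stripped-down sketch of that kind of override (not the actual osu-framework code):

```csharp
using System.Runtime;

// Sketch of a latency-mode override like the one referenced above; commenting
// it out leaves the runtime on its default (Interactive) latency mode.
class LatencyModeSketch
{
    static void Main()
    {
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
    }
}
```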

smoogipoo commented 3 years ago

Commenting it out doesn't really improve things. If anything, it actually shows the issue even more clearly.

Here are two videos I captured:

  1. https://drive.google.com/file/d/1uGSAfdGn9ZkOudOPNuoowBdmFxOOrgbW/view
    • Note how GC jumps between 15MB and 24B with every collection.
  2. https://drive.google.com/file/d/1355Iaeo5io5Y30CRyXky3G755SEW7BmL/view?usp=sharing
    • Note how GC goes down to 24B with every manual collection, then up to 33MB by itself shortly afterwards (a secondary collection?).

Along with that, you can still see the great disparity in size between Gen0 and Gen1... I'm not sure how to explain any of this - probably requires manual digging into what's actually on the heap, if you want me to go into that.

Like... Are we allocating 33MB between two GC cycles to exhibit this behaviour?
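
One way to answer that empirically is to sample the process-wide allocation counter and log the delta; GC.GetTotalAllocatedBytes() reports allocation volume independently of when collections actually run. A minimal sketch, assuming periodic sampling from a console app is acceptable:

```csharp
using System;
using System.Threading;

// Sketch: measure how many bytes are allocated per second by sampling the
// process-wide allocation counter; this is independent of GC timing.
class AllocationRateSketch
{
    static void Main()
    {
        long last = GC.GetTotalAllocatedBytes(precise: true);

        for (int i = 0; i < 10; i++)
        {
            Thread.Sleep(1000);
            long now = GC.GetTotalAllocatedBytes(precise: true);
            Console.WriteLine($"Allocated in the last second: {(now - last) / 1048576.0:F2} MB");
            last = now;
        }
    }
}
```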

peppy commented 3 years ago

It really does look like something is allocating in 16 MB blocks. Those display numbers only update on collection, so it looks like sometimes when you press the button there have been (close to) zero allocations, which is what I would expect at the main menu, while other times something has done a massive allocation.

Are you doing this from the main menu after a fresh startup? If so, I can't reproduce on macOS, so it may be something platform specific:

https://user-images.githubusercontent.com/191335/108975308-b27d6b00-76c9-11eb-82f7-007c63d034f4.mp4

smoogipoo commented 3 years ago

Yeah, that's on a fresh startup. Are you logged in? I've noticed that this gen0 behaviour doesn't happen when performing the same inputs while not logged in. Maybe something SignalR-related?

peppy commented 3 years ago

Yep, logged in to the dev server

pinembour commented 3 years ago

Here is a capture of my screen while the stutters are happening. The graphs jump all over the place since I'm recording at the same time, but the stutters are still clearly visible around 3s and 25s. The stutters seem to coincide with collections, be they gen0 or gen1. video.zip

peppy commented 3 years ago

As we saw in the event logs, in your video it's only reporting a gen0 collection, which should be basically instant. I believe @smoogipoo has found a potential bug causing this to occur, so we should have enough information to act on this, thanks!

Micha-ohne-el commented 3 years ago

> This is definitely looking to be a very deep GC execution. Not something we generally see happen, so it's curious as to why it's running. If possible, could you provide a longer trace where you stay at the main menu the whole time, but encounter two of these stutters? I'd like to cross-compare them to make sure it's the same thing each time.

> Also, could you please specify your general memory stats (used / free / swap available)?

@peppy, I'm not sure you still need this, there has been a lot of development here, but I recorded more stutters on the main menu, this time with memory info.

To gather the memory info, I used this shell script:

#!/bin/bash
# Samples the resident set size (RSS) of the osu! process in a tight loop.
# Each line written: sample time, current RSS (KB), change since last sample (KB).

last_rss=0
while true; do
  time=$(date +%X.%N)
  # RSS of the process found by pgrep; "grep -v RSS" strips the ps header line.
  rss=$(ps -o rss $(pgrep 'osu!') | grep -v RSS)
  rss_diff=$(expr $rss - $last_rss)
  last_rss=$rss

  echo -e "$time\t$rss\t$rss_diff" >> memoryinfo.txt
done

It outputs the memory usage of osu! in this format:

time    current_mem mem_difference

time is the current system time, and the values for current_mem and mem_difference are in KB. The values refer to the process's “resident set size, the non-swapped physical memory that a task has used” (from man ps). As I don't have any swap space, this should be the total memory used by the process.

Here are the results of both the trace and the memory info script: profile_osu.zip There were around 4 or 5 stutters total. Again, I was just sitting in the main menu, doing nothing, playing no music. And to clarify, this was recorded on the latest AppImage build using dotnet-trace collect -n osu!.

smoogipoo commented 3 years ago

Are you able to compile the latest master and check if it's any better for you?

pinembour commented 3 years ago

There are still stutters, but it feels a little better. It might be placebo, I'm not sure. Here's what I'm seeing on latest master:

Here's a new trace from master: near the start there is a stutter with Collections Gen1 incrementing, and a little later there is a second stutter when Collections Gen0 and Collections Gen2 both increment at the same time. trace_master.zip

Micha-ohne-el commented 3 years ago

I can confirm the exact same things that @pinembour is describing. It's maybe a little better, but the stutters definitely still happen. I also recorded another trace: trace.nettrace.zip I compiled the latest master branch (commit a792c3f13c21d207378e8dbdf151e84ad7e033ef) with dotnet run --project osu.Desktop -c Release. dotnet --version gives 5.0.103, and that is the latest version that I can install at the moment.

peppy commented 3 years ago

And just to confirm beyond doubt: both of you are able to reproduce these stutters after starting the game, at the main menu without any gameplay involved?

Also: are you compiling locally in release mode? This is important when testing these kinds of issues (dotnet run -c Release).

pinembour commented 3 years ago

Yes for me for both questions.

Micha-ohne-el commented 3 years ago

Yes and yes

smoogipoo commented 3 years ago

Does running with the following environment variables make it any better?

COMPlus_TieredCompilation=0
COMPlus_TC_QuickJit=0
COMPlus_TC_QuickJitForLoops=0

pinembour commented 3 years ago

It doesn't look like those environment variables change anything.

This has nothing to do with the environment variables, but in my testing I found something odd. I was trying to reproduce on a live USB of Manjaro, but I couldn't. However, the F2 menu revealed something that might be useful:

peppy commented 3 years ago

We're still looking into this and do have some leads, but in the meantime: are you able to re-test at the main menu with the frame graphs hidden? Note that you may want to test for a longer period of time, as with them hidden there are fewer allocations and it may take longer for a GC run to kick in.

So far I have been able to draw a direct correlation to having that display visible and seeing increased GC time.

pinembour commented 3 years ago

Thank you for your hard work. I tested again on the main menu without the frame graph and let the game run for 3 minutes. There were no stutters that I could find, but with so little moving it's really hard to catch them. I tried osu!taiko, and even with the frame graph and the F2 menu closed there were noticeable stutters.

Micha-ohne-el commented 3 years ago

I recorded another trace. Latest master (a792c3f13c21d207378e8dbdf151e84ad7e033ef), FPS graph / display completely off. I was sitting in the main menu again, this time circling my cursor for a couple of minutes. I didn't see any stutter, but that doesn't mean there was none (they're hard to see when nothing is really moving, as @pinembour said). Either way, here is the trace: trace.nettrace.zip