iXit / wine-nine-standalone

Build Gallium Nine support on top of an existing WINE installation
GNU Lesser General Public License v2.1

Halo CE fps drops when particles are enabled, only occurs on wine-nine #85

Closed ghost closed 3 years ago

ghost commented 3 years ago

The frame rate drops severely in Halo CE (2001) when I enable "Particles" in the video settings menu when using wine-nine. Particles are generated when there is an in-game explosion, and when an explosion occurs there is a sudden drop in frame rate. There is no drop in frame rate when an explosion occurs if I disable the Particles setting, a slight drop if I set Particles to Low, and a large drop if I set it to High.

This issue only occurs in wine-nine, regular wine does not have a severe frame rate drop when there are particles. It occurs regardless of other visual settings or screen resolution. There is no throttling taking place when this occurs (CPU and GPU power management disabled).

The game runs much faster in wine-nine than under regular wine, especially on large custom maps. I normally play the game with fps capped at 60 using Chimera, and it can drop to 20 fps or even lower when many explosions are present. Otherwise, the game runs at a smooth 60 even on large custom maps (something regular wine can't do).

Unfortunately I am unable to capture a usable apitrace for Halo CE: the trace plays back with a black screen and buffer overflow errors. This exact issue was already reported to the apitrace developers in 2016 and has not been fixed.

Log:

GALLIUM_HUD=fps,cpu0+cpu1+cpu2+cpu3,GPU-load wine haloce.exe 
008c:err:ntoskrnl:ZwLoadDriver failed to create driver L"\\Registry\\Machine\\System\\CurrentControlSet\\Services\\wineusb": c0000142
0024:err:winediag:wined3d_dll_init Setting multithreaded command stream to 0x1.
Native Direct3D 9 v0.7.0.368-release is active.
For more information visit https://github.com/iXit/wine-nine-standalone
0024:err:winediag:MIDIMAP_drvOpen No software synthesizer midi port found, Midi sound output probably won't work.
Native Direct3D 9 v0.7.0.368-release is active.
For more information visit https://github.com/iXit/wine-nine-standalone
Using profile path .
fixme:d3d9nine:DRIPresentGroup_GetMultiheadCount (0xd2f508), stub!
fixme:d3d9nine:DRIPresentGroup_GetMultiheadCount (0xd2f508), stub!
Loading font fonts\Hack-Bold.ttf...done
Loading font fonts\Interstate-Bold.ttf...done

My system runs Ubuntu 18.04, with wine 5.10 staging and the latest 0.7 wine-nine. GPU is an AMD HD6670 (Mesa 20.0.8, Radeon driver). This hardware would have no problem playing Halo CE maxed out on Windows.

I'm happy to provide additional info if needed. Thanks!

dhewg commented 3 years ago

That sounds familiar, didn't we have a report of the very same issue somewhere?

dhewg commented 3 years ago

Hm, can't find it, maybe @axeldavy remembers

dungeon007 commented 3 years ago

Well, this patch slowed it down: https://github.com/iXit/Mesa-3D/commit/7dc57d506d0cf3fddb7288a5cbe2d740b2113aec But that is there to prevent flickering with the default buffered rendering... That said, you can try reverting it, but then also disable buffering in config.txt. That performs slower for max fps, but with the patch reverted it won't flicker like the default mode does, and of course particles shouldn't bring down fps like that... And since you want to play at a capped 60 fps, that might be ideal for you. 😃

dungeon007 commented 3 years ago

Dunno if instruction is needed 😃 This is the one you want: "DisableBuffering - Forces a video card to render each scene - used to prevent mouse lag". And then just below Vendor = 0x1002 "ATI" put something like:

    0xXXXX = "Radeon HD6670"
    DisableBuffering
    break

Where XXXX is of course your PCI ID.

dungeon007 commented 3 years ago

"I normally play the game with fps capped at 60 using Chimera..." So Chimera, https://github.com/SnowyMouse/chimera

For Chimera as i see... that should be 'chimera_block_buffering' or Block all bullshit 😃

dungeon007 commented 3 years ago

Seems DXVK suffers also with this enabled https://github.com/doitsujin/dxvk/pull/1733

And again, it seems it sits there just because of HALO's default; if only all people would disable that buffering... 😃

dungeon007 commented 3 years ago

Yep, even dxvk>vulkan seems slower than WINE on particles there... so, I guess this should be reverted and people should instead be advised to disable that buffering.

Joshua-Ashton commented 3 years ago

Saw this issue referenced on the DXVK tracker.

We managed to work around the locking contention by doing an implicit discard + copy in the contested scenario.

dungeon007 commented 3 years ago

Sure, and that latest chimera mod seems to have a Vulkan renderer now: http://vaporeon.io/hosted/halo/chimera/chimera-latest.7z

Anyway, this bug is i believe about older hardware (Radeon HD6670) that does not have vulkan.

dungeon007 commented 3 years ago

I mean, I don't have such TeraScale2 hardware to test... maybe it is much slower with disabled buffering on that, no idea. 😃 ping @spsx

dungeon007 commented 3 years ago

Sure, thanks, so if you are with disabled buffering already then you will be fine with patch reverted.

Joshua-Ashton commented 3 years ago

@dungeon007 What I suggested works on any era of hardware

dungeon007 commented 3 years ago

@Joshua-Ashton Yeah, if we talk about preventing flickering in default buffered mode, that seems to work in dxvk now... but we talk about performance of particles here, even with current dxvk 1.8.1 fps still go down worse than even WINE. 😃

Joshua-Ashton commented 3 years ago

Can you make an issue about that with a trace? That should be fixed afaik.

dungeon007 commented 3 years ago

Well, situation on dxvk with particles isnt so worrying, diff is kind of acceptable so 15-20% slower than GL WINE, while in this bug with nine fps drops with particles is like 70% slower than WINE 😃

dungeon007 commented 3 years ago

And a thing is, if i revert patch on DXVK it flickers both ways (game default or with disabled buffer), like it work the same - so that would be bad revert there 😃 Meanwhile on nine if i revert patch, it works differently... max fps become slower, but there it does not flicker with disabled buffer anymore, just default flickers.

axeldavy commented 3 years ago

Well, systemmem is a bit weird in its behaviour and works differently with respect to lock flushing on nvidia vs amd vs intel. I'll check the results of our test scripts and see if I can make sense of how to implement it efficiently and conformantly.

dungeon007 commented 3 years ago

Most efficient is to just disable particles and decals in this game, as they could kill perf if you go crazy. Decals are even crazier, as one could just constantly shoot at one spot on a wall and kill perf when looking closer at it. 🤣

dungeon007 commented 3 years ago

It could be improved, no doubt. Just that if you give too much freedom to people, they will make a CS:GO benchmark with so many smoke particles that it could kill any GPU in existence 🤣

Joshua-Ashton commented 3 years ago

@dungeon007 It's not the concept of smoke particles that are the problem, it's just that Halo CE is dumb in the way it implements them. Source's don't have this problem.

@axeldavy D3D9 locking bugs have at least taken 10 years off my lifespan. They're the worst. x)

axeldavy commented 3 years ago

I suspect for SYSTEMMEM, the GPU is supposed to copy into a GPU buffer only the relevant area for the draw call. The flags nooverwrite, discard, etc might help some optimizations to enable reuse of previous data uploaded. The difference of behaviour between the vendors relative to double locks, draw while in lock, threading, etc might be due to differences in implementation of when they do the upload. It can probably be implemented like MANAGED, but with maybe slightly altered behaviour for how to use the flags passed.

dungeon007 commented 3 years ago

@Joshua-Ashton Well, because that Halo default flickering patching, we are slowing down other apps too 🤣 From your bugzilla: https://github.com/doitsujin/dxvk/issues/1730 https://github.com/doitsujin/dxvk/issues/1828 ... Here on nine, if i revert that patch it improve perf on all these... otherwise these Blood Rayne 1 and 2 runs like couple times slower than WINE, etc... because of the same thing. Commented on Rayman Origins there too, unrelated... but OK 🤣

axeldavy commented 3 years ago

@Joshua-Ashton You're welcome XD. Well I checked the results of our tests when we investigated buffer locking behaviour for nine, and given the results (intel vs amd vs nvidia on win7/10), I think what makes the most sense is if SYSTEMMEM is basically like MANAGED except the upload from the ram buffer to the device buffer is handled by the driver. Indeed each vendor seems to have slightly different conditions for the upload, while MANAGED is consistent across all of them.

Here is a reminder for MANAGED:
- Lock does dirty the buffer, not unlock.
- The draw call triggers the upload. Compatible with internal threading (upload is scheduled for later if using internal threading, but locking again without READONLY will flush the thread).
- If you draw again without dirtying the buffer you don't upload again.
- You always get the same pointer when locking, even when passing DISCARD.

Differences relative to MANAGED for SYSTEMMEM (consider that I might be wrong there, analysing these old tests):
- On Nvidia the draw call triggers the upload right now (probably because it flushes the internal thread), but not on Intel/AMD.
- Nooverwrite has the same effect as Readonly for some vendors but not others (at least for some of the tests; if it were always true, halo would likely not work as it uses nooverwrite only for some systemmem resources, or the drivers have workarounds for it).

Thus it is probably OK to implement SYSTEMMEM exactly like MANAGED. The important question unanswered by the tests though is whether only what is needed by the draw call is uploaded, or if the whole dirty area is uploaded. I think for MANAGED we had determined the whole dirty area was uploaded, but I'm not 100% sure. SYSTEMMEM could be different.
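As an illustration of the MANAGED rules listed above, here is a minimal single-threaded C model. It is only a sketch: the struct, the function names and the buffer size are hypothetical and are not taken from nine or from the D3D9 runtime. It uploads the whole buffer on a dirty draw (a simplification, since the thread leaves open how much is really uploaded) and assumes the buffer starts dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_LOCK_READONLY 0x10u /* same value as D3DLOCK_READONLY */
#define SKETCH_BUF_SIZE 16

/* Hypothetical model of a MANAGED buffer: a RAM copy plus a device copy. */
struct managed_buf {
    unsigned char ram[SKETCH_BUF_SIZE];    /* what the app writes to */
    unsigned char device[SKETCH_BUF_SIZE]; /* what draw calls read from */
    bool dirty;                            /* set by lock, cleared by upload */
    unsigned uploads;                      /* number of RAM -> device copies */
};

void managed_init(struct managed_buf *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
    b->dirty = true; /* assumption: a fresh buffer starts dirty */
}

/* Lock dirties the buffer unless READONLY is passed. The returned pointer
 * is always the same, whatever the flags (even DISCARD). */
unsigned char *managed_lock(struct managed_buf *b, unsigned flags)
{
    assert(b != NULL);
    if (!(flags & SKETCH_LOCK_READONLY))
        b->dirty = true;
    return b->ram;
}

/* Unlock does not touch the dirty state. */
void managed_unlock(struct managed_buf *b)
{
    (void)b;
}

/* The draw call triggers the upload, and only when the buffer is dirty.
 * Returns the first byte the device copy holds at draw time. */
unsigned char managed_draw(struct managed_buf *b)
{
    assert(b != NULL);
    if (b->dirty) {
        memcpy(b->device, b->ram, sizeof(b->device));
        b->dirty = false;
        b->uploads++;
    }
    return b->device[0];
}
```

With this model, a "lock, write, draw, write, unlock" sequence makes the draw see the first write, and the second write only reaches the device copy after a later non-READONLY lock dirties the buffer again.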

Back to Halo, the game seems to use nooverwrite in a round-robin fashion on systemmem index buffers (going back to the start of the buffer without discard). Consecutive nooverwrite locks do not overlap each other. Systemmem vertex buffers seem to have various behaviours (discard, nothing, etc).

I suspect the reason it passes nooverwrite is as a hint for a given driver hack/optimisation. Maybe without it that driver has to upload the whole systemmem buffer when dirty because some apps write outside of the locked bounds or something like that.

As for nine, given the above, I think the sane thing to do is to use the MANAGED path for SYSTEMMEM. This should fix the performance issue. Hopefully the bug with particles will not come back.

dungeon007 commented 3 years ago

We probably just need a switch between: 1) default, 2) discard with write only, or 3) discard with read_write 🤣. I just found that this helps performance for HALO and Blood Rayne: PIPE_MAP_READ_WRITE | PIPE_MAP_DISCARD_WHOLE_RESOURCE; Perf now flies and goes above WINE 🤣 That is without the previous patch reverted!

dungeon007 commented 3 years ago

I mean, the default stays as it is now, plus two choices, either: PIPE_MAP_READ_WRITE | PIPE_MAP_DISCARD_WHOLE_RESOURCE; or PIPE_MAP_WRITE | PIPE_MAP_DISCARD_WHOLE_RESOURCE;

axeldavy commented 3 years ago

Well you should definitely try to use MANAGED instead of SYSTEMMEM.

I pushed a patch for testing in the ixit Mesa-3D that does use the MANAGED path for SYSTEMMEM. You can try that too. It should give the same performance as your hack while hopefully adding no visual issue.

dungeon007 commented 3 years ago

Sure will try, that... and congrats on new patches entering mesa, more testers are always better 🤣

dungeon007 commented 3 years ago

Only that i dont see such patch there 🤣, but OK, never mind didnt have a time for testing today anyway.

axeldavy commented 3 years ago

Sorry, fixed.

dungeon007 commented 3 years ago

Blood Rayne now perform fine with that patch, HALO becomes flickering fiesta in both modes 🤣 , Far Cry 20% slower, Tomb Raider Legend somewhat same slower like FC, Dungeon Siege 2 - now 3 times slower... that said i think i will stop testing this, not good... that was quick. 🤣

axeldavy commented 3 years ago

Could you describe an easy way to reproduce the halo issue ? I guess there is something I missed there...

For the slower games I guess I'll need to know the usage pattern of the buffers. Would it be possible to have a trace of the affected games ?

EDIT: It'd be interesting to know whether disabling csmt (csmt_force=0) helps in any way with the halo issue and the performance regression. Indeed, this affects when the MANAGED data is uploaded.

dungeon007 commented 3 years ago

There is the default mode in HALO, and the second mode is if you disable buffering, in config.txt I mean, or via chimera... I just do the same thing that OP described in the first post, in a video... firing the same way, with bombs and that rocket launcher 🤣 HALO you have to try, isn't it? And for Blood Rayne there is some trace in a dxvk bug here: https://github.com/doitsujin/dxvk/issues/1828

dungeon007 commented 3 years ago

Fast way... max everything out in HALO, Multiplayer>LAN>Battle Creek>OK>OK. Go up there on the arch to take that Rocket Launcher, then go down and fire the same way as you see in the video... and that is it 🤣

dungeon007 commented 3 years ago

Bombs on the right mouse button, rocket on the left... the more particles around at the same time, the better. And preferably in disabled buffer mode, as that is what OP did too 🤣

dungeon007 commented 3 years ago

In that mode particles are worse: literally speaking, DXVK is 16.66% worse than WINED3D on default and 33.33% worse in disabled buffering mode. NINE is 66.66% worse than WINED3D in disabled buffering mode, and that is what this bug is about 🤣 You just have to fire bombs and rockets often and consistently, and closer to you, for the best enjoyment of fps drops 🤣 OP likes to cap fps at 60 and for the frame rate to never drop below that. But such framerate drops with particles, down to 26 fps as seen in the video, are unexpected, as that does not happen on Windows, and so on and so forth... Probably the weaker the hardware you test on, the better 🤣

dungeon007 commented 3 years ago

Hardware that could pass through smoke in the CS:GO benchmark without fps dropping below 60 (if such hardware even exists on this planet Earth) is not good for testing 🤣

dungeon007 commented 3 years ago

As I said, you just need a switch between the default as it is, or either: PIPE_MAP_READ_WRITE | PIPE_MAP_DISCARD_WHOLE_RESOURCE; just for HALO, Blood Raynes... or PIPE_MAP_WRITE | PIPE_MAP_DISCARD_WHOLE_RESOURCE; there are apps that would want this too... Or maybe discard write is good to go as the default too (I didn't spot it hurting perf of anything), and just switch to discard read_write for problematic apps like these two. And that is it, from my experience and regardless of what the docs are saying. 🤣

axeldavy commented 3 years ago

Which graphic card do you have ? On r600/radeonsi, I would have expected the path taken for small MANAGED updates to be fast.

Anyway there are ways to make the usage pattern fast, but first we have to figure out what is required to be conformant (have no bugs). Uploading only the intersection of the dirty region with the region needed for the draw call might be the answer.
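To make the "intersection of the dirty region with the region needed for the draw call" idea concrete, here is a small C sketch. The `byte_range` type and the function name are hypothetical helpers, not nine code, and ranges are assumed to be half-open byte intervals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Half-open byte range [start, end). Hypothetical helper type. */
struct byte_range {
    unsigned start;
    unsigned end;
};

/* Compute the part of the dirty range that the draw call actually reads.
 * Returns false (and leaves *out untouched) when nothing needs uploading. */
bool upload_range_for_draw(struct byte_range dirty,
                           struct byte_range draw,
                           struct byte_range *out)
{
    assert(out != NULL);

    unsigned start = dirty.start > draw.start ? dirty.start : draw.start;
    unsigned end = dirty.end < draw.end ? dirty.end : draw.end;

    if (start >= end)
        return false; /* the ranges do not overlap */

    out->start = start;
    out->end = end;
    return true;
}
```

A real implementation would also have to keep the part of the dirty range that was not uploaded marked as dirty, which this sketch does not track.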

dungeon007 commented 3 years ago

Tested yesterday your branch on Bonaire card that is GCN 1.1, posting this now from Kabini APU that is also GCN 1.1, as i wanna test these perf issues even better 🤣

dungeon007 commented 3 years ago

BTW, they are both on amdgpu driver of course, not on older radeon. But i could switch to that for testing, reboot away. 🤣

axeldavy commented 3 years ago

Don't waste your time on that. I'm more interested on investigation of the performance and conformance issues.

For example what about dirtying the whole region when locking a managed buffer ? What about disabling csmt (csmt_force=0) ? etc

dungeon007 commented 3 years ago

Sure thing, but what about this comment from https://github.com/doitsujin/dxvk/issues/1828: "Poor performance is only seen through dxvk (d3d9) and Gallium Nine." Seems like, with conformance there is no performance 🤣

Joshua-Ashton commented 3 years ago

@axeldavy I envy your patience with this guy. :frog:

dungeon007 commented 3 years ago

And as for HALO's conformance, and the new bug about it: https://github.com/iXit/wine-nine-standalone/issues/98 we first have to ask whether the user has disabled buffering or not, as these are different 🤣

dungeon007 commented 3 years ago

@Joshua-Ashton Don't worry, be happy. We were talking about this two years ago, it is just deja vu 🤣

axeldavy commented 3 years ago

@Joshua-Ashton I assume you're interested as well in knowing what is exactly the expected behaviour for these SYSTEMMEM buffers. Unfortunately the tests we had done for Nine years ago do not seem sufficient. I don't have enough time to write better tests right now, and the hw I used to test the intel and amd win10 behaviour died. Would you work on it (It would need 1-2 days of work I guess) ?

Joshua-Ashton commented 3 years ago

I'll take a look soon-ish and let you know :)

axeldavy commented 3 years ago

Alright, here are the tests we had: buffer_experiments.tar.gz

Attached is a patch to the wine visual tests with a test that checks the vertex buffer locking behaviour. It is quite hackish. The tests supposedly were made to pass on nvidia hw. Attached are the test results on intel and amd. I began writing a better test (the other patch that applies on top), but I don't have the results. When on windows you assign the affinity of the exe to a single cpu thread, you get single-threaded behaviour. Else you get internal threading. Testing in both modes is quite useful to know what is going on.

In addition, the DDI documentation https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/d3dumddi/nc-d3dumddi-pfnd3dddi_lockasync?redirectedfrom=MSDN is quite useful to guess what is going on. For example you can see that the driver can get asked to create a buffer in the systemmem pool, but not the managed pool. You can see that LockAsync (which is likely to be used for locks when internal threading is activated) can return an error if "The user-mode display driver does not support LockAsync for the specified resource." I bet Nvidia returns an error for Systemmem, but not AMD and Intel which would explain the behaviour observed (lock write draw write unlock -> the second data is used on AMD and intel for the draw when threading is ON. Else the first data. NVidia always the first data).

Don't try to analyse the tests results, I already did in my summary a few messages ago, but here is a complement of the conclusions you can draw from the tests:

DEFAULT Pool: Except some exceptions (AMD dynamic), all flags and vendors behave the same way. Implementing lock, nooverwrite and discard the most obvious way will give the same behaviour (Likely DXVK is already ok there. Nine is.).

MANAGED Pool: All vendors behave the same (likely because it is managed by the runtime). Lock does dirty the area, and it is uploaded when the draw call command is sent to the driver (that is, when the client requests a draw call if threading is off; later otherwise). Locking with readonly doesn't dirty the region (nor, I think, trigger a flush of the thread queue). Unlock doesn't dirty the area. The MANAGED buffer starts dirty. In addition, while not in the tests, I think we have found (buggy game writing to the managed buffer with Readonly long after the creation) that the draw calls only trigger the upload of bound buffers.

-> Question to answer: Is only the area needed for the draw call uploaded, or the whole dirty region? I strongly suspect the latter, but maybe I am wrong. To test that we would need to draw something with the buffer (because it starts fully dirty), then lock/write to a region [0, b] while writing out of the locked region, then draw using the data out of the locked region, and see if the old content or the new content is used (the initial draw must use a different region. Basically the test is to know if the initial draw did upload the whole buffer or just what was needed, assuming the second draw won't upload an area not set dirty).

-> Question to answer: If the buffer is bound, but you don't need the stream for the draw call, is the dirty area still uploaded?

-> Question to answer: If you dirty the areas [0, a] and [b, c], is the whole [0, c] area set dirty?
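For the last question above, one simple bookkeeping scheme that would produce exactly that behaviour is a single bounding dirty range per buffer. The C sketch below shows it; whether the D3D9 runtime actually tracks dirtiness this way is the open question, so this is purely hypothetical and the names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* One bounding dirty range per buffer, half-open [start, end).
 * The range is empty when start == end. Hypothetical type. */
struct dirty_range {
    unsigned start;
    unsigned end;
};

/* Grow the bounding range so that it also covers [start, end). */
void dirty_add(struct dirty_range *d, unsigned start, unsigned end)
{
    assert(d != NULL && start <= end);

    if (start == end)
        return; /* nothing to add */

    if (d->start == d->end) {
        /* Previously clean: the new range becomes the dirty range. */
        d->start = start;
        d->end = end;
        return;
    }

    if (start < d->start)
        d->start = start;
    if (end > d->end)
        d->end = end;
}
```

Under this scheme, dirtying [0, 4) and then [8, 12) leaves [0, 12) dirty, so the untouched gap [4, 8) would be uploaded as well. Tracking a list of disjoint ranges would avoid that at the cost of more bookkeeping.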

SYSTEMMEM Pool: There are differences between vendors, but the behaviour matches MANAGED more than DEFAULT. For example if you do "lock write draw write unlock" and threading is disabled, the first data is used for all vendors, which is not the case for DEFAULT. In addition the same pointer is always returned, even when you pass DISCARD.

-> Question to answer: Same questions as for MANAGED. Given the issues reported in this thread, there are likely some differences.

axeldavy commented 3 years ago

I tried the Battle creek and I get about 600 fps (sometimes more than 1000), and when I launch a grenade or a rocket, it displays perfectly fine, and above 200 fps (there is a fps drop, but above that).

axeldavy commented 3 years ago

Instead of using MANAGED buffers for SYSTEMMEM, I used the old path, but passing NOOVERWRITE and using this hack:

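    if (This->base.pool == D3DPOOL_SYSTEMMEM) {
        /* Ignore the DISCARD flags the game passes. */
        Flags &= ~(D3DLOCK_DISCARD);
        /* Force a DISCARD when the game restarts filling from offset 0. */
        if ((Flags & D3DLOCK_NOOVERWRITE) && OffsetToLock == 0)
            Flags |= D3DLOCK_DISCARD;
    }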

Basically I ignore discards (game gives them for nothing) but force a discard when it restarts filling the buffer from scratch.
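To see what this flag adjustment does to a round-robin NOOVERWRITE pattern, here is a standalone C sketch of the same logic as a pure function, plus a tiny simulation. The `D3DLOCK_*` values match the Direct3D 9 headers and the pool values mirror `D3DPOOL`. The `sketch_pool` enum, both function names and the simulated access pattern are hypothetical; they are not nine code and not a trace of the game.

```c
#include <assert.h>

/* Lock flag values as defined in the Direct3D 9 headers. */
#define D3DLOCK_NOOVERWRITE 0x00001000u
#define D3DLOCK_DISCARD     0x00002000u

/* Stand-in for D3DPOOL, with the same numeric values. */
enum sketch_pool {
    SKETCH_POOL_DEFAULT = 0,
    SKETCH_POOL_MANAGED = 1,
    SKETCH_POOL_SYSTEMMEM = 2
};

/* Same logic as the hack: drop app-provided DISCARD on SYSTEMMEM buffers,
 * then force DISCARD when a NOOVERWRITE lock restarts at offset 0. */
unsigned adjust_systemmem_lock_flags(enum sketch_pool pool,
                                     unsigned offset_to_lock,
                                     unsigned flags)
{
    if (pool != SKETCH_POOL_SYSTEMMEM)
        return flags;

    flags &= ~D3DLOCK_DISCARD;
    if ((flags & D3DLOCK_NOOVERWRITE) && offset_to_lock == 0)
        flags |= D3DLOCK_DISCARD;
    return flags;
}

/* Simulate an app filling a buffer of `size` bytes with `chunk`-sized
 * NOOVERWRITE locks, wrapping back to offset 0 without DISCARD.
 * Returns how many locks carry DISCARD after the adjustment. */
unsigned count_forced_discards(unsigned size, unsigned chunk, unsigned locks)
{
    assert(chunk > 0 && chunk <= size);

    unsigned offset = 0;
    unsigned discards = 0;
    for (unsigned i = 0; i < locks; i++) {
        unsigned flags = adjust_systemmem_lock_flags(
            SKETCH_POOL_SYSTEMMEM, offset, D3DLOCK_NOOVERWRITE);
        if (flags & D3DLOCK_DISCARD)
            discards++;

        offset += chunk;
        if (offset + chunk > size)
            offset = 0; /* wrap around, as the game appears to do */
    }
    return discards;
}
```

In this simulation a 64-byte buffer filled in 16-byte chunks gets one forced discard per pass over the buffer, so the driver only has to hand out fresh storage once per wrap instead of on every lock.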

With this it gives almost no performance decrease with the smoke effects (>500 fps). However that is with PIPE_USAGE_STAGING, which is a buffer in GTT without WC. In fact when using anything different, (GTT WC or VRAM), the performance decrease is back. This seems to indicate the buffer is filled in a pattern that doesn't make WC happy (or maybe the game reads the data in the buffer).