Halo CE fps drops when particles are enabled, only occurs on wine-nine

ghost commented 4 years ago

The frame rate drops severely in Halo CE (2001) when I enable "Particles" in the video settings menu when using wine-nine. Particles are generated when there is an in-game explosion, and when an explosion occurs there is a sudden drop in frame rate. There is no drop in frame rate when an explosion occurs if I disable the Particles setting, a slight drop if I set Particles to Low, and a large drop if I set it to High.

This issue only occurs in wine-nine; regular wine does not have a severe frame rate drop when there are particles. It occurs regardless of other visual settings or screen resolution. There is no throttling taking place when this occurs (CPU and GPU power management disabled).

The game runs much faster in wine-nine than under regular wine, especially on large custom maps. I normally play with the frame rate capped at 60 fps using Chimera, and it can drop to 20 fps or even lower when many explosions are present. Otherwise, the game runs at a smooth 60 fps even on large custom maps (something regular wine can't do).

Unfortunately I am unable to capture a usable apitrace for Halo CE: the trace plays back with a black screen and buffer overflow errors. The exact issue was already reported to the apitrace developers in 2016 and has not been fixed.

Log:

GALLIUM_HUD=fps,cpu0+cpu1+cpu2+cpu3,GPU-load wine haloce.exe 
008c:err:ntoskrnl:ZwLoadDriver failed to create driver L"\\Registry\\Machine\\System\\CurrentControlSet\\Services\\wineusb": c0000142
0024:err:winediag:wined3d_dll_init Setting multithreaded command stream to 0x1.
Native Direct3D 9 v0.7.0.368-release is active.
For more information visit https://github.com/iXit/wine-nine-standalone
0024:err:winediag:MIDIMAP_drvOpen No software synthesizer midi port found, Midi sound output probably won't work.
Native Direct3D 9 v0.7.0.368-release is active.
For more information visit https://github.com/iXit/wine-nine-standalone
Using profile path .
fixme:d3d9nine:DRIPresentGroup_GetMultiheadCount (0xd2f508), stub!
fixme:d3d9nine:DRIPresentGroup_GetMultiheadCount (0xd2f508), stub!
Loading font fonts\Hack-Bold.ttf...done
Loading font fonts\Interstate-Bold.ttf...done

My system runs Ubuntu 18.04, with wine 5.10 staging and the latest 0.7 wine-nine. GPU is an AMD HD6670 (Mesa 20.0.8, Radeon driver). This hardware would have no problem playing Halo CE maxed out on Windows.

I'm happy to provide additional info if needed. Thanks!

axeldavy commented 3 years ago

Are you comparing performance relative to mesa master or to mesa master + some hacks ?

As for Unigine Valley, as the app is doing a lot of discards during its rendering loop, it makes sense for it to use a lot of memory. Maybe this is a different issue to yours.

axeldavy commented 3 years ago

For Unigine Valley, I get low GTT usage if I disable csmt (csmt_force=0) and buffer_upload. If I use csmt or buffer_upload I get high GTT usage. I think it comes down to pipeline flushes. I'm not sure the GTT usage really is lower; it probably just marks the buffer as released. Did you have csmt on when you managed to get low GTT without buffer_upload?

axeldavy commented 3 years ago

@dungeon007 It seems that when csmt is ON or when buffer_upload is not NULL, we can overconsume GTT if the app is doing a lot of DISCARDs (Unigine Valley does). I understand why (it's due to the gallium resource cache), and I have a few ideas on how to fix it. I'd like to know if the problem you hit could be this one or a different one. Could you tell me if you had csmt OFF or ON in your working examples?

axeldavy commented 3 years ago

@Joshua-Ashton In our tests we had seen that DISCARD does nothing on DEFAULT buffers if they are not busy. Valley locks its index buffers in a round-robin fashion, but instead of using DISCARD then NOOVERWRITE, it uses DISCARD then DISCARD | NOOVERWRITE. We both know that DISCARD is preferred when NOOVERWRITE is set. But here, as the locked region was not busy since the last DISCARD, I am wondering if that could be a case where the DISCARD is ignored. Do you know the answer?
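
For readers unfamiliar with the pattern discussed here, the conventional "DISCARD then NOOVERWRITE" usage of a dynamic buffer can be sketched as below. This is only an illustration of the application side, not code from Valley, nine, or DXVK. The struct and function names are hypothetical, and the two flag values are copied from the D3D9 headers so the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Lock flag values as defined in d3d9types.h, repeated here so the
 * sketch compiles without the Windows SDK headers. */
#define LOCK_NOOVERWRITE 0x00001000u
#define LOCK_DISCARD     0x00002000u

/* Hypothetical app-side state for one dynamic buffer used as a ring. */
struct ring_buffer {
    size_t size;   /* total buffer size in bytes */
    size_t cursor; /* first byte not handed out since the last DISCARD */
};

/* Reserve `bytes` in the ring and return the lock flags to use.
 * The chosen byte offset is written to *offset. */
static unsigned ring_reserve(struct ring_buffer *rb, size_t bytes,
                             size_t *offset)
{
    unsigned flags;

    assert(bytes > 0 && bytes <= rb->size);

    if (bytes > rb->size - rb->cursor) {
        /* Not enough room left: wrap around and ask for fresh storage. */
        rb->cursor = 0;
        flags = LOCK_DISCARD;
    } else {
        /* This range was not handed out since the last DISCARD, so no
         * pending draw can be reading it: promise not to overwrite. */
        flags = LOCK_NOOVERWRITE;
    }

    *offset = rb->cursor;
    rb->cursor += bytes;
    return flags;
}
```

With a 100-byte ring, three 40-byte reservations in a row would get NOOVERWRITE, NOOVERWRITE, then DISCARD on the wrap. Valley's variant described above differs in that it passes DISCARD | NOOVERWRITE on the non-wrapping locks.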

Joshua-Ashton commented 3 years ago

Potentially it could be, I am not sure though. I might have a bash at implementing dropping of discards in the non-contested case in DXVK.

dungeon007 commented 3 years ago

@axeldavy Nope, the PoP case is not affected by csmt on or off. I remember when I was testing this two years back, I didn't need to zero it. I could bump it more and GTT usage gradually went up more and more: I could set it up to 1.5 there instead of 4, but at 2 or greater the bug starts. Well, probably some app gave me better perf with 0, so I went with that 🤣 Just to mention, this APU defaults to 512MB of VRAM, but I could set anything from 64MB up to 2GB... and going up from the default didn't matter for this bug at all.

dungeon007 commented 3 years ago

I think it was like i could set it 0 or 0.5 or 1 or 1.5... but on 2 there we go. No idea why, might be somehow some 32bit limit came in there or whatever 🤣

axeldavy commented 3 years ago

@Joshua-Ashton The reason I never cared about not discarding in the non-contested case is that I have never seen any app discarding in a non-contested case. So it's not worth the added overhead in my opinion.

@dungeon007 So, just to confirm, PoP is fine with csmt on and buffer_upload set to NULL ?

dungeon007 commented 3 years ago

Sure, everything seems fine if i set it to null 🤣

axeldavy commented 3 years ago

@dungeon007 Well, in that case I will need you to do more tests, because I won't figure it out alone. Would you mind?

@dungeon007 I think I know how to fix the artifacts you have seen. Another user reported them on the mesa merge request. However I have a question for you about the performance drop: Are we talking about a performance drop below 60 fps or not ? Because if it is a performance drop, but performance is excellent, I might not bother with the remaining optimization to fix this drop.

axeldavy commented 3 years ago

@dungeon007 If you want to help debug buffer_upload, could you come on irc's #d3d9 channel ? It will be easier to talk.

The first thing to try is that:

This->buffer_upload = nine_upload_create(This->context.pipe, 4 * 1024 * 1024, 4);

in device9.c (NOTE: this requires csmt to be set to OFF, or it might lead to corruption), and then this in nine_buffer_upload.c:

static void
nine_upload_destroy_buffer_group(struct nine_buffer_upload *upload,
                                 struct nine_buffer_group *group)
{
    DBG("%p %p\n", upload, group);
    DBG("Release: %p %p\n", group->map, group->map+upload->buffers_size);
    assert(group->refcount == 0);
    if (group->transfer)
        pipe_transfer_unmap(upload->pipe, group->transfer);
    if (group->resource)
        pipe_resource_reference(&group->resource, NULL);
    upload->pipe->flush(upload->pipe, NULL, 0);
    group->transfer = NULL;
    group->map = NULL;
}

Then compile and test with csmt OFF.

axeldavy commented 3 years ago

Correction, the line should be: This->buffer_upload = nine_upload_create(This->context.pipe, 4 * 1024 * 1024, 4); and not whatever I put in the first version of my message (in case you use the mails to patch your code).

dungeon007 commented 3 years ago

Well, fire up what is on your mind. Dont have time right now, but will look couple hours later. 🤣

axeldavy commented 3 years ago

Alright, I'm hoping this patch: https://github.com/iXit/Mesa-3D/commit/e8c93d57445342d612b54257aed71f208d5235be fixes the performance issues with your games using SYSTEMMEM.

dungeon007 commented 3 years ago

First thing, csmt off: https://i.postimg.cc/yNFhFtz7/first-thing.png gtt goes down, fine... well, still not enough to bring perf up to 60. csmt on did not lead to corruption, it just locked up the machine here. 🤣 Anyway, second thing?

dungeon007 commented 3 years ago

"Alright, I'm hoping this patch: iXit/Mesa-3D@e8c93d5 Fixes the performance issues with your games using SYSTEMMEM." So, now we are talking about several things there at the same time 🤣... to apply that systemmem fix to mesa master or as is with ixit tree ?

dungeon007 commented 3 years ago

Ah, you probably mean to fix these flickerings/corruptions, OK that 🤣

axeldavy commented 3 years ago

Nope, I really meant about the performance of the systemmem work of a few days ago

axeldavy commented 3 years ago

"First thing, csmt off: https://i.postimg.cc/yNFhFtz7/first-thing.png gtt goes down, fine... well, still not enough to bring up perf to 60."

Well, this part makes sense at least. The part that doesn't make sense is why you don't have the exact same phenomenon when csmt is ON and the path is disabled.

axeldavy commented 3 years ago

"Nope, I really meant about the performance of the systemmem work of a few days ago"

And the patch had errors. Fixed now.

axeldavy commented 3 years ago

Forget about testing these, it needs serious rework.

dungeon007 commented 3 years ago

"Alright, I'm hoping this patch:" iXit/Mesa-3D@e8c93d5 Fixes the performance issues with your games using SYSTEMMEM. "Nope, I really meant about the performance of the systemmem work of a few days ago. And the patch had errors. Fixed now." On that point, perf is going back in DS2, WC3... so, consider that as fixed.

dungeon007 commented 3 years ago

"Forget about testing these, It needs serious reworks." He, he, it looked fine to me 🤣

axeldavy commented 3 years ago

Well the range of the data requested to be uploaded was incorrect. It only worked because we uploaded more than needed (the region dirtied by recent locks). If I only upload what I computed as needed for the draw call, it becomes a vertex mess.

dungeon007 commented 3 years ago

Dirty work. Anyway, just to tell you that I don't seem to have the GTT issue with PoP with your tree when csmt is off. With mesa master, csmt made no difference. Well, it might be the same issue as with Unigine. 🤣

axeldavy commented 3 years ago

Alright, what about this branch ? https://github.com/iXit/Mesa-3D/tree/nine_dynamic_systemmem

I no longer track the regions uploaded. I just upload the data needed for the current draw call, even if it was already there. And I use a discard/nooverwrite pattern for my uploads if nothing bad happens. Thus rendering shouldn't have artifacts. However, performance might be a bit lower than before.
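
As a rough illustration of "the data needed for the current draw call", the byte range a non-indexed draw reads from a vertex buffer follows from the stream offset, the stride, and the vertex range. The sketch below is an assumption about how such a range could be computed, not the code in the branch. All names are hypothetical, and it uses the stride as a conservative size for the last vertex.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open byte range [start, end). */
struct byte_range {
    size_t start;
    size_t end;
};

/* Bytes of a vertex buffer read by a non-indexed draw.
 * All parameter names are hypothetical. */
static struct byte_range draw_range(size_t stream_offset, size_t stride,
                                    size_t first_vertex, size_t vertex_count)
{
    struct byte_range r;

    assert(stride > 0);

    r.start = stream_offset + first_vertex * stride;
    r.end = r.start + vertex_count * stride;
    return r;
}

/* Clamp the range to the buffer size, so that a draw pointing past the
 * end of the buffer never produces an out-of-bounds upload. */
static struct byte_range clamp_range(struct byte_range r, size_t buffer_size)
{
    if (r.start > buffer_size)
        r.start = buffer_size;
    if (r.end > buffer_size)
        r.end = buffer_size;
    return r;
}
```

For example, a draw of 6 vertices starting at vertex 10, with a 32-byte stride and a 16-byte stream offset, reads bytes 336 to 528. If that range is computed too small, the draw reads stale data, which would match the "vertex mess" described earlier in the thread.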

dungeon007 commented 3 years ago

Still not good, there are still artifacts with that one too: Need for Speed Hot Pursuit 2 (d3d8to9), Indiana Jones and The Emperor's Tomb (d3d8to9), Dungeon Siege 1 (ddraw)... in native d3d9 apps I didn't spot it. It seems to happen only via these wrappers. In Indiana I had to play to somewhere near the end of the second level before they started to appear... and that vertex mess on the entire screen appeared only when I wanted to take the gun, that was funny... 🤣

axeldavy commented 3 years ago

Were those artifacts already there before the patchset? They might be unrelated to SYSTEMMEM.

I'm curious if these wrappers have issues with native too

dungeon007 commented 3 years ago

Nope, it is not about these wrappers... Singularity gave me the idea to try some UE3 engine D3D9 game, JUJU, and it is full of artifacts too. And before that, the game goes OOM: it crashed with the defaults, and once the machine locked up. Let's pass csmt off, let's pass AMD_DEBUG=mono... I mean, my GTT zero hack avoids all of that bullshit and all of these OOMs. 🤣

axeldavy commented 3 years ago

Were the artifacts before the patchset or they are new ?

On Wed, 10 March 2021 at 07:49, dungeon007 notifications@github.com wrote:

Nope, it is not about these wrappers... Singularity gave me idea, to try some UE3 Engine D3D9 game JUJU and it is full of artifacts too. And before that game goes OOM, crashed up with default, once machine locked up. Lets pass csmt off, lets pass AMD_DEBUG=mono... i mean all of that bullshit mine GTT zero hack avoid all of these OOMs. 🤣


dungeon007 commented 3 years ago

Well, in ixit master or that, mesa master is (still) fine... if patched 🤣

axeldavy commented 3 years ago

Well, that is very weird because, as I said, we should be sending the data requested by the draw call every time...

So it should have been equivalent in terms of data seen by the GPU for its draw call.

I see two possibilities:
- The tracking of regions needed by the draw call is still broken.
- The previous code caused a wait on the GPU when the SYSTEMMEM buffers were locked. Maybe this behaviour fixes sync issues with other pool types.

axeldavy commented 3 years ago

Alright, best performance as of yet for Halo with this new branch: https://github.com/iXit/Mesa-3D/tree/nine_dynamic_systemmem_alternative

I gave up on the initial approach of having systemmem locks not dirty the whole buffer (Intel behaviour). On the other hand, for SYSTEMMEM DYNAMIC, I try to aggressively produce DISCARD/NOOVERWRITE patterns to fill my GPU version of the resource, while avoiding too many DISCARDs. For example, the vertex buffer used for the smoke effect is read out of order, but still generates one discard + many NOOVERWRITE.
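
One way to picture "one discard + many NOOVERWRITE" for out-of-order uploads is to remember which byte ranges of the GPU copy were written since the last DISCARD, and to discard only when a new upload would overlap one of them. The sketch below is a guess at the general idea, under that assumption. It is not the code in the branch, the names are hypothetical, and it only decides the flags (a real implementation would also have to deal with data lost on DISCARD).

```c
#include <assert.h>
#include <stddef.h>

/* Lock flag values as defined in d3d9types.h, repeated here so the
 * sketch compiles without the Windows SDK headers. */
#define LOCK_NOOVERWRITE 0x00001000u
#define LOCK_DISCARD     0x00002000u

#define MAX_RANGES 64

/* Hypothetical tracker: ranges [start, end) uploaded since the last
 * DISCARD, which pending draws may still be reading. */
struct used_ranges {
    size_t start[MAX_RANGES];
    size_t end[MAX_RANGES];
    size_t count;
};

/* Decide the flags for uploading bytes [start, end) and record the range. */
static unsigned upload_flags(struct used_ranges *u, size_t start, size_t end)
{
    unsigned flags = LOCK_NOOVERWRITE;
    size_t i;

    assert(start < end);

    if (u->count == MAX_RANGES)
        flags = LOCK_DISCARD; /* out of tracking slots: be conservative */

    for (i = 0; i < u->count && flags != LOCK_DISCARD; i++) {
        if (start < u->end[i] && u->start[i] < end)
            flags = LOCK_DISCARD; /* would overwrite bytes possibly in use */
    }

    if (flags == LOCK_DISCARD)
        u->count = 0; /* fresh storage: nothing is in use any more */

    u->start[u->count] = start;
    u->end[u->count] = end;
    u->count++;
    return flags;
}
```

With this scheme, uploads to [300, 400), [0, 100) and [100, 300) in that order all get NOOVERWRITE even though they arrive out of order, and only a later upload that touches an already written range triggers a DISCARD.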

dungeon007 commented 3 years ago

Well, still artifacts... well, i think i will upload trace of JUJU - UE3 engine game that shows artifacts, so you can check. 🤣

dungeon007 commented 3 years ago

Dropbox keeps losing the connection when uploading somewhat larger files, plus my upload speed isn't great at all, so this took a while... anyway, here is your artifact check from gdrive 🤣 https://drive.google.com/file/d/1GViu6ozLqZ5MhIhEGxQCqRDBArjJZOBV/view?usp=sharing

axeldavy commented 3 years ago

It looks like it's not the SYSTEMMEM management but rather the patch "st/nine: Optimize DrawPrimitiveUp" which has issues.

axeldavy commented 3 years ago

Does the branch work now ?

dungeon007 commented 3 years ago

Looks fine now, no artifacts in halo, juju, neither in cases with ddraw, d3d8...

axeldavy commented 3 years ago

What about performance ? Is it as good as the best you achieved with your hacks ?

dungeon007 commented 3 years ago

Not completely. It is good for Blood Rayne, which now performs fine... but that was read_write, and the only thing that is still slow (in comparison to Windows) is pipe_map_write. Do you want a trace of some kind of worst-case scenario of that?

axeldavy commented 3 years ago

I don't understand what you mean about read_write and pipe_map_write. And how does the fps compare to Windows? Yes, if you have a worst-case scenario, a trace can help check that the pattern is OK in the affected scene.

dungeon007 commented 3 years ago

A trace of that becomes something like 6.1 GB just from entering a menu there, so it will be mission impossible... I will probably try to do it on Windows to see what happens. Everything is and always was slow there on Linux, while on Windows it is:
- native d3d8: 36 fps
- d3d8to9: 26 fps (we should be somewhere there, my hack does 20 fps)
And what happens with that on Linux:
- wined3d: 4 fps
- nine: 1-2 fps
- dxvk: 0.5 fps (half of one)
Ultimate slowness there 🤣

axeldavy commented 3 years ago

Then give me the log with "csmt_force=0 NINE_DEBUG=vbuf,device" set. Please kill the game (alt-f4) in the affected scene.

dungeon007 commented 3 years ago

No worries, there will be a trace in the next couple hours 🤣 compressed seems will be much smaller, still compressing... wondering if 7z or xz are better to compress this trace.

axeldavy commented 3 years ago

In the end when I run the trace I'll run exactly the trace with these options to get the log. So if the log is much smaller to send, it's a win-win

dungeon007 commented 3 years ago

There you go: https://drive.google.com/file/d/1V_JT0FAPUejBMTXPXzakSBJfoMETEAgd/view?usp=sharing Warning! 6.1 GB trace in it 🤣 Description: it goes from normal performance (logos) to molasses. Ignore logos, after logos it starts. Performance isnt fine already after logos, goes to worse once menu pops up and the worst case scenario by the end and deeper in the menu.

axeldavy commented 3 years ago

Whoa, so the game is locking with flags 0, locking the full buffer, on a buffer in D3DPOOL_DEFAULT. It's doing lock/unlock, draw, lock/unlock, draw. Every draw uses the same piece of the buffer.

Upon seeing that, I began thinking "I don't do dark magic, this is never going to be fast no matter what, unless there is a per-game workaround". Then I had an idea. I checked the beginning of the log and saw "nine:device9:ctor: Application asked full Software Vertex Processing", and that explained it all.

I think I'm going to put the DEFAULT pool in SYSTEMMEM when full software vertex processing is requested. With my patchset the performance in the use pattern won't be optimal, but it should still be very good.
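
The decision described here can be pictured as a small remapping of the requested pool at buffer creation time. The sketch below is only an illustration of that idea, not the code pushed to the branch. The function names are hypothetical, and the pool and device-creation flag values are copied from the D3D9 headers so it compiles standalone.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as in the D3D9 headers (D3DPOOL_*, D3DCREATE_*), repeated
 * here so the sketch is standalone. */
enum pool { POOL_DEFAULT = 0, POOL_MANAGED = 1, POOL_SYSTEMMEM = 2 };
#define CREATE_SOFTWARE_VERTEXPROCESSING 0x00000020u
#define CREATE_HARDWARE_VERTEXPROCESSING 0x00000040u

/* True when the application asked for full software vertex processing
 * at device creation. */
static bool full_software_vp(unsigned behavior_flags)
{
    return (behavior_flags & CREATE_SOFTWARE_VERTEXPROCESSING) != 0;
}

/* Hypothetical helper: pick the pool actually used to back a buffer.
 * DEFAULT buffers are redirected to SYSTEMMEM under full software
 * vertex processing; everything else is left untouched. */
static enum pool effective_buffer_pool(enum pool requested,
                                       unsigned behavior_flags)
{
    assert(requested == POOL_DEFAULT || requested == POOL_MANAGED ||
           requested == POOL_SYSTEMMEM);

    if (requested == POOL_DEFAULT && full_software_vp(behavior_flags))
        return POOL_SYSTEMMEM;
    return requested;
}
```

The point of such a redirect is that an application locking the whole buffer with flags 0 before every draw then hits the SYSTEMMEM path, where only the data needed by each draw is uploaded.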

axeldavy commented 3 years ago

Alright, there you go. Branch https://github.com/iXit/Mesa-3D/tree/nine_dynamic_systemmem_alternative updated.

dungeon007 commented 3 years ago

And now that is my speed, just by default, as it should be. 🤣 BTW, csmt off is needed on this one for full speed.

axeldavy commented 3 years ago

I'm not surprised csmt off might give a small boost on this one. How is the fps ?

iXit / wine-nine-standalone

Halo CE fps drops when particles are enabled, only occurs on wine-nine #85