rust9x / rust

Empowering everyone to build reliable and efficient software, even for Windows 9x/Me/NT/2000/XP/Vista.
https://github.com/rust9x/rust/wiki
Other
323 stars 10 forks source link

RWLock fallbacks seem to cause lockup in Firefox #32

Closed K4sum1 closed 2 weeks ago

K4sum1 commented 3 weeks ago

I tried to fork Rust9x to completely remove TryAcquireSRWLock and either replace the two functions with non-Try versions or critical sections.

However when I look into the code, it seems it tries to use them first, and if it can't find the functions, it then falls back to critical sections.

Something here is funny though, because in Firefox compiled with rust9x on a system without TryAcquireSRWLock but otherwise has SRW locks, the browser just freezes after a minute. If I hex edit out the Try part, the browser just works fine.

So either the fallback fails and causes the whole thing to freeze, or something fails when it hits critical sections, or something else idk.

If this is only an issue with Firefox, please give me some pointers as to how I could trace the issue, because I literally have no mentions of TryAcquireSRWLock anywhere in the source code of my modified version. This call is only from Rust.

Also idk if intentional, but when replicating this commit https://github.com/rust9x/rust/commit/08798a9d3a7c6801ce1d16abd35aef5adbc6581a I noticed that in c.rs when you add TryEnterCriticalSection, you don't also replicate it in windows_sys.rs. Now I notice the code for the function itself looks the same as any other, so I assumed that if it's there in c.rs, it doesn't need to be in windows_sys.rs. However I also notice you just left all the SRW lock and ConditionVariable functions in windows_sys.rs. I'm not sure if this is a mistake or intentional, but it doesn't look right to me.

K4sum1 commented 3 weeks ago

I found a single try_lock call in Firefox, and I was able to easily remove it, but the issue still occurs. I have tested with hex editing out Try that it still works and it's not my fix.

seritools commented 3 weeks ago

However I also notice you just left all the SRW lock and ConditionVariable functions in windows_sys.rs. I'm not sure if this is a mistake or intentional, but it doesn't look right to me.

I keep the changes to windows_sys.lst/rs as small as possible to make rebasing easier in the future. Unused declarations don't make it into the binary.

So either the fallback fails and causes the whole thing to freeze,

You could try forcing different Mutex kinds here and see if the behavior changes: https://github.com/rust9x/rust/blob/4ba01ecb3ada5354050a2ab5e54115dd6d92ebb2/library/std/src/sys/windows/locks/mutex/compat.rs#L27 I have tested the basic lock detection on Win 98SE (CreateMutex), Win XP SP3 (CriticalSection), Win 11 (SRWLocks) via the sample program insofar that they seem to work fine and lock and unlock properly. There have not been any stress tests or anything, so I can't guarantee that there isn't something wrong.

If this is only an issue with Firefox, please give me some pointers as to how I could trace the issue

No idea if it's just firefox, but you could trace the loading/init process and hook the init code or even GetProcAddress to figure out where it loads the SRWLock stuff. Also could just break whenever it accesses one of the sync functions you're interested in.

Also idk if intentional, but when replicating this commit https://github.com/rust9x/rust/commit/08798a9d3a7c6801ce1d16abd35aef5adbc6581a I noticed that in c.rs when you add TryEnterCriticalSection, you don't also replicate it in windows_sys.rs.

Only things that will be added to the statically dllimported (= land in the import table header of the binary) have to be added to windows_sys.lst/rs. Since TryEnterCriticalSection is optional and not exported on all systems, it is loaded at runtime. If i didn't do that, binaries would fail loading with "Import kernel32!TryEnterCriticalSection not found" or whatever the error message from the windows loader is.

Notice how every function (excluding WinSock) I added to windows_sys.lst is available on every single 32-bit Windows kernel.

K4sum1 commented 3 weeks ago

I keep the changes to windows_sys.lst/rs as small as possible to make rebasing easier in the future. Unused declarations don't make it into the binary.

Only things that will be added to the statically dllimported (= land in the import table header of the binary) have to be added to windows_sys.lst/rs. Since TryEnterCriticalSection is optional and not exported on all systems, it is loaded at runtime. If i didn't do that, binaries would fail loading with "Import kernel32!TryEnterCriticalSection not found" or whatever the error message from the windows loader is.

Notice how every function (excluding WinSock) I added to windows_sys.lst is available on every single 32-bit Windows kernel.

These seem to conflict, but I guess it works.

You could try forcing different Mutex kinds here and see if the behavior changes:

Next time I build Rust, I'll try that.

No idea if it's just firefox, but you could trace the loading/init process and hook the init code or even GetProcAddress to figure out where it loads the SRWLock stuff. Also could just break whenever it accesses one of the sync functions you're interested in.

Debug builds don't work too well on Vista, but I figure worst case I can just malform the TryAcquireSRWLock functions by hex editing to not match on 7+ and try that way.

seritools commented 3 weeks ago

Let me know if you figure out more! 🙏

K4sum1 commented 3 weeks ago

I force set it to CriticalSection, and I have the issue on both Vista and 10, and hex editing the TryAcquireSRWLock call does not fix it.

I'll try Legacy tomorrow, I am too tired right now.

K4sum1 commented 3 weeks ago

Legacy mutex's seemed to last longer, but I still got the same problem. The browser works from anywhere from a few seconds to a few minutes before freezing.

I did get the funniest error in testing though

[Child 6968, MediaDecoderStateMachine #1] WARNING: 1765e560 Could not set cubeb stream name.: file C:/Users/Admin/Documents/GitHub/r3dfox/dom/media/AudioStream.cpp:321
[Child 6968, MediaDecoderStateMachine #1] WARNING: 1ca72e80 Could not set cubeb stream name.: file C:/Users/Admin/Documents/GitHub/r3dfox/dom/media/AudioStream.cpp:321
CCCCCCrrrrrraaaaaasssssshhhhhh      AAAAAAnnnnnnnnnnnnoooooottttttaaaaaattttttiiiiiioooooonnnnnn      GGGGGGrrrrrraaaaaapppppphhhhhhiiiiiiccccccssssssCCCCCCrrrrrriiiiiittttttiiiiiiccccccaaaaaallllllEEEEEErrrrrrrrrrroorooorrorrr::r:::  :   || |||[[|[[[CC[CCC00C000]]0]]][[][[[GG[GGGFFGFFFXXFXX
X11X111--1---]]-]]]::]:::  :   CC CCCooCooommommmppmpppoopooossossisisiititittototoorororrBrBrBBrBrBrrirriidiiiddgddgdgeggegeCeCeeChhCCCihhihlhiilidildll l dddrdr   e errrcrceeeeeecccicieeevveiieieivvsvsvee e essIsIs  P P IICICIPP  PPCcCcCCl  l  occoccsllslleooeoo ss ssweeweei  i  twwtwwhiih
ii tt ttrhhrhhe e   ararrsseerooeaaennassa==soosAAonnobb=n=nnnA=A=oobAbArrnbnbmmononaaorrollrmmrSSmaamhhallauulSSlttShhSddhuuhoouttuwwtddtnndood  owwo((wnnwttn  n== (( 11(tt(85t==t78=11=912781..585877334758...))6161  .7519[[9))5GG)  )FF [ [XX[G[11GGFG--FFXF]]XX1X::11-  1--]CC-]]:oo]:: mm:  C
pp CCooCooosmmomsipmppiotpootsoossoirsiirtBitBtortororiorirBdrBdBrgBrgrierieidCidCdghdghgeigeieCleClChdChdhi hi ilrilrldelded cd c re rereireiecvecvceeceeeiseisiv iv veIveIesPesPs Cs C I  I IPcIPcPClPCl oCC ocs  csleccleo llo swooswiesseit ee thw  wh iwwi rtiitrehtthea hh asr  rsoerreonaeean
=asa=sAsosAobonobnnn=nn=o=A=oArAbrAbmbnmbnanoanolorlorSrmSrmhmahmuaaluatllStldSShdSohhuohwuutwunttdnt
ddo
odwowonwnw
n
n
seritools commented 3 weeks ago

I wonder if Firefox (probably rightly?) assumes that an RWLock can be read-locked multiple times. The fallbacks to a critical section or mutex of course won't allow that, which might cause a deadlock. In that case Rust9x would need to implement a completely separate fallback RWLock implementation, e.g. based on CreateSemaphore (write lock = acquiring tickets/counts) or other primitives supported on older systems (maybe something from here works)

K4sum1 commented 3 weeks ago

In that case Rust9x would need to implement a completely separate fallback RWLock implementation, e.g. based on CreateSemaphore (write lock = acquiring tickets/counts) or other primitives supported on older systems (maybe something from here works)

You might be right, I was also experimenting with the Rust used to create Mypal68 (Firefox 68 for XP) and I noticed Semaphores are used in it. If you want to see the patch for it, here it is.

https://github.com/Eclipse-Community/eRust/commit/a7d5b8cabb9911e293c5329b2dd78ba26af688cc

It is 1.45.2 though, and I think I would need at minimum 1.66.0 for 115, and 1.76.0 for 128. I originally wanted to make a fork of 1.77.2 since that works for both 115 and 128, but I couldn't figure out how to port the changes to 1.77.2, however I was kinda doing it wrong in the first place so I could probably do it if I tried again.

seritools commented 3 weeks ago

Some of the implementations in that commit look super interesting!

The RWLock impl there however doesn't seem to use semaphores or allow multiple read locks at the same time either, it also just falls back to std's old ReentrantMutex implementation. That implementation doesn't exist anymore in current Rust std (PR). That old reentrant mutex impl used critical sections, but the wrapper in SRWLock around it prevents reentrant locking too.

K4sum1 commented 3 weeks ago

I also suspect that the freeze is GFX related, and I know modern Firefox uses WebRender, which Mypal68 doesn't have. I would love to try building an older Firefox version that still allowed for disabling WebRender, however finding all the dependencies for that sounds like hell.

seritools commented 3 weeks ago

Hmmm, I'm pretty sure my current RwLock fallback implementation is just wrong (and the one in your linked commit too!). This example should panic immmediately, as the fallbacks guard against reentrancy. I've created #33 for it.

Not sure when I get to updating rust9x though at the moment.

K4sum1 commented 3 weeks ago

I guess I can look at the rwlock implementations in Firefox and see if anything would cause a crash or have this issue.

K4sum1 commented 2 weeks ago

I don't know what I am looking for.

I've traced it to this file as it's the only WebRender file to use std::sync::RwLock, but I'm not sure what in it would be.

https://github.com/mozilla/gecko-dev/blob/release/gfx/wr/wr_glyph_rasterizer/src/rasterizer.rs#L28

seritools commented 2 weeks ago

I mean the RWLock fallback implementations are just incorrect.

You could try replacing the Arc<RwLock<..>> with an Arc<Mutex<...>> and test again. If the broken RWLock fallback is the cause, the Arc Mutex variant should also lock up on modern systems.

K4sum1 commented 2 weeks ago

I mean the RWLock fallback implementations are just incorrect.

Well, yeah, but it still partially works. So I would want to mitigate the issue for now.