multitheftauto / mtasa-blue

Multi Theft Auto is a game engine that incorporates an extendable network play element into a proprietary commercial single-player game.
https://multitheftauto.com
GNU General Public License v3.0
1.38k stars 424 forks source link

SD 22 and SD 23 occurences have increased drastically lately #1340

Open zo3n opened 4 years ago

zo3n commented 4 years ago

Describe the bug I am not really sure how to describe this other than the title. While administering another server I can notice that SD 22 and SD 23 are shown in like 15 players out of 150. This is not normal. SD 22 seems to happen the most. I have recieved the error now myself:

HTTP server file mismatch (admin) admin_spectator.lua [Got size:0 MD5:00000000000000000000000000000000, wanted MD5:8DB609CB32CA48997CE2CF5BAF9DDB7D]
Download error: HTTP server file mismatch (admin) admin_spectator.lua [Got size:0 MD5:00000000000000000000000000000000, wanted MD5:8DB609CB32CA48997CE2CF5BAF9DDB7D]

Last time I encountered HTTP errors like this was in 2015 Some of my friends have also encountered this MD5 0000 thing within last month.

I won't say with absolute certainty that this is not normal, but I have a good feeling that the frequency of it has increased within last week or month. That is the only reason I'm reporting it; so that it can get some attention, or even better, be solved or explained.

To reproduce I don't know

Expected behaviour To not happen

Version I don't know

Additional Information Restarting the resource affected doesn't solve the issue, and still spits out errors:

Unable to delete old file admin\client\gui\admin_spectator.lua
HTTP server file mismatch (admin) admin_spectator.lua [Got size:0 MD5:00000000000000000000000000000000, wanted MD5:8DB609CB32CA48997CE2CF5BAF9DDB7D]
Download error: HTTP server file mismatch (admin) admin_spectator.lua [Got size:0 MD5:00000000000000000000000000000000, wanted MD5:8DB609CB32CA48997CE2CF5BAF9DDB7D]

This results in admin panel being bugged without having any other fix other than restarting the server, which can become a problem if it's a production server.

zo3n commented 4 years ago

image

I have noticed that when i try to delete affected resource from cache, it doesn't allow me to delete it "because it's in use", even if i am not connected to any server

Additionally, when I tried to copy the affected resource (ipb in my newer case) it says that it cannot do that because it is open in Multi Theft Auto WOW64 Helper

ccw808 commented 4 years ago

What happens if you try deleting the individual files in the folder. Do you get the same error for every file?

zo3n commented 4 years ago

Other files (which weren't corrupted) can be deleted successfully, while affected one(s) cannot due to being used by WOW64 Helper

This is what I got when trying to delete hook.lua from ipb

image

ccw808 commented 4 years ago

WOW64 helper should not be using any files. What is the MD5 of MTA\wow64_helper.exe ? What is the MD5 of MTA\libwow64.dll ? What is the version of MTA\libwow64.dll ? (Right click->Properties->Details->File version)

zo3n commented 4 years ago

MTA\wow64_helper.exe: FE8544D6C350C0559741029F8E1A3607 MTA\libwow64.dll: 7324074994D1C9F8850EF56F1D5D3C37 MTA\libwow64.dll Version: 1.5.7.20447

I should note that I have accidentally (brain on autopilot) closed my MTA and now I won't be getting the errors for a while, but I'm sure they'll reappear like they always do lately (saying this in case if there will be another test with cache files, that i'll need to get the error again to do the test)

ccw808 commented 4 years ago

If it happens again, try terminating wow64_helper.exe using Windows Task Manager

qaisjp commented 3 years ago

I just encountered this randomly now, on Multi Theft Auto v1.5.8-release-20704:

image

Extra context:

Pirulax commented 3 years ago

CCheckSum is default initalized to null. (this is why the md5 is full null) And CCheckSum remains null if the file coudn't be opened(due to maybe the HTTP server still writing to it, or something) when the checksum needs to be generated (client side), and we don't retry, thus the generated checksum is null, and it's considered valid, so this pops up. Now, the fix would be to log the error, and retry generating the checksum. I have a pr #1511 which could address this issue.

Dutchman101 commented 3 years ago

This issue is almost guaranteed to happen when you join a server with a decent total download size for the first time.. as in, near completion it will read/write lock a random file, which will remain stuck (and in use) until you restart MTA, even if you remain disconnected from anything for a while. As in, with a decently sized DL, it's probably going to happen to some file.

I can tell the majority of occurences happens in this way - after which player is forced to restart MTA and can then completely download server resources, but in multiple runs. Idk if the file being stuck is a result of bad implementation (e.g download error > file remains in use and locked) or if the download error is a result of the file becoming stuck for some reason.

Anyways, servers using both internal and external HTTP are affected, so we're probably looking for a client downloader bug.

Having a null MD5 string can indicate it opens a handle to the file that's about to be downloaded, but doesn't write the contents to it (due to being read/write locked.. other such cases also show this), after which it also doesn't release the handle so it remains stuck and in use.

qaisjp commented 3 years ago

This issue is almost guaranteed to happen when you join a server with a decent total download size for the first time.

Sounds like a potential race condition between downloader and checker. (I don't know how distinct those two things are in our code.)

player is forced to restart MTA [..] Idk if the file being stuck is a result of bad implementation (e.g download error > file remains in use and locked) or if the download error is a result of the file becoming stuck for some reason.

In my case I just tried to reconnect and it was fine. And yeah I agree that maybe some file handle is being left open somewhere.


One obvious issue is that this code just returns null if File::Fopen returns null for some reason (which will happen for any error and not just the file being missing, as the comment suggests):

https://github.com/multitheftauto/mtasa-blue/blob/a4267de45f24564e02054a99dac13b9149fe6b2d/vendor/bochs/bochs_internal/crc32.cpp#L31-L60

And we just use that result without doing any error checking:

https://github.com/multitheftauto/mtasa-blue/blob/a4267de45f24564e02054a99dac13b9149fe6b2d/Shared/sdk/CChecksum.h#L32-L39

brzys commented 3 years ago

I can say it also depends on HTTP server performance. With internal HTTP server the issue can occur quite frequently, especially with larger downloads. Can't really see this on nginx or Apache.

AfuSensi commented 3 years ago

This has been happening quite regularly on our server, using the internal HTTP server. The only thing that works for our players is removing the affected file from the client's resources folder and reconnecting.

I have confirmed the checksum is correct in both the server's resource and cache directories, but is corrupted on the client's PC.

Unfortunately i am unable to do more tests since i have not found a way to reproduce.

qaisjp commented 3 years ago

One easy to way to trigger the 00000 error message is by following these instructions:

  1. Open PowerShell
  2. Type $file = [System.IO.File]::Open("C:\Users\qaisjp\Documents\mtasa-blue\Bin\mods\deathmatch\resources\freeroam\gui.lua", "Open", "Read", "None")
    • Update the path to your own client resources folder
    • The final parameter "None" means:

      Declines sharing of the current file. Any request to open the file (by this process or another process) will fail until the file is closed.

    • When you are done testing, in your PowerShell window, type $file.close() to close your handle to the file. You don't need to type this command now, but just make sure you don't overwrite the $file variable by executing this command multiple times. (You could could try to executing it again, assigning to a different variable name, and it should say that it can't open the file.)
  3. Make sure you have <enablesd>22</enablesd> in your server config.
  4. Start your server, start the freeroam resource if you've changed the defaults, and join the server
  5. You should get this: image
  6. When you are done testing, in your PowerShell window, type $file.close() to close your handle to the file.

You can also simulate an invalid checksum by simply changing "None" to "Read", which only allows other processes (like MTA) to read the file, but not write to it. (Note that if you do this, you should probably change the content of gui.lua so that it does not just pick up a valid script with a valid checksum.)

I've got a pull request ready that will improve error messages (for now, only for internal HTTP servers), which you'll probably see in the timeline here in the next couple of minutes.

This issue is almost guaranteed to happen when you join a server with a decent total download size for the first time

So some of the errors will be something from https://docs.microsoft.com/en-us/cpp/c-runtime-library/errno-doserrno-sys-errlist-and-sys-nerr?view=msvc-160 so any of these error messages could describe a user's exact issue. E.g. memory problems, too many files open (unlikely since this is apparently done sequentially, see #1511), etc.


When reading files for checksumming, we open it using File::Fopen (see my previous comment). That looks like this:

https://github.com/multitheftauto/mtasa-blue/blob/a4267de45f24564e02054a99dac13b9149fe6b2d/Shared/sdk/SharedUtil.File.hpp#L959-L966

The _SH_DENYNO option means "Permits [other processes] read and write access". In this particular scenario we could use _SH_DENYWR, which allows read, but denies write access. This should help alleviate some issues. This option is only available for Windows -- I don't think Linux fully supports file locking.

Pirulax commented 3 years ago

https://github.com/multitheftauto/mtasa-blue/blob/a4267de45f24564e02054a99dac13b9149fe6b2d/Client/mods/deathmatch/logic/CResource.cpp#L249 dassert is a debug-mode-only assert. So it does nothing in release.

Basically what I think is happening, that a resource include tries to start this resource while it's files are still being downloaded. Then this line https://github.com/multitheftauto/mtasa-blue/blob/a4267de45f24564e02054a99dac13b9149fe6b2d/Client/mods/deathmatch/logic/CResource.cpp#L325-L331 tries to generate the checksum, and it fails, because (probably) the HTTP server has locked the file, because it's writing data to it.

The solution might be to somehow delay the resource loading until CanBeLoaded() returns true, because im pretty sure it returns false. We could queue resources up - so they still start in order - if one of them can' be loaded at the moment.

Edit: Looking thru the comments have confirmed that it really is the HTTP server. As the checksum matches afterwards, and the file can be deleted, which means that it coudn't be opened before because the HTTP server was writing to it.

Pirulax commented 3 years ago

This also seems to happen if theres a space in the filename. Space in the path causes inconsistencies. Eg.: files/vehicle 10.dff would cause the server to not update the file, or something.. Eg.: image

Dutchman101 commented 3 years ago

This also seems to happen if theres a space in the filename. Space in the path causes inconsistencies. Eg.: files/vehicle 10.dff would cause the server to not update the file, or something..

That's issue #1466

Anyways, here's an update.. after 11f94de (PR #1778) the error messages that fit my story: https://github.com/multitheftauto/mtasa-blue/issues/1340#issuecomment-719798067 .. of the most common download failure, which is when you first download a server's resources or need to download a significant update

Have changed to more descriptive: #1002 HTTP server file mismatch! (resourcename) scriptfile.lua [CRC could not open file: Permission denied]

So does "CRC could not open file: Permission denied" mean something, or is the logging that we added with PR #1778 still not sufficient to find out?

It still sounds like affected files are in use (being occupied) by MTA while the HTTP server wants to modify/overwrite it into its final form. Can that be the meaning of "Permission denied" though?

Pirulax commented 2 years ago

We could make the error more specific by using Win32 API stuff to open the file instead of C stdio (So like using CreateFile instead of fopen). Also.. Maybe by using CreateFile we could mitigate the issue, not sure. Yeah, not the most ideal solution, but could work.

Could also use fopen_s which hands back an error code.

Edit: Took a look at it. It uses errno which should (in theory) have the same functionality. It's a simple "access denied", I don't think we can get any more info out of it. We could perhaps.. retry a few times? Instead of just erroring out? Also.. people say only a restart solves the issue, which leads me to believe it's some file handle leak issue in the HTTP server on the client-side.