mamedev / mame

MAME
https://www.mamedev.org/

chdman createdvd writes corrupt CHD files on CIFS storage #12605

Open denilsonsa opened 1 month ago

denilsonsa commented 1 month ago

MAME version

0.267 (mame0267-dirty)

System information

Manjaro Linux, x86_64

Intel i7-6700 CPU (4 cores, 8 threads)

Storage on a NAS device mounted over CIFS.

INI configuration details

No response

Emulated system/software

No response

Incorrect behaviour

  1. Obtain a PlayStation 2 ISO file from a game that was released on DVD.

  2. Convert the DVD ISO file to a CHD file:

    chdman createdvd -i GameNameHere.iso -o GameNameHere.chd

  3. Try to verify the resulting CHD file:

    chdman verify -i GameNameHere.chd

What happens:

Verification fails on some of the CHD files created this way, and chdman has even segfaulted while verifying. But the behavior is always consistent for the same .CHD file.

Expected behaviour

It should be able to verify any CHD created via createdvd. It shouldn't crash either.

Steps to reproduce

No response

Additional details

If I were to guess, I would think createdvd isn't deterministic. There might be some uninitialized data that sometimes gets written incorrectly to the CHD file. Or maybe the file seek/write/sync/flush isn't done correctly. Or perhaps it's related to running the compression in parallel on multiple CPU cores.

Or the verify code is buggy, as it should never segfault.

Given the random nature of this, it could also be that my hardware is somehow defective. That seems unlikely, though, since chdman correctly verified many other CHD files and only had trouble with a few of the ones I had created using createdvd. The CHD files generated from CD bin/cue files worked just fine.

I know debugging something random like this is a nightmare. I'm sorry about it.


As a guess after some more thought, it could be related to race conditions in the createdvd code when running in parallel. Since the storage is a spinning disk in a network device (a Synology NAS with limited CPU power), the bottleneck is probably the I/O, and some of the worker threads might not run with the expected timing due to I/O delays. If that is the case, moving the ISO to /tmp (a RAM drive) and creating the CHD there should avoid the I/O issues. It is still a mystery to me, and testing this takes quite some time.
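Something along these lines is what I have in mind for that test (a rough sketch; the file name is a placeholder):

# copy the ISO to the RAM-backed /tmp and do the whole conversion there
cp /mnt/Images/AnyLargePS2DVD.iso /tmp/
chdman createdvd -i /tmp/AnyLargePS2DVD.iso -o /tmp/AnyLargePS2DVD.chd
chdman verify -i /tmp/AnyLargePS2DVD.chd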

denilsonsa commented 1 month ago

To check whether the issue was related to multi-threading, I tried creating a CHD file three times with the default number of processors and three more times with a single thread:

for a in 1 2 3 ; do chdman createdvd -i AnyLargePS2DVD.iso -o multi-$a.chd ; done
for a in 1 2 3 ; do chdman createdvd -i AnyLargePS2DVD.iso -np 1 -o single-$a.chd ; done

Both the source and the destination were in the same folder, a CIFS-mounted shared folder from my NAS. The result wasn't good:

$ md5sum multi-* single-*
ea3bc4e3ad10ca387e7ed734501d476a  multi-1.chd
5ab5db891572d1301aa41b3b2624684e  multi-2.chd
ea3bc4e3ad10ca387e7ed734501d476a  multi-3.chd
c610c5723decf413b295b132d51e5ae3  single-1.chd
b2b54cdcd115663df0dbd27ac9afd3c0  single-2.chd
945f851408e0c87df6ad9fe816cd87f9  single-3.chd

Then I tried the same approach, but with the source .ISO file on the CIFS mount and the destination .CHD file on the /tmp RAM drive. The result was promising:

$ md5sum /tmp/multi-* /tmp/single-*
ea3bc4e3ad10ca387e7ed734501d476a  /tmp/multi-1.chd
ea3bc4e3ad10ca387e7ed734501d476a  /tmp/multi-2.chd
ea3bc4e3ad10ca387e7ed734501d476a  /tmp/multi-3.chd
ea3bc4e3ad10ca387e7ed734501d476a  /tmp/single-1.chd
ea3bc4e3ad10ca387e7ed734501d476a  /tmp/single-2.chd
ea3bc4e3ad10ca387e7ed734501d476a  /tmp/single-3.chd

I also tried the two other combinations of CIFS and RAM as either source or destination. Here are the results:

                    CIFS destination    RAM drive destination
CIFS source         inconsistent        perfect
RAM drive source    inconsistent        perfect
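For reference, all four combinations can be run back to back with nested loops, roughly like this (a sketch, assuming a copy of the ISO exists both on the CIFS mount and in /tmp):

# three conversions per source/destination combination, then compare checksums
for src in /mnt/Images /tmp ; do
    for dst in /mnt/Images /tmp ; do
        for a in 1 2 3 ; do
            chdman createdvd -i "$src/AnyLargePS2DVD.iso" -o "$dst/combo-$(basename "$src")-$(basename "$dst")-$a.chd"
        done
    done
done
md5sum /mnt/Images/combo-*.chd /tmp/combo-*.chd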

This makes me believe the process of writing the CHD file has some assumption that isn't working here. Maybe it tries to write and read and seek too quickly, or without flushing the changes. I don't know, I haven't looked at the code.
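To take chdman out of the picture entirely, a stand-alone test along these lines could exercise seek-heavy writes on the CIFS mount (a sketch I haven't run yet; paths are my mount points):

# generate 64 MiB of random data locally
dd if=/dev/urandom of=/tmp/blocks.bin bs=1M count=64
# write the same blocks to a CIFS file and to a local file, in reverse order,
# so that every write is preceded by a seek
for target in /mnt/Images/seektest.bin /tmp/seektest.bin ; do
    truncate -s 64M "$target"
    for i in $(seq 63 -1 0) ; do
        dd if=/tmp/blocks.bin of="$target" bs=1M count=1 skip=$i seek=$i conv=notrunc status=none
    done
done
# all three checksums should match; a mismatch would point at the filesystem layer
md5sum /tmp/blocks.bin /mnt/Images/seektest.bin /tmp/seektest.bin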

Given the recent findings, I'm updating this issue title from "chdman can't verify CHD files created by createdvd" to "chdman createdvd writes corrupt CHD files on CIFS storage".


One more note: I know I have created many .CHD files from CUE/BIN CD-based images, and they all verified correctly. This makes me believe the issue is only with createdvd, and not with createcd.

mamehaze commented 1 month ago

have you ensured this isn't a bad RAM / thermal issue? I can believe there could be issues, but typically on machines where I've seen CHD creation outright produce files that don't validate, it's been because of bad RAM or a CPU overheating creating corrupt compressed data. The process is intensive and will swamp your CPU, pushing all the cores to their limit, which for larger files like DVDs could be too much in some cases.

denilsonsa commented 1 month ago

It is unlikely to be a thermal or RAM issue, because:

I cannot 100% rule out the possibility of faulty hardware on my side (after all, hardware fails), but in this case I find it very unlikely.

If you have any specific scenario you want me to try (or variations of command-line parameters), please tell me, and I can try running them later.

MooglyGuy commented 1 month ago

If you have any specific scenario you want me to try (or variations of command-line parameters), please tell me

Sure. Reproduce the issue on a different machine, preferably AMD or an earlier generation of Intel.

denilsonsa commented 1 month ago

On the same machine, I tried converting a 600MB Sega CD CUE+BIN image to CHD. I used these commands:

for a in 1 2 3 4 5 6 7 8 9 ; do chdman createcd -i Foobar.cue -o multi-$a.chd ; done
for a in 1 2 3 4 5 6 7 8 9 ; do chdman createcd -i Foobar.cue -np 1 -o single-$a.chd ; done

Essentially the same commands as before, but with createcd instead of createdvd, again reading from CIFS and writing to CIFS. The result was interesting:

$ md5sum *.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-1.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-2.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-3.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-4.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-5.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-6.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-7.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-8.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-9.chd
7f50d49392430d60a08805e61de64ef5  single-1.chd
cb279c1d5ad9d870d1d7afe23e370063  single-2.chd
4e37f3a44ceb55a58705b27a78c21eba  single-3.chd
cb279c1d5ad9d870d1d7afe23e370063  single-4.chd
2dcc5d096dff53aba89c2768546afafb  single-5.chd
8afd362061362566be291a49479f10cf  single-6.chd
cb279c1d5ad9d870d1d7afe23e370063  single-7.chd
cb279c1d5ad9d870d1d7afe23e370063  single-8.chd
57ca0894cddde1102a238da5aafa8f38  single-9.chd

To make sure overheating wasn't the issue, I repeated the test while setting the CPU power profile to "Power Save" instead of "Balanced", which limits the clock speed to 900MHz (instead of roughly 2-3GHz). The CPU stayed at around 45°C this time (instead of over 80°C). This took significantly longer.

$ md5sum *.chd
1c2107afcbdf56bca8a63db6c10bbd21  multi-1.chd
7e3c12a8c919556d6fc164e430b47862  multi-2.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-3.chd
8477cca6863ceba46f59e7d6cc711c45  multi-4.chd
5cf98a8b3cc23cafdca317bf7b6fe27b  multi-5.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-6.chd
7c5a1639db4302dcf2375ed7b177df38  multi-7.chd
c27de612808b53c4ca438d275925cf71  multi-8.chd
cb279c1d5ad9d870d1d7afe23e370063  single-1.chd
cb279c1d5ad9d870d1d7afe23e370063  single-2.chd
f9cca43a3518d718abf5e859bb2e4fc2  single-3.chd
34ddf2875fed771079ad8d1f3b77bbc9  single-4.chd
9e1d8bf395261cba01258623f541149d  single-5.chd
fe481d7eda3cace16f8e2aa2884eaa63  single-6.chd
e756a67d96853a3f5d4ab713edad5f04  single-7.chd

Then I tried the test again, but this time reading from CIFS and writing to /tmp/ (i.e. to a RAM drive).

$ md5sum /tmp/*.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-1.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-2.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-3.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-4.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-5.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-6.chd
cb279c1d5ad9d870d1d7afe23e370063  single-1.chd
cb279c1d5ad9d870d1d7afe23e370063  single-2.chd
cb279c1d5ad9d870d1d7afe23e370063  single-3.chd
cb279c1d5ad9d870d1d7afe23e370063  single-4.chd
cb279c1d5ad9d870d1d7afe23e370063  single-5.chd
cb279c1d5ad9d870d1d7afe23e370063  single-6.chd

Then I repeated the same test on the Steam Deck: same LAN, same CIFS/Samba server, same folder (EDIT: the Steam Deck was tested both over Wi-Fi and over wired Ethernet).

$ md5sum *.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-1.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-2.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-3.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-4.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-5.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-6.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-7.chd
cb279c1d5ad9d870d1d7afe23e370063  deckmulti-8.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-1.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-2.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-3.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-4.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-5.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-6.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-7.chd
cb279c1d5ad9d870d1d7afe23e370063  decksingle-8.chd

Both mounted the CIFS folder using this entry on the /etc/fstab:

//192.168.12.34/Images  /mnt/Images  cifs  credentials=/etc/fstab-cifs-credentials,rw,uid=1000,gid=1000,nobrl  0  0

Slightly different versions:

I'll do even more tests and post the results as I collect them.

denilsonsa commented 1 month ago

This time I repeated the tests on the same Intel-based machine, but using the same flatpak version of chdman that I used on the Steam Deck. I kept the power profile at "Power Save". Here are the results of createcd from the Sega CD image:

$ md5sum *.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-1.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-2.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-3.chd
3c41d4b17560b17a3bcfb2b2a8653390  multi-4.chd
1b0ab0f3c07b401b0e12afc2d093ca3f  multi-5.chd
0d8e393115cd09c46b0e8d94495b09ff  multi-6.chd
a80ef3705fdcd2169a3cbc4639c4f6e7  multi-7.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-8.chd
cb279c1d5ad9d870d1d7afe23e370063  single-1.chd
48507f59b9dd73fd1463e66af7983b4e  single-2.chd
79ee00e37c43cf6879530e54d4a477cc  single-3.chd
cb279c1d5ad9d870d1d7afe23e370063  single-4.chd
cb279c1d5ad9d870d1d7afe23e370063  single-5.chd
d9e540c1223c287b120df046f8b28e02  single-6.chd
7edb1555d284ae7f56478e43b78b6aef  single-7.chd
dcec42f4edaf30d3c2eeda723b5b8157  single-8.chd

I also used the Steam Deck to run createdvd on a PlayStation 2 ISO file. Here are the results:

$ md5sum *.chd
ea3bc4e3ad10ca387e7ed734501d476a  deckmulti-1.chd
ea3bc4e3ad10ca387e7ed734501d476a  deckmulti-2.chd
ea3bc4e3ad10ca387e7ed734501d476a  deckmulti-3.chd
ea3bc4e3ad10ca387e7ed734501d476a  deckmulti-4.chd
ea3bc4e3ad10ca387e7ed734501d476a  deckmulti-5.chd
ea3bc4e3ad10ca387e7ed734501d476a  deckmulti-6.chd
ea3bc4e3ad10ca387e7ed734501d476a  decksingle-1.chd
ea3bc4e3ad10ca387e7ed734501d476a  decksingle-2.chd
ea3bc4e3ad10ca387e7ed734501d476a  decksingle-3.chd
ea3bc4e3ad10ca387e7ed734501d476a  decksingle-4.chd
ea3bc4e3ad10ca387e7ed734501d476a  decksingle-5.chd
ea3bc4e3ad10ca387e7ed734501d476a  decksingle-6.chd

Could it be that writing to the CIFS folder is buggy? I then tried copying a single CHD file from /tmp/ to the CIFS folder multiple times, and every time the file copied perfectly. So sequential (i.e. bulk) writes to the CIFS folder still work fine (which makes sense, as I've never noticed any issue anywhere else), but the way chdman writes data triggers this buggy behavior.
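Roughly what I did for that copy test (the file name is a placeholder):

for a in 1 2 3 ; do cp /tmp/known-good.chd /mnt/Images/copytest-$a.chd ; done
md5sum /tmp/known-good.chd /mnt/Images/copytest-*.chd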


Summary so far:

mamehaze commented 1 month ago

have you tried other (non-chdman) cases with CIFS and large files, under load, with lots of appending of additional data etc.? Maybe the OS handling of CIFS itself is what is buggy?

chdman itself isn't doing anything special in those cases, and the CIFS mount seems to be the common factor.
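for example, something like fio could hammer the mount with random writes and verify the data afterwards, with chdman not involved at all (parameters here are only an illustration):

fio --name=cifs-randwrite --directory=/mnt/Images --size=1G \
    --rw=randwrite --bs=64k --ioengine=psync --verify=crc32c --do_verify=1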

wilbertpol commented 1 month ago

I am not an expert on CIFS mounts, but other things to try may be not using nobrl and/or using cache=none on the CIFS mount.

It might be that the combination of how chdman persists data and the nobrl option is causing the problems that you are experiencing.
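For a quick test without editing fstab, a manual remount with the same credentials but without nobrl and with cache=none could look roughly like this, based on the fstab entry quoted above:

sudo umount /mnt/Images
sudo mount -t cifs //192.168.12.34/Images /mnt/Images \
    -o credentials=/etc/fstab-cifs-credentials,rw,uid=1000,gid=1000,cache=none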

denilsonsa commented 1 month ago

I tried remounting with cache=none and re-doing the test. Here are the results:

$ md5sum *.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-1.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-2.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-3.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-4.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-5.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-6.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-7.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-8.chd
cb279c1d5ad9d870d1d7afe23e370063  multi-9.chd

So, indeed caching sounds like it could be the cause.
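If cache=none turns out to be a reliable workaround, it could be made permanent by adding it to the existing fstab entry (keeping the other options as they are), e.g.:

//192.168.12.34/Images  /mnt/Images  cifs  credentials=/etc/fstab-cifs-credentials,rw,uid=1000,gid=1000,nobrl,cache=none  0  0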