NBT does not use UTF-8, it's MUTF-8.

TkTech commented 3 years ago

NBT uses MUTF-8, not UTF-8. Valid game-generated files will result in UnicodeDecodeErrors when using Twoolie's NBT. Minimal reproduction file with an embedded MUTF-8 NULL: encoded.dat.gz

I'd normally send you a PR to use my MUTF-8 encoder, but being dependency-free seems to be a project goal. There's a pure-python version in there you can just copy.

@1dt

macfreek commented 3 years ago

@TkTech Thanks for the report! And great suggestion; I would support a fix. (Unfortunately, I'm not actively maintaining this package anymore, so I will not do it myself, at least not anytime soon, I'm afraid.)

Interesting topic, I thought I'd seen it all after the different line-endings, different normalizations in UTF, and the BOM-or-no-BOM. Another variant I was not aware of. Seems like one of the original Java programmers had a field day torturing the original UTF-8 and UTF-16 specs. Alas, that's what we have to deal with.

[Edit, seems I was wrong at first]

Just for the record, am I correct to assume that:

This is the format implemented by yourself (code) and py2jdbc (code)
The format is the one described by https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html and https://py2jdbc.readthedocs.io/en/latest/mutf8.html (the docs for py2jdbc)

And if I'm correct, there are two differences between this format and UTF-8:

It encodes U+0000 in 2 bytes instead of in 1 byte (also as described in the rejected Python issue 2857)
It encodes code points outside the Basic Multilingual Plane (that is, code points >= U+10000) as 6 bytes instead of 4 bytes, like CESU-8 does, and as described by Unicode Technical Report #26.
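The two differences listed above can be sketched with nothing but the Python 3 standard library. This is an illustration only, not the mutf8 package's implementation: the names `encode_mutf8` and `decode_mutf8` are hypothetical, and the decoder is deliberately lenient (it does not reject plain 4-byte UTF-8 sequences, which strict MUTF-8 does not allow).

```python
def encode_mutf8(text: str) -> bytes:
    """Encode text as MUTF-8 (hypothetical helper, Python 3 only)."""
    out = bytearray()
    # Work on UTF-16 code units so that code points >= U+10000
    # are split into a surrogate pair first.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "little")
        if unit == 0:
            # Difference 1: U+0000 becomes two bytes, never a raw 0x00.
            out += b"\xc0\x80"
        else:
            # Difference 2: each surrogate is encoded on its own as
            # 3 bytes, so a supplementary code point takes 6 bytes.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def decode_mutf8(data: bytes) -> str:
    """Decode MUTF-8 bytes (lenient sketch, assumes well-formed input)."""
    # Undo the 2-byte NULL, then let surrogates through as lone code units.
    units = data.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair back
    # into a single code point.
    return units.encode("utf-16-le", "surrogatepass").decode(
        "utf-16-le", "surrogatepass"
    )
```

For example, `encode_mutf8("\x00")` yields the two bytes `C0 80`, and a single code point such as U+1F600 yields six bytes rather than the four that UTF-8 would produce.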

TkTech commented 3 years ago

That would be correct. I know it can be confusing, especially since some of the first results you see when searching (such as on Stack Overflow) just suggest replacing NULLs, which is incorrect.
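A small sketch of why replacing NULLs alone is not enough, assuming Python 3's strict built-in UTF-8 codec. The helper names and the sample bytes are made up for illustration; the sample is the MUTF-8 form of "A", U+0000, and U+1F600.

```python
def naive_decode(data: bytes) -> str:
    """The commonly suggested workaround: fix the NULLs, then decode as UTF-8."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def naive_decode_fails(data: bytes) -> bool:
    """Return True if the workaround raises UnicodeDecodeError."""
    try:
        naive_decode(data)
    except UnicodeDecodeError:
        return True
    return False


# "A", an embedded MUTF-8 NULL, then U+1F600 as a 6-byte surrogate pair.
SAMPLE = b"A\xc0\x80\xed\xa0\xbd\xed\xb8\x80"

# The NULL is handled, but strict UTF-8 rejects the encoded surrogates.
print(naive_decode_fails(b"A\xc0\x80"))
print(naive_decode_fails(SAMPLE))
```

The workaround therefore only works for strings that contain no code points outside the Basic Multilingual Plane.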

TkTech commented 3 years ago

Since this library wants to be dependency-free and py2.7-compatible (which mutf8 is not), instead of a PR, here's a patch that anyone stumbling on this with an unreadable file can use (as long as you're on py3):

diff --git a/nbt/nbt.py b/nbt/nbt.py
index 947a65e..8f633bd 100644
--- a/nbt/nbt.py
+++ b/nbt/nbt.py
@@ -4,12 +4,13 @@ Handle the NBT (Named Binary Tag) data format
 For more information about the NBT format:
 https://minecraft.gamepedia.com/NBT_format
 """
-
 from struct import Struct, error as StructError
 from gzip import GzipFile
 from collections import MutableMapping, MutableSequence, Sequence
 import sys

+import mutf8
+
 _PY3 = sys.version_info >= (3,)
 if _PY3:
     unicode = str
@@ -353,10 +354,10 @@ class TAG_String(TAG, Sequence):
         read = buffer.read(length.value)
         if len(read) != length.value:
             raise StructError()
-        self.value = read.decode("utf-8")
+        self.value = mutf8.decode_modified_utf8(read)

     def _render_buffer(self, buffer):
-        save_val = self.value.encode("utf-8")
+        save_val = mutf8.encode_modified_utf8(self.value)
         length = TAG_Short(len(save_val))
         length._render_buffer(buffer)
         buffer.write(save_val)
diff --git a/setup.py b/setup.py
index e6a7cd5..4338408 100755
--- a/setup.py
+++ b/setup.py
@@ -13,6 +13,7 @@ setup(
   license          = open("LICENSE.txt").read(),
   long_description = open("README.txt").read(),
   packages         = ['nbt'],
+  install_requires = ['mutf8'],
   classifiers      = [
         "Development Status :: 5 - Production/Stable",
         "Intended Audience :: Developers",

ghost commented 2 years ago

Alright, so it looks like I'll need to fork the project and apply this patch. My block entity scanning script currently crashes when iterating over HermitCraft Season 7's world because of this bug.

ghost commented 2 years ago

Turns out Hermitcraft 6 has a region file which breaks mutf8. I've isolated the broken region file to be r.14.-2.mca. I'll have to see if I can isolate which block entity is causing the breakage so I can figure out why (e.g. if it's corruption or just mutf8 not being able to read it properly). Edit: It's in the overworld chunks.


ghost commented 2 years ago

Interestingly enough, the message doesn't show up in the mutf8 source code. However, I believe the message is supposed to come from line 65 of mutf8.py as it's the message with the if statement that checks for byte 0xED. Edit: Yep, it's line 65. I deleted the (cython?) binary and it gave me the exact message from source plus the line number 65.

The problem with the way exceptions are handled right now is that I have no way of finding out which chunk or block is corrupted. So I have to play "guess which block is corrupted", as I cannot get the coordinates from the stack trace. I'm about to wipe out every block that isn't bedrock or marked as axe mineable (for chests). If the error goes away, then it'll most likely be a hopper, furnace, blast furnace, etc. To make it easier to debug, it might do me well to create a datapack that includes every vanilla block that isn't a block entity and use that for wiping out blocks. It may not even be a block that's causing the exception, but I can't load the region file in https://irath96.github.io/webNBT/ or in the NBT plugin I have installed in IntelliJ IDEA. So it may do well to build an editor that can log every exception without breaking the chunk loop.
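A minimal sketch of such a loop, assuming the caller supplies the chunk coordinates and a loader function. `scan_chunks` and `load_chunk` are hypothetical names, not part of the NBT library; the loader would wrap whatever call actually parses a chunk.

```python
import logging
from typing import Any, Callable, Iterable, List, Tuple

log = logging.getLogger("chunk_scan")


def scan_chunks(
    coords: Iterable[Tuple[int, int]],
    load_chunk: Callable[[int, int], Any],
) -> List[Tuple[int, int, Exception]]:
    """Try to load every chunk and record failures instead of aborting."""
    failures = []
    for x, z in coords:
        try:
            load_chunk(x, z)
        except Exception as exc:  # broad on purpose: this is a diagnostic pass
            # Log the coordinates that the stack trace alone does not reveal.
            log.warning("chunk (%d, %d) failed: %s", x, z, exc)
            failures.append((x, z, exc))
    return failures
```

The returned list holds the coordinates of every chunk that raised, so a single unreadable string no longer hides where it lives or stops the rest of the scan.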

I should also mention that neither Minecraft nor Amulet detects a problem with loading these chunks. Even using the Optimize World feature to upgrade the region file from 1.14.4 to 1.17.1 doesn't fix the issue. So, most likely, the data is valid and mutf8 just can't handle it.


TkTech commented 2 years ago

Can you attach the region file? It may or may not be an issue with mutf8, might just be a genuinely corrupted tag.

ghost commented 2 years ago

Yes. I've located the chest causing the problem too. Turns out Docm had created a series of books with weird characters and named them Alien Tech. Even Minecraft froze for a second when I ran the /data get block ... command on the chest.

I copied the chest to every hotbar save, so you can run something like x+1 to get the chest which breaks mutf8.

Test Corruption.zip

hotbar.nbt.zip


ghost commented 2 years ago

As I may have accidentally uploaded a copy post my breaking the chest (to confirm that the problem was with the chest), here's an unedited copy of the region file. Also, here's the command which has the chest's coordinates in it. /data get block 7580 68 -976.

r.14.-2.mca.zip

Edit: If it helps, this is the scanner I'm working on (https://github.com/alexis-evelyn/WorldScanner/blob/master/scanner.py). I'm currently using the patch you provided at https://github.com/twoolie/NBT/issues/144#issuecomment-765184108. I have not uploaded the patched version of NBT yet, but can do so if you don't already have a fork that has a patch (If I add my own patches, then I can include yours too if you'd like).

Netherwhal commented 1 year ago

Still getting the same issue though:

UnicodeDecodeError: 'mutf-8' codec can't decode byte 0xed in position 630: 6-byte codepoint started, but input too short to finish.

Offroaders123 commented 3 months ago

Just wanted to stop by and say thanks for documenting this! I'm working on an NBT library as well, and MUTF-8 does have a notable difference in output for the character ranges that it handles compared to UTF-8. Having that hotbar.nbt file to test on really helped with ensuring that it works.
