godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License
91.01k stars 21.17k forks source link

Unable to get file from archive due to ZIPReader.get_files() #92462

Open ZBEP opened 5 months ago

ZBEP commented 5 months ago

Tested versions

System information

Windows 10 - Godot Engine v4.3.dev6.official.89850d553

Issue description

It is not possible to obtain the necessary files from the archive.

When using ZIPReader.get_files(), if the file names in the archive contain ANSI characters, they are replaced with �. This leads to the fact that if the archive contains files whose names differ only in ANSI characters, then only the first file can be opened. The second and subsequent archive files are not opened, instead the first file is opened again (because get_files() resolves the unique name to ����.txt).

The issue could be circumvented if it were possible to obtain the file names as a byte array. Something like ZIPReader.get_files_byte_array()

When the conditions described below are met, I expected the result to be as follows if everything worked correctly:

ТестА.txt -> Text in file 1 ТестБ.txt -> Text in file 2

Steps to reproduce

  1. Create a new project
  2. Create a user interface scene (with a Control node as root)
  3. Attach the following script to the Control
  4. Create a file ТестА.txt with the text: "Text in file 1"
  5. Create a file ТестБ.txt with the text: "Text in file 2"
  6. Add ТестА.txt and ТестБ.txt to the Test.zip archive
  7. Copy the archive to the project root
  8. Run project.
extends Control

# Called when the node enters the scene tree for the first time.
func _ready() -> void:
    var zip = ZIPReader.new()
    if zip.open("res://Test.zip") == OK:
        for file_name in zip.get_files():
            var zip_file = zip.read_file(file_name)
            printerr(file_name, " -> ", zip_file.get_string_from_utf8())

# Called every frame. 'delta' is the elapsed time since the previous frame.
func _process(delta: float) -> void:
    pass

Output:

Godot Engine v4.3.dev6.official.89850d553 - https://godotengine.org OpenGL API 3.3.0 NVIDIA 537.58 - Compatibility - Using Device: NVIDIA - NVIDIA GeForce RTX 4090 Laptop GPU

Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (92) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (a5) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 continuation byte (e1 ... e2 ...) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (80) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (92) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (a5) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 continuation byte (e1 ... e2 ...) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (81) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (92) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (a5) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 continuation byte (e1 ... e2 ...) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (80) ����.txt -> Text in file 1 Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (92) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (a5) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 continuation byte (e1 ... e2 ...) Unicode parsing error, some characters were replaced with � (U+FFFD): Invalid UTF-8 leading byte (80) ����.txt -> Text in file 1 --- Debugging process stopped ---

Minimal reproduction project (MRP)

Test_get_files.zip

akien-mga commented 5 months ago

CC @bruvzg

bruvzg commented 5 months ago

if the file names in the archive contain ANSI characters

To be specific, your file contains 0x92, 0xA5, 0xE1, 0xE2, 0x80 as a filename, it's definitely not ANSI, nor it is standard extended ASCII for Cyrillic (CP-1251), seems to be in DOS IBM866 or similar encoding.

None of unzip utils are reading filenames from this ZIP correctly, and either display empty names or same , so it's not a Godot bug. But I guess we should specify that UTF-8 is only supported encoding in the docs.

Adding raw name reading and file opening can be done, but I'm not sure if we should, it won't be frequently used and multiple file names might be confusing for new users. Pretty much everything use UTF-8 and I do not see much reason to support reading ancient DOS archives, if it's necessary it can be done with GDExtension.

ZBEP commented 5 months ago

I understand that the problem is not very common. But if the names were in ANSI, the result would not change, because get_files() would still replace the characters. I don't think get_files_byte_array() or files_to_buffer() will confuse people much. This would immediately solve all problems with all encodings (not just Cyrillic) and would allow people who encounter a problem to solve it without GDExtension. Moreover, this approach is often encountered.

String has many data conversion options and it is convenient.

PackedByteArray to_ascii_buffer()
String to_camel_case()
float to_float()
int to_int()
String to_lower()
String to_pascal_case()
String to_snake_case()
String to_upper()
PackedByteArray to_utf8_buffer()
PackedByteArray to_utf16_buffer()
PackedByteArray to_utf32_buffer()
PackedByteArray to_wchar_buffer()

How is this different from what will fix this problem?

    var zip = ZIPReader.new()
    if zip.open("res://Test.zip") == OK:
        var files_name_byte_array = zip.files_to_buffer()
        for file in files_name_byte_array :
        ...

P.s. My archive was created by the standard Windows archiving utility (Windows 10 Pro 22H2 - 19045.4412). 7z creates the same archive.

bruvzg commented 5 months ago

But if the names were in ANSI, the result would not change, because get_files() would still replace the characters.

There's no such thing as ANSI encoding (at least not a well-defined one, it's used to reference different encodings), ASCII is fully compatible with UTF-8.

String has many data conversion options and it is convenient.

These methods are for well-defined encodings.

In any case, it's easy to implement, so I have opened this PR - https://github.com/godotengine/godot/pull/92484

ZBEP commented 5 months ago

In any case, it's easy to implement, so I have opened this PR - #92484

Thank you very much! This will help me a lot.