Add a `TARReader` - Githubissues

bend-n commented 1 year ago

Describe the project you are working on

A package manager that works via npm

Describe the problem or limitation you are having in your project

Cant extract tarballs natively

Describe the feature / enhancement and how it helps to overcome the problem or limitation

TARReader as a sibling of ZIPReader

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

var reader := TARReader.new()
var err := reader.open("user://archive.tgz")
if err != OK:
    return err
for f in reader.get_files():
    var contents = reader.open(f)
    FileAcess.write_file(ProjectSettings.globalize_path("res://out_dir".plus_file(f)), contents)
reader.close()
return OK

If this enhancement will not be used often, can it be worked around with a few lines of script?

var output := []
var err := OS.execute(
    "tar",
    [
        "xzf",
        ProjectSettings.globalize_path("user://archive.tgz"),
        "--directory",
            ProjectSettings.globalize_path("res://out_dir")
        ],
        output
    )

if err != 0:
    return ERR_BUG
return OK

sort of

Is there a reason why this should be core and not an add-on in the asset library?

there is a zipreader - why cant there by a tarreader? also making a gdextension for something so small seems like overkill.

YuriSizov commented 1 year ago

there is a zipreader - why cant there by a tarreader? also making a gdextension for something so small seems like overkill.

Neither of these is a good argument to add something to the core, especially something relying on a 3rd party dependency. You can indeed use the OS execute flow, or implement a native module if you need more direct integration. There is no need for this in the engine to justify this, as far as I can see.

bend-n commented 1 year ago

The OS execute flow requires that tar be on the system, usable, and accepting the same arguments. It is not reliable. As for the second part - theres nothing that i can say, except its much easier to use a core module than a native module.

YuriSizov commented 1 year ago

The OS execute flow requires that tar be on the system, usable, and accepting the same arguments. It is not reliable.

That's even more reason not to include this in core: it's not guaranteed to be present in the system, so we must bundle it with the engine. If there is not reason for it in the engine, and there is no compelling argument for it in many projects made with the engine, then it's just bloat aimed to solve an issue in your specific project. That's where plugins, modules, extensions, and custom engine builds should be used.

As for the second part - theres nothing that i can say, except its much easier to use a core module than a native module.

The ZIP module is included because the engine itself uses it. If it wasn't the case, it would be unlikely to be included (though not impossible). We don't add stuff to the core just because it would be nice to have. That's strictly against the principles on which the project is improved.

dalexeev commented 1 year ago

Indeed, there are a lot of archive formats, it makes no sense to add them all to Godot, since Godot is not an archiver program. zip is one of the most popular and widespread formats, which is supported by most archivers, operating systems, and services.

Calinou commented 1 year ago

tar is preinstalled on most Linux distributions, macOS and on Windows 10 or later. You can OS.execute() it with usually the same CLI arguments, at least for simple invocations (Linux uses GNU tar, macOS and Windows use BSD tar).

antonWetzel commented 1 year ago

The tar-reader does not interact with the engine, so maybe just include a compiled tar-reader beside the exported executable.

If you need to use the reader at multiple times in the code, a wrapper script would abstract the underlaying implementation

# not tested, just as a concept

class_name TARReader
extends RefCounted # if we want to mirror the behavior of FileAccess

var path: String = ""
var temporary: bool = false

func open(file_path: String, out_dir: String = "") -> Error # maybe '-> int' is correct
     var err := FileAccess.exist(file_path)
     if err != OK:
         return err
     if out_dir.is_empty():
         out_dir= "user://" + tar_path + "/" # I hope this is a valid directory
         temporary = true
     err = OS.execute(
    "tar",
    [
        "xzf",
        ProjectSettings.globalize_path(file_path),
        "--directory",
        ProjectSettings.globalize_path(out_dir),
    ],
    output,
    )

    if err != OK:
        return err

    path = out_path
    return OK

func get_files() -> PackedStringArray:
    var dir_access = DirAccess.open(path)
    return dir_access.get_files()

func _notification(what: int) -> void:
    match what:
        NOTIFICATION_PREDELETE:
            if temporary:
                DirAccess.remove_absolute(path)

fire commented 1 year ago

To make this work in godot core, we'd need to find a library that implements tar and add it to godot. The library implementing tar should be small and MIT-license compatible.

TARReader cannot be an external executable to be mergable in core.

I haven't researched but not sure the difficult of the tar format to implement since it was made decades ago and it doesn't do compression. It probably is limited in code complexity.

https://github.com/rxi/microtar

Your team might be able to create a native C++ module, but the problem is that as far as I can tell you want this to work on plain Godot Engine releases which won't be compatible.

Using a c++ gdextension creates a chicken and egg problem where you need a loader to load the loader.

bend-n commented 1 year ago

@ifire Im not sure what you mean about the loader to load the loader, does gdextension require a non builtin loader?

fire commented 1 year ago

You'll need to fetch the gdextension via an asset library or via a zip http download. You can't store it as a .tar.gz

bend-n commented 1 year ago

@fire you cant just include the gdextension in the plugin? I see. tis quite the conundrum then.

lyuma commented 1 year ago

The problem with .tgz as a format is it's compressed entirely, not per-file, so you cannot random access files. I'm a little unsure how the API deals with the decompression part of TARReader

In case you're interested, I also have a pure-GDScript implementation here for .unitypackage import since those are .tar.gz internally https://github.com/V-Sekai/unidot_importer/blob/main/tarfile.gd

Unfortunately Godot does not currently have streamable decompression APIs, so I decompress to memory for small files, but large files are forced to use the tar command. This code is here: https://github.com/V-Sekai/unidot_importer/blob/main/unitypackagefile.gd#L47 (Note the whole situation with Windows and needing a --force-local due to archaic tar command behavior and Windows paths containing colons.)

bend-n commented 1 year ago

@lyuma it would seem that more people than one would benefit from the TARReader, then? Quite nice though, will look into it.

lyuma commented 1 year ago

Possibly, but it's very easy to write your own.

The question that I would need an answer to is how the implementation of TARReader would actually work.

for f in reader.get_files():
    var contents = reader.open(f)

If you did this, it could be very inefficient for compressed tar files.

lyuma commented 1 year ago

The only options I would see that make sense are

Don't allow compressed tar files (this might be fine for Godot since resources and .pck are compressed anyway)
Use a different API, so instead of reader.open(f), it just gives all files one by one, in whatever order they are on disk.
Use an API like what I did in Unidot so the entire tar file gets decompressed into memory, and it just crashes randomly or something if you open a large tar file. (But I don't like this idea)
Decompress to a temporary file on disk, (But please be mindful of which drive you use to store decompressed files or expect it to fail on my computer when I tar up my whole project)
Any other ideas?

bend-n commented 1 year ago

microtar has the open() function... Although it doesnt have a list files function...

I like your second idea though, so would it be like,

var tar = TARReader.new()
for f in tar.open("res://file.tar"):
   print("contents of %s: %s" % f)

?

lyuma commented 1 year ago

To be clear, this isn't a question of API, but a question of how the compressed format itself works.

The entire tar archive is compressed all as a single blob, so accessing a file at the end requires reading through and decompressing the whole file again, for every file you open.

microtar doesn't support compressed tar files, so they don't have to worry about this. This issue is specifically a problem with the compressed tar files (.tgz)

SlugFiller commented 1 year ago

tgz is a synonym for .tar.gz, which is just a .tar file compressed with GZip. Godot already contains native support for GZip decompression, through PackedByteArray.decompress with compression_mode set to COMPRESSION_GZIP.

That means you only need to extract the uncompressed tar file. This is actually trivial, since files inside a tar are stored flat, without so much as being divided into packets, so it's simply a matter of reading the header to get the file's size (and, hence, location of the next file), then using PackedByteArray.slice

So, TL;DR it's possible to do this in a few lines of GDScript without using any external programs. There's absolutely no reason to add any functionality in the core for this.

bend-n commented 1 year ago

It would be trivial to add a core TARReader, then, if its so trivial to do in GDScript, which would garner much more performance?

Calinou commented 1 year ago

It would be trivial to add a core TARReader, then, if its so trivial to do in GDScript, which would garner much more performance?

The decompression (where most CPU time is spent) is performed in C++, while the glue code reading the header isn't performance-critical.

bend-n commented 1 year ago

@SlugFiller Could you send a working example of this gdscript tarreader?

SlugFiller commented 1 year ago

@bend-n Godot 3 version:

func untargz(source_file: String) -> Array:
    # Read file into a buffer
    var file : File = File.new()
    if file.open(source_file, File.READ) != OK:
        return null
    var buffer : PoolByteArray = file.get_buffer(file.get_len())
    file.close()

    # UnGZip
    buffer = buffer.decompress_dynamic(-1, PoolByteArray.COMPRESSION_GZIP)

    # Untar
    var ret : Array
    var offset : int
    while offset+512 < buffer.size():
        # Get filesize from header
        var filesize : int = _octal_number_get(buffer.slice(offset+124))
        if filesize < 0 || offset+512+filesize > buffer.size():
            break
        if filesize == 0:
            # Check for two empty records marking the end of the archive
            # This is a relatively expensive operation (for GDScript),
            # but will generally happen only once in the whole archive
            if _check_all_zeros(buffer.slice(offset, offset+512)):
                if offset+1024 > buffer.size() || !_check_all_zeros(buffer.slice(offset+512, offset+1024)):
                    offset += 512
                    continue
                break
        var entry : FileEntry = FileEntry.new()
        # Get filename from header
        entry.name = _trim_null(buffer.slice(offset, offset+100)).get_string_from_utf8()
        # File contents come immediately after the header
        entry.contents = buffer.slice(offset+512, offset+512+filesize)
        ret.push_back(entry)
        # Round to the nearest 512 byte block
        offset += 512+filesize+(511-((filesize+511) & 0x1FF))

    return ret

func _octal_number_get(input: PoolByteArray) -> int:
    var output : int = 0
    for offset in input.size():
        if input[offset] == 0:
            return output
        if input[offset] < 48 || input[offset] > 55:
            return -1
        output *= 8
        output += input[offset]-48
    return output

func _trim_null(input: PoolByteArray) -> PoolByteArray:
    var end : int = input.size()
    while end > 0 && input[end-1] == 0:
        end -= 1
    return input.slice(0, end)

func _check_all_zeros(input: PoolByteArray) -> bool:
    for offset in input.size():
        if input[offset] != 0:
            return false
    return true

class FileEntry:
    var name : String
    var contents : PoolByteArray

Godot 4 version:

func untargz(source_file: String) -> Array[FileEntry]:
    # Read file into a buffer
    var file : FileAccess = FileAccess.open(source_file, FileAccess.READ)
    if file == null:
        return null
    var buffer : PackedByteArray = file.get_buffer(file.get_length())
    file = null

    # UnGZip
    buffer = buffer.decompress_dynamic(-1, PackedByteArray.COMPRESSION_GZIP)

    var ret : Array[FileEntry]
    var offset : int
    while offset+512 < buffer.size():
        # Get filesize from header
        var filesize : int = _octal_number_get(buffer.slice(offset+124))
        if filesize < 0 || offset+512+filesize > buffer.size():
            break
        if filesize == 0:
            # Check for two empty records marking the end of the archive
            # This is a relatively expensive operation (for GDScript),
            # but will generally happen only once in the whole archive
            if _check_all_zeros(buffer.slice(offset, offset+512)):
                if offset+1024 > buffer.size() || !_check_all_zeros(buffer.slice(offset+512, offset+1024)):
                    offset += 512
                    continue
                break
        var entry : FileEntry = FileEntry.new()
        # Get filename from header
        entry.name = _trim_null(buffer.slice(offset, offset+100)).get_string_from_utf8()
        # File contents come immediately after the header
        entry.contents = buffer.slice(offset+512, offset+512+filesize)
        ret.push_back(entry)
        # Round to the nearest 512 byte block
        offset += 512+filesize+(511-((filesize+511) & 0x1FF))

    return ret

func _octal_number_get(input: PackedByteArray) -> int:
    var output : int = 0
    for offset in input.size():
        if input[offset] == 0:
            return output
        if input[offset] < 48 || input[offset] > 55:
            return -1
        output *= 8
        output += input[offset]-48
    return output

func _trim_null(input: PackedByteArray) -> PackedByteArray:
    var end : int = input.size()
    while end > 0 && input[end-1] == 0:
        end -= 1
    return input.slice(0, end)

func _check_all_zeros(input: PackedByteArray) -> bool:
    for offset in input.size():
        if input[offset] != 0:
            return false
    return true

class FileEntry:
    var name : String
    var contents : PackedByteArray

godotengine / godot-proposals

Add a `TARReader` #6089

Describe the project you are working on

Describe the problem or limitation you are having in your project

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

If this enhancement will not be used often, can it be worked around with a few lines of script?

Is there a reason why this should be core and not an add-on in the asset library?