hexops / mach

zig game engine & graphics toolkit
https://machengine.org

Archive / pack file support #864

Open slimsag opened 1 year ago

slimsag commented 1 year ago

> Hey, I'm looking to dig around and hope to play with a from-scratch 3D project using mach-core so that I can learn the codebase and hopefully contribute. Are there any plans to include a gamedev-oriented file system, e.g. a wrapper for PhysFS, in mach-core or as a related module?

Yes, I'd like to have something like this. No immediate plans to work on it (we have higher-priority things before then), but when we do, it would be written in Zig (rather than, say, bindings to PhysFS) and the design may be a bit different. It wouldn't go in mach-core, since mach-core aims to be very minimal: just window+input+GPU. Rather, it'd go in a separate package/module somewhere else.

> Cool. I imagine it's not something you've thought about a whole lot yet, but what sorts of design considerations/changes would you consider important? I need something of the sort, and I'd love for at least a little of the work involved to be cannibalizable for mach when the time comes.

Good question; I definitely haven't thought about it super in-depth, but I could start doing so if you're eager to work on something like this in the mach codebase (which I'd be happy to have). On the surface, I see a few things here:

  1. There's the overlap (and consistent experience) we'd want to provide between desktop/mobile/web. Obviously we wouldn't need to support all three of those targets initially, but whatever API we come up with would eventually need to support them reasonably:
    • for desktop, the primary concern would be performance
    • for mobile, the primary concern would be 'after installation' downloads/patching
    • for web, the primary concern would be compatibility with HTTP (byte range requests are possible, so the format may need to be compatible with those in some form)
  2. There is increasing support from graphics hardware manufacturers for letting GPUs directly consume data from disk: for example DirectStorage with D3D, and a similar Vulkan extension which AMD released recently.
    • The advice/approach there is: store your files in GDeflate format (similar to deflate compression), and then you can yeet the data over to the GPU and it can handle decoding the file. But that requires your data to be available in a format you can yeet to the GPU like that. I think that means a pack file which looks like:
    • metadata (describes what's inside)
    • a bunch of independent regions (file bytes) which are compressed with GDeflate or zstd depending on what's inside
  3. The third concern is making it nice to develop/work with, either by using the native filesystem when developing (instead of a pack file/archive), or by having nice tooling to interact with it somehow.

So, stepping back from those high-level thoughts, I think the first implementation of this could be: a library which implements an archive file format with metadata first and zstd-compressed file regions afterwards, plus some tooling to create it and develop/work with it nicely.
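To make that concrete, here's a minimal Zig sketch of what such a format's metadata could look like. Every name, field width, and the magic bytes below are placeholders invented for illustration, not a committed design:

```zig
// Hypothetical compression tags; zstd for CPU-decoded assets, GDeflate
// for data meant to be decoded directly on the GPU (point 2 above).
const Compression = enum(u8) {
    none,
    zstd,
    gdeflate,
};

// Fixed-size header stored at the start of the archive, so tools can
// read all metadata without touching the file regions that follow.
const Header = struct {
    magic: [4]u8, // e.g. "MPCK" (placeholder signature)
    version: u32,
    entry_count: u32,
};

// One metadata entry per stored file; the independently compressed
// regions it points into come after the metadata section.
const Entry = struct {
    name_len: u16, // name bytes follow the fixed fields on disk
    compression: Compression,
    data_offset: u64, // where this file's compressed region starts
    compressed_len: u64,
    uncompressed_len: u64,
};
```

Keeping each region independently compressed is what lets byte-range requests on the web target and direct GPU consumption via GDeflate work from the same file.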


Once such a thing exists, an 'overlay' concept would be very nice to have: you load one or more archives into a sort of PATH analog, which makes modding and patching completely transparent. I'd very much like that as a way to support modding nicely.

spindlebink commented 1 year ago

There are a few layers of a game-dev-oriented filesystem that I'm interested in exploring. This issue addresses one of the foundational layers, an archive file format, but here's what I'm hoping to go for in the medium/long term. Feedback is welcome.

As far as the archive format itself goes, I've prototyped a writer (but no reader yet) for a simple archive format I'm tentatively calling mach-pck, since .pck is commonly used as a generic archive extension. I'd appreciate better/more specific name suggestions.

The format is just a header followed by a list of data blocks, each of which reports its name, checksum, compression mode, and the byte range in the file at which its data can be found. The writer currently concatenates all file data into one big blob at the end of the file, but since each block just records the range where its data lives, it's also possible to interlace file data with descriptors, which would make archiving files sequentially easier and possibly use less memory.
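For reference, the per-block descriptor described above could be sketched in Zig roughly like this (field names and types are illustrative; the actual mach-pck prototype may differ):

```zig
// Hypothetical shape of one data-block descriptor. Each block is
// self-describing, so file bytes can live in one big blob at the end
// of the archive or be interlaced between the descriptors.
const DataBlock = struct {
    name: []const u8, // path of the file within the archive
    checksum: u32, // e.g. a CRC32 of the stored bytes (assumed)
    compression: enum(u8) { none, deflate, zstd },
    data_start: u64, // byte range in the archive file at which
    data_len: u64, // this block's data can be found
};
```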

I'm currently working on a reader for the format and a CLI tool to pack files into it. For most compression methods, file data will need to be compressed up front when building the archive, but I'll just use the stdlib's decompression routines when reading.

slimsag commented 1 year ago

@spindlebink in general this sounds like a good direction to me, and the file format sounds right as well, but the devil is in the details. I'd encourage sending these changes to the main repo one at a time, very incrementally, so we can start to integrate them and make sure we see eye to eye as we go, so to speak. I also imagine the CLI can be part of the mach editor CLI here: https://github.com/hexops/mach/tree/main/src/editor

Does that sound like a good starting point?


Some other thoughts:

  1. I saw recently that others in the industry (Crash Bandicoot) are using GDeflate for GPU-readable assets, and LZ4 compression for CPU-bound assets because, although LZ4 has a worse compression ratio than e.g. zstd, it has better decompression throughput.
  2. I saw this, which shed light on some interesting tradeoffs between basisu/KTX2 textures and GDeflate (in short: if we use KTX2, we must decode textures on the CPU before handing them to the GPU anyway). I still think we should just use Basis/KTX2 for textures for now.

So I think the ideal end state for files stored in a .pck file would be:

The only question would be whether we employ some compression for the 'header' of the file with all the metadata. I think this is probably a good reason to keep the metadata at the start of the file, and not scattered, as it keeps that option open.

spindlebink commented 1 year ago

The tag for the decompression method is per file block, just an enum(u8), which means adding decompression methods for specific endpoints only requires implementing them in the reader and adding an enum member. Designing a file import workflow is important, so I'll open a new issue for that. I'm trying to keep the archive format generalized, since design work there will reflect on design work here.

> whether we employ some compression for the header of the file

Right now, the header contains no structural information about the file other than the total length and the interlacing mode, which is necessary for the reader to know where to start reading the next block [*]. Directly following the header are individual data blocks prefixed by a block type (enum(u8) again), which is where the archive stores file info.

```
header: signature version block_mode body_len
file block: filename range [...]
file block: filename range [...]
file block: filename range [...]
big binary blob: ............
```

It's easy to add a `CompressedBlockRange` block indicating a range to decompress and read as a list of blocks:

```
header: signature version block_mode body_len
compressed block range: compression_mode range
    (when uncompressed)
    file block: filename range [...]
    file block: filename range [...]
file block: filename range [...]
big binary blob: ............
```

Then, when the reader encounters a compressed block range block type, it slices out the indicated range, decompresses it, and uses the same archive-reading routines to parse the blocks inside it.
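A rough sketch of that dispatch in Zig (all names are hypothetical, and payload parsing is elided, so this only shows the control flow):

```zig
const std = @import("std");

// Hypothetical block tags. Marking the enum non-exhaustive (`_`) lets
// the reader detect tags written by a newer format version.
const BlockType = enum(u8) {
    file = 0,
    compressed_block_range = 1,
    _,
};

// Returns anyerror because an inferred error set can't be used by a
// function that calls itself.
fn readBlocks(allocator: std.mem.Allocator, bytes: []const u8) anyerror!void {
    var offset: usize = 0;
    while (offset < bytes.len) {
        const tag: BlockType = @enumFromInt(bytes[offset]);
        offset += 1;
        switch (tag) {
            .file => {
                // Parse filename/checksum/range here (elided); a real
                // reader would advance offset past the block.
                offset = bytes.len;
            },
            .compressed_block_range => {
                // Slice out the indicated range, decompress it, and
                // feed the result back through the same routine.
                const start: usize = 0; // parsed from the block in reality
                const end: usize = 0;
                const inner = try decompressAlloc(allocator, bytes[start..end]);
                defer allocator.free(inner);
                try readBlocks(allocator, inner);
                offset = bytes.len;
            },
            _ => return error.UnknownBlockType,
        }
    }
}

// Stand-in for whatever decompressor the reader ends up using.
fn decompressAlloc(allocator: std.mem.Allocator, data: []const u8) ![]u8 {
    return allocator.dupe(u8, data);
}
```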


A concern: each file needs some information provided when packing it (i.e. compression mode, file name relative to the archive if it needs to be different from the cwd path, and more if the import pipeline needs it). Specifying that information over the CLI every time the user builds an archive means that the CLI will become unwieldy for anything of scale.

It could be cool to include archive manifest information in build.zig as part of the build step, but A) this possibly implies repacking every archive on every build, and B) it makes it more of a pain to build archives outside of the game source tree (e.g. for asset mods, and more generally for keeping the artist workflow distinct from the code workflow).

My proposal instead is that the archive CLI rely on a manifest file which is passed to the CLI when packing. I don't currently see a way to parse .zon via the Zig standard library, so in the meantime I'll use INI or JSON or something.
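For illustration, a JSON manifest could deserialize straight into a schema struct via std.json; the schema below is entirely hypothetical, and .zon parsing could replace this if the stdlib grows it:

```zig
const std = @import("std");

// Hypothetical manifest schema; every field name here is illustrative.
const ManifestEntry = struct {
    path: []const u8, // source path on disk
    name: ?[]const u8 = null, // name inside the archive, if different
    compression: []const u8 = "zstd",
};

const Manifest = struct {
    output: []const u8,
    files: []ManifestEntry,
};

// Deserialize straight into the schema struct; defaults apply when a
// field is omitted. Caller frees the result with .deinit().
fn loadManifest(
    allocator: std.mem.Allocator,
    json_text: []const u8,
) !std.json.Parsed(Manifest) {
    return std.json.parseFromSlice(Manifest, allocator, json_text, .{});
}
```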


[*] It occurs to me that including the offset of the next block in each block would do away with needing to differentiate between interlaced and non-interlaced modes entirely, and would also help with both version-incompatibility warnings and validation. There might be some issues there I'm not thinking of, though, so I'll think on it.

slimsag commented 1 year ago

> Right now, the header contains no structural information about the file other than the total length and the interlacing mode [...] A concern: files in this archive format need some per-file information attached to them (i.e. compression mode, file name relative to the archive if it needs to be different from cwd, more if the import pipeline needs it).

It is this information that I'd expect to be in the header of the file, and which I'd like to see compressed.

> Specifying that information means a purely CLI approach to archive management would be unwieldy to use for anything of size.

I don't think so; I would imagine the CLI could expose unix-like file commands, e.g. ls, mv, cp, touch, stat, which would decompress+read the header and perform operations, updating/writing the pack file if needed.

> My proposal instead is that the archive CLI rely on a manifest file which is passed to the CLI when packing. I don't currently see a way to parse .zon via the Zig standard library, so in the meantime I'll use INI or JSON or something.

The manifest being a separate file (or perhaps just a chunk at the end of the file) is interesting.

It might be worth thinking about this as two concepts: one blob of data that is just ranges of file bytes, and one blob of data describing everything about those ranges (file name, modtime, compression type, etc.).

I would suggest this: a single file which has a layout like:

```
binary data length (u64)
binary data (a big []const u8): (zero metadata, just arbitrary bytes)
  [file1]
  [file2]
  [file3]
  [...]
metadata length (u32)
metadata (whatever shape is needed):
  [file1 byte range]
  [file1 compression type]
  [file1 name]
  [file1 ...]
```

I'd also suggest using a binary format for the metadata instead of JSON/INI/whatever.

When implementing tools, we'd simply read the binary data length and seek/skip over that many bytes to get to the metadata, at which point you can find any file in the pack. To add files, you would just trim the metadata off the end of the file, append the new file bytes, and write the updated metadata. To delete files, you could zero the bytes and add a special metadata field that marks the file as 'deleted' until you run a garbage-collection operation that rewrites the whole file without that byte range.
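Reading that layout back is cheap. A sketch, assuming little-endian integers and Zig's pre-0.14 std.io reader API:

```zig
const std = @import("std");

// Tool-side read path for the layout above: read the binary data
// length, skip that many bytes, then read the metadata blob.
fn readMetadataAlloc(allocator: std.mem.Allocator, file: std.fs.File) ![]u8 {
    const reader = file.reader();
    const data_len = try reader.readInt(u64, .little);
    try file.seekBy(@intCast(data_len)); // skip over the file bytes
    const meta_len = try reader.readInt(u32, .little);
    const metadata = try allocator.alloc(u8, meta_len);
    errdefer allocator.free(metadata);
    try reader.readNoEof(metadata);
    return metadata;
}
```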

spindlebink commented 1 year ago

Sorry, reading back I realize I wasn't very clear. I was talking about two different subjects: the metadata for each file and its byte range is currently stored in binary in the file itself, and can definitely be compressed. The manifest would describe the file tree pre-packing, so that packing assets into the archive could be a single step done when building a release game package.

Unix-style commands for the CLI make a lot of sense. The problem I was trying to solve in the latter half of my comment is that updating the pack file every time an asset changes, and re-supplying that per-file information each time, could get tedious for projects of scale.

slimsag commented 1 year ago

Gotcha, that makes more sense. I understand what you meant now.

What about a model of, say: create an archive from <this directory>, and everything from that directory gets included, no manifest required? We could support a .machignore file to allow you to exclude certain files.
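A sketch of that model, assuming current std.fs APIs; isIgnored here is a stand-in for eventual .machignore matching:

```zig
const std = @import("std");

// "Create an archive from this directory": walk the tree and hand
// every regular file to the (not yet written) archive writer.
fn packDirectory(allocator: std.mem.Allocator, root_path: []const u8) !void {
    var dir = try std.fs.cwd().openDir(root_path, .{ .iterate = true });
    defer dir.close();

    var walker = try dir.walk(allocator);
    defer walker.deinit();

    while (try walker.next()) |entry| {
        if (entry.kind != .file) continue;
        if (isIgnored(entry.path)) continue; // .machignore exclusions
        // A real implementation would open the file, pick a default
        // compression by extension, and append it to the archive.
        std.debug.print("would pack: {s}\n", .{entry.path});
    }
}

// Placeholder: real matching would parse .machignore patterns.
fn isIgnored(path: []const u8) bool {
    return std.mem.startsWith(u8, path, ".");
}
```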

spindlebink commented 1 year ago

Oooh, that'd be super clean. I guess in that case we'd rely on opinionated defaults (a la your comment above) for most file types (identified via extension), then maybe support a means to override them in specific cases, like Godot's .import files?

slimsag commented 1 year ago

Yeah, that sounds reasonable on the surface 👍 In general I think opinionated defaults are best