oleiade opened 1 year ago
Hey, I have some comments about this, but correct me if I'm wrong.
TAR was essentially designed for streaming files stored on sequential media, such as tapes. So it's a good fit for sequential access, but not so much for random or direct access, since it lacks an index. It's not a compression format, since files are just concatenated sequentially.
> as far as we know, the tar archive format and the existing Go libraries do not allow us to:
> - directly access a file from the archive's content, one has to either extract it on disk
The `archive/tar` stdlib doesn't allow you direct access, true, but the file doesn't need to be extracted to disk. Instead, we need to loop over all headers and check the filename, as we currently do in `lib.ReadArchive`.
> directly seek a position within the archive: such as the starting position of a file's content
Considering k6 script archives rarely have much more than a few dozen files, if that, I don't see why the above looping method would impact reading performance. If a file is large, then it can be quickly skipped, given that its size is stored in the header. So calling `reader.Next()` should be a relatively quick operation, though I haven't run any benchmarks to confirm it.
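For illustration, here's roughly what that sequential scan looks like with the stdlib (a minimal sketch, not the actual `lib.ReadArchive` code; the archive and file names are made up):

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"io"
	"os"
)

// findInTar scans the archive sequentially and returns a reader
// positioned at the contents of the named file. Next() skips large
// entries cheaply because each header stores the entry's size.
func findInTar(r io.Reader, name string) (io.Reader, error) {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil, fmt.Errorf("%s not found in archive", name)
		}
		if err != nil {
			return nil, err
		}
		if hdr.Name == name {
			return tr, nil // tr now yields exactly this entry's bytes
		}
	}
}

func main() {
	f, err := os.Open("archive.tar")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	file, err := findInTar(f, "data/users.csv")
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(os.Stdout, file); err != nil {
		panic(err)
	}
}
```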
That said, it should be trivial to seek within the underlying file before passing it to `tar.NewReader()`, as shown in this example. The not-so-trivial aspect is knowing where to seek, which could be done by maintaining a separate file index. We could store it directly in the TAR when we write the archive, or append it later to any old archives. This Python project does that.
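A sketch of that index idea, under the assumption that `archive/tar` reads header blocks straight from the underlying file without buffering ahead (true of the current stdlib implementation), so the file position right after `Next()` is the start of the entry's data. `buildIndex` and `open` are illustrative names, not existing k6 APIs:

```go
package tarindex

import (
	"archive/tar"
	"errors"
	"fmt"
	"io"
	"os"
)

type entry struct {
	offset int64 // byte offset of the file's data within the archive
	size   int64
}

// buildIndex scans the archive once and maps file names to offsets.
func buildIndex(f *os.File) (map[string]entry, error) {
	index := make(map[string]entry)
	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return index, nil
		}
		if err != nil {
			return nil, err
		}
		// Right after Next(), the underlying file is positioned at the
		// start of this entry's data (headers are read in full blocks).
		pos, err := f.Seek(0, io.SeekCurrent)
		if err != nil {
			return nil, err
		}
		index[hdr.Name] = entry{offset: pos, size: hdr.Size}
	}
}

// open seeks straight to a previously indexed entry and returns a
// reader bounded to that entry's size, with no sequential scan.
func open(f *os.File, index map[string]entry, name string) (io.Reader, error) {
	e, ok := index[name]
	if !ok {
		return nil, fmt.Errorf("%s not in index", name)
	}
	if _, err := f.Seek(e.offset, io.SeekStart); err != nil {
		return nil, err
	}
	return io.LimitReader(f, e.size), nil
}
```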
> directly read or stream parts of the content of an archive
This should be possible with `archive/tar`, no? `tar.NewReader()` returns a `*tar.Reader`, and we can decide how much to read at a time with `Read(p []byte)`.
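For instance, a small sketch of streaming the current entry in bounded chunks (the `handle` callback is a made-up placeholder):

```go
package tarstream

import (
	"archive/tar"
	"errors"
	"io"
)

// streamEntry consumes the current tar entry in fixed-size chunks, so
// memory use stays bounded regardless of the entry's size. handle is a
// hypothetical callback for whatever processes each chunk.
func streamEntry(tr *tar.Reader, handle func(chunk []byte) error) error {
	buf := make([]byte, 64*1024) // 64 KiB at a time
	for {
		n, err := tr.Read(buf)
		if n > 0 {
			if herr := handle(buf[:n]); herr != nil {
				return herr
			}
		}
		if errors.Is(err, io.EOF) {
			return nil // entry fully consumed
		}
		if err != nil {
			return err
		}
	}
}
```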
So the main feature here, streaming files, should already be possible, while seeking to specific files shouldn't be required for our use case. Am I missing something about what's needed here?
Hey @imiric, you're correct all the way, and most of the points you've brought up were explicitly discussed during our internal workshop on the large-file-handling topic, which has the ultimate goal of offering a streaming CSV parser.
To be more precise, we don't see the tar file format itself as the issue. Our goal here would rather be to offer a high-level API with user-friendly, fs-like capabilities (opening and reading a specific file) for interacting with the content of archives, and indeed for streaming specific files from them. We don't doubt that it's already possible, and if the description made it sound like no solution exists yet, then apologies, it needs rephrasing 😃
This issue was created specifically with the aim of offering an interface that makes it possible to "open" and "stream" the content of a file from a tar archive as if it were on disk.
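One possible shape for that: Go's `io/fs` already defines the fs-like contract, so a thin wrapper over an offset index (like the one sketched earlier) could expose a tar archive as a read-only filesystem. All names below are hypothetical, not existing k6 code:

```go
package tarfs

import (
	"archive/tar"
	"io"
	"io/fs"
	"os"
)

// FS exposes an indexed tar archive through the standard fs.FS
// interface, so files can be opened "as if they were on disk".
type FS struct {
	f     *os.File              // the archive itself, kept open
	index map[string]indexEntry // built by a one-time sequential scan
}

type indexEntry struct {
	hdr    *tar.Header // carries size, mode, mod time, etc.
	offset int64       // where the entry's data starts in the archive
}

type file struct {
	hdr *tar.Header
	r   io.Reader
}

func (f *file) Stat() (fs.FileInfo, error) { return f.hdr.FileInfo(), nil }
func (f *file) Read(p []byte) (int, error) { return f.r.Read(p) }
func (f *file) Close() error               { return nil }

// Open satisfies fs.FS; a SectionReader keeps concurrent opens safe
// because it reads at absolute offsets instead of seeking the file.
func (tfs *FS) Open(name string) (fs.File, error) {
	e, ok := tfs.index[name]
	if !ok {
		return nil, &fs.PathError{Op: "open", Path: name, Err: fs.ErrNotExist}
	}
	sr := io.NewSectionReader(tfs.f, e.offset, e.hdr.Size)
	return &file{hdr: e.hdr, r: sr}, nil
}
```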
I'll include your input in the initial description of the issue to reflect those aspects better 🙇🏻
We can probably build all of these things directly on top of the current .tar format, with some extra work. However, it might be simpler and quicker to build them by just moving away from .tar entirely. Besides the simplicity of not having to build all of these APIs on top of an unsuitable format, using another format for script bundles may also bring additional benefits like reduced file size, checksums for error detection, richer signing and metadata support, etc. Before we commit, we should at least consider moving away from .tar files to something like .zip, or even something like SquashFS or whatever Docker containers use for their image layers.
I think it might be simpler because we would only need to add transparent conversion of old k6 .tar archives to whatever new format we pick once, before a test using an old .tar archive starts. This could be a fairly self-contained part of k6 which won't "infect" the rest of the codebase with convoluted .tar wrapper APIs. And for the few cases where this needs to happen (most people don't execute .tar archives directly), there should only be a negligible performance impact, probably less than the Babel.js impacts we currently suffer from. Considering that the .tar conversion can happen in a streaming fashion, even memory usage shouldn't be that high... :thinking:
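To illustrate the random-access point: Go's `archive/zip` already provides the lookup that tar lacks, since the zip central directory acts as a built-in index, and since Go 1.16 `*zip.Reader` even implements `fs.FS`. A small sketch with made-up file names:

```go
package main

import (
	"archive/zip"
	"io"
	"os"
)

func main() {
	zr, err := zip.OpenReader("archive.zip")
	if err != nil {
		panic(err)
	}
	defer zr.Close()

	// Open a single file directly, no sequential scan needed: the zip
	// central directory already serves as the index that tar lacks.
	f, err := zr.Open("data/users.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := io.Copy(os.Stdout, f); err != nil {
		panic(err)
	}
}
```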
Thanks a lot for your feedback @na--
In the interest of transparency: we did indeed touch upon switching the archive format during our workgroup meetings 👍🏻 I think there is no firm opposition to going that route either.
My main observation on that front would be that, considering the `tar` format is used across the whole k6 stack and infrastructure, the cost (in terms of organizing and synchronizing the move) and the risk involved in switching might be higher than, say, committing to more minor improvements to the format we use now.
You have a much finer-grained insight than me on what would be involved in such a move, though, so I would trust your judgment.
In terms of cost and risk (time to delivery, potential new problems created by the switch, upkeep cost of dealing with the change), how would you evaluate switching to a new format (regardless of which one) compared to improving what we have now? Would it be worth starting by improving the existing use of `tar` as proposed in the issue description, and then planning a larger project to switch the archive format completely? Or would you say the relative costs mean we'd be better off just going with the new archive format?
I don't know, it depends. My whole point is that we shouldn't discount switching to another format, since that might turn out to be easier. But it might very well not be... I think the right approach here would be a bit of experimentation before we commit to anything: make a quick-and-dirty proof of concept with .tar improvements, and another for switching to something else, and only then decide which way we want to go.
Or, better yet, we should probably not focus on this issue too much initially. Solving the rest of the problems and prerequisites for https://github.com/grafana/k6/issues/2974 should probably be the focus and leading factor when making decisions. For example, even if we had a 500+ MB .tar archive, if we only read its full contents into memory exactly once, without any further wholesale copying, that would probably be a significant improvement over the current state of affairs and we might not need to do anything about this issue for a while.
To me, the other issues that you added (https://github.com/grafana/k6/issues/2977, https://github.com/grafana/k6/issues/2978) definitely seem more important than this one. If we solve just this issue, solving them will still be required for significant memory savings. If we only solve them, we could probably postpone solving this issue almost indefinitely or get away with only minor improvements :thinking:
(edit: expanded on this in https://github.com/grafana/k6/issues/2974#issuecomment-1495777450)
To focus back on topic: from my memory of the current .tar handling, it was pretty dependent on having a copy of the whole .tar contents in memory as an `afero.MemMapFs` :disappointed: If it's possible to abstract that away, it might be easier to stick with .tar files, for now. If it isn't, then going with something else that can natively expose a FS interface without reading everything into memory might be better.
However, considering we also cache every file in memory when we execute `k6 run script.js`, partially so we can then potentially use those `MemMapFs` file systems to easily create the .tar archive, I am not sure how easy it will be to avoid significant refactoring work either way :disappointed: https://github.com/grafana/k6/issues/1079 is also a factor in making decisions, given the various issues we've had with the afero library before... :disappointed:
Problem statement
As part of the research around improving the handling of large files in k6, we found that expanding k6's internal capabilities for interacting with tar archives was a key requirement.
To be able to open, seek through, read, and stream the content of files in a memory-efficient and cloud-compatible way, we came to the conclusion that we need to be able to perform those operations over a tar archive without first extracting it.
However, the tar file format was designed with tape storage in mind, so each feature in the set we're looking for either needs to be supported by the library we use to interface with the format, or needs some work on our side to provide that capability to k6.
Knowledge and Assumptions
Some more research and experimentation are needed on that front, but our initial set of assumptions regarding the tar archive format and our use of it in k6 is the following: as far as we know, the tar archive format and the existing Go libraries do not allow us to:
- directly access a file from the archive's content, one has to either extract it on disk
- directly seek a position within the archive: such as the starting position of a file's content
- directly read or stream parts of the content of an archive
Solution Space
We want to explore adding the ability for k6 to essentially treat tar archives as filesystems of their own, from which we can directly read, stream, and seek into specific files.
Some of those capabilities should be possible using the existing tar library, and others need to be built. We believe that, to address the larger issue of consuming large files in a portable and memory-efficient manner, we would benefit from providing a higher-level API over our tar archives, one that would allow us to open, read, seek through, and stream specific files contained in them in a high-level, user-friendly (and format-agnostic?) manner.
A target API could support (non-binding! all specific names, descriptions, and scope are just here for reference and absolutely non-final, this is ideation):
- a `LoadArchive` operation which would open the tar archive, and go through it once while building an index of its content
- an `archive.open(filename string)` function which would return a file-descriptor (`io.Reader`) like handle to a specific file in the archive?
- `FileHandle.Read([]byte)` (implementing `io.Reader`) and `FileHandle.Seek(offset int)` operations which would respectively read bytes from the file and allow moving the "reading head" through a specific file in the archive.
- a `FileHandle.Stream(filename) <- chan byte` which would return a channel to consume the data from the file.
- an `archive.Read(filename string)` function which would return the content of a specific file in the archive as a `[]byte`?
- an `archive.Stream(filename string) <- chan byte` which would return a stream of data from a specific file in the archive
- an `archive.Exists(filename string)` operation to find out if an archive contains a given file
- an `archive.List() []FileStat` operation to list the archive's content

All those functionalities could likely be built on the existing libraries.
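To make the ideation above concrete, here's one rough, non-binding way it could look as Go interfaces. Every name mirrors the list above and remains hypothetical; `Stream` is sketched as a channel of chunks rather than of single bytes, since a `chan byte` would be prohibitively slow:

```go
package archive

import "io"

// FileStat describes one entry in the archive.
type FileStat struct {
	Name string
	Size int64
}

// FileHandle gives sequential and random access to one archived file.
type FileHandle interface {
	io.Reader // Read([]byte) (int, error)
	io.Seeker // Seek(offset int64, whence int) (int64, error)
	io.Closer
}

// Archive is the fs-like view over a loaded script archive.
type Archive interface {
	// Open returns a file-descriptor-like handle to a specific file.
	Open(filename string) (FileHandle, error)
	// Read returns the full content of a specific file.
	Read(filename string) ([]byte, error)
	// Stream returns a channel delivering the file's data in chunks.
	Stream(filename string) (<-chan []byte, error)
	// Exists reports whether the archive contains the given file.
	Exists(filename string) bool
	// List enumerates the archive's content.
	List() []FileStat
}

// LoadArchive would open the tar archive and scan it once, building an
// index of its content (see the offset-index sketch earlier).
func LoadArchive(r io.ReaderAt) (Archive, error) {
	// Implementation intentionally omitted in this ideation sketch.
	panic("not implemented")
}
```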