Questions about tpxz / file index format

Rogdham commented 3 years ago

Hello,

I am really interested in the “tpxz” format, i.e. indexed tar.xz format used by pixz.

I am planning to implement a Python library that supports that format, and I have written a few questions below. I hope you will have some time to answer them, @vasi!

0. Popularity of tpxz

Is the format really named “tpxz”? What does that stand for?

Do you have any insight into how widely the tpxz format is used?

Do you know if readers or writers of that format have been written in other languages?

I think it is the main selling point of pixz (out of the ones listed in the readme, anyway), now that xz supports multi-threading itself. Do you know if people still use pixz just for the tpxz feature?

1. Structure of pixz's tar file index

Here is my understanding of the file index, did I get it right?

file index:
+-+-+-+-+-+-+-+-+==============+     +==============+-+-+-+-+-+-+-+-+-+
|  index magic  | index record | ... | index record |  index footer   |
+-+-+-+-+-+-+-+-+==============+     +==============+-+-+-+-+-+-+-+-+-+

index record:
+==========+-----------+-+-+-+-+-+-+-+-+
| filename | null byte |     offset    |
+==========+-----------+-+-+-+-+-+-+-+-+

index footer:
+-----------+-+-+-+-+-+-+-+-+
| null byte |     offset    |
+-----------+-+-+-+-+-+-+-+-+

Where:

index magic (8 bytes): the constant \xa6L2.\xd6\x14\xae\xdb
index record is defined as follows:
- filename: name of the file in the index (without the null byte), cannot be empty
- offset (8 bytes): 64-bit integer packed as little-endian, pointing to the tar header block
index footer is defined as follows:
- offset (8 bytes): 64-bit integer packed as little-endian, pointing to the start of the index magic
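The layout described above can be sketched as a small parser. This is only an illustration of the structure as stated in this question, not code taken from pixz: the function name `parse_file_index` is hypothetical, filenames are kept as raw bytes because the thread does not say which encoding applies, and the sample bytes are copied from the hexdump in question 5.

```python
import struct

INDEX_MAGIC = b"\xa6L2.\xd6\x14\xae\xdb"


def parse_file_index(data):
    """Parse a decompressed file index.

    Returns (records, footer_offset), where records is a list of
    (filename_bytes, tar_header_offset) tuples in archive order.
    """
    if data[:8] != INDEX_MAGIC:
        raise ValueError("missing file index magic")
    records = []
    pos = 8
    while True:
        end = data.index(b"\x00", pos)  # filename is null-terminated
        name = data[pos:end]
        (offset,) = struct.unpack_from("<Q", data, end + 1)  # little-endian u64
        pos = end + 9
        if not name:
            # An empty filename is the footer's leading null byte.
            return records, offset
        records.append((name, offset))


# Bytes copied from the hexdump of a.tpxz shown in question 5.
SAMPLE_INDEX = bytes.fromhex(
    "a64c322ed614aedb"
    "612e74787400" "0000000000000000"
    "622e74787400" "0004000000000000"
    "612e74787400" "0008000000000000"
    "00" "0028000000000000"
)

records, footer_offset = parse_file_index(SAMPLE_INDEX)
```

Here the empty filename doubles as the end-of-records marker, which is the role question 4 attributes to the footer's null byte.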

2. Position of file index

The file index is appended just after the tar data. No padding is added, right?

Also, when reading the file index, do you just do the following?

go to 8 bytes before the end of file
read the 8 bytes as 64-bit integer
seek to that offset
you are now at the beginning of the file index
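As a sketch, the four steps above could look like this on an already-decompressed, seekable stream. The function name `seek_to_file_index` is hypothetical, and the magic check after seeking is an extra safeguard added here, not something the steps require.

```python
import io
import struct

INDEX_MAGIC = b"\xa6L2.\xd6\x14\xae\xdb"


def seek_to_file_index(stream):
    """Position a seekable, decompressed stream at the start of the file index."""
    stream.seek(-8, io.SEEK_END)  # go to 8 bytes before the end
    (offset,) = struct.unpack("<Q", stream.read(8))  # read a little-endian u64
    stream.seek(offset)  # seek to that offset
    if stream.read(8) != INDEX_MAGIC:
        raise ValueError("footer offset does not point at a file index")
    stream.seek(offset)  # rewind so the caller starts at the magic
    return offset


# Stand-in data: 10240 bytes of tar data followed by an index with no records.
tar_size = 10240
stream = io.BytesIO(bytes(tar_size) + INDEX_MAGIC + b"\x00" + struct.pack("<Q", tar_size))
index_start = seek_to_file_index(stream)
```

Without the magic check, a file that carries no file index would yield an arbitrary offset from its last 8 bytes.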

An alternative would be to go to the start of the last XZ block, but it may be less reliable (see question 6b).

3. File index alone in one XZ block?

I noticed that the whole file index is alone inside the last xz block. It could have been appended to the tar data in the previous block. Is there any reason for that? Is it mandatory?

I first thought that maybe this would allow you to drop it easily to write more data to the tar archive, but it seems not to be the case as you would need to read the last block of tar data anyways.

4. Null byte at the start of the index footer

I noticed that the null byte at the start of the index footer is used in the code as a marker to stop the list of index records.

However, I feel like it would not really be needed, as you know that you have reached the end of the index records when you are at the end of the stream minus 9 bytes.

It's not really a question, but just to make sure that I did not miss anything.

5. Filenames occurring several times in archive

Nothing special happens here: each occurrence is stored in the file index, and pixz -l will list them all, in the same order as they appear in the tar:

$ ### example file ###
$ base64 -d << EOF > a.tpxz
/Td6WFoAAAFpIt42A8B5gFAhARYAAAAAm+b72eAn/wBxXQAwi4qHxA7yl6T4dT3Fqe3Zl6mYJcaN
Zum5yTtyoaU5BF/n144YGFDDP0Cb9DMh0wr9fnpMBYHquIISbxNF/67v2icGv7eRoa7Pyl2xY1uQ
vRZgKlQ7XtUBsswq41/oV9JRQe9wRnv8PXhJ2s1J1EsAAAAAAACbWW+eAgAhARYAAAB0L+Wj4AA6
ACBdAFMTAkLm8KfZ0WA6AnzzrMxoWqk8HuB5cDMJi0pKC7XAAA4u0rgAAo0BgFA4O6Itgko+MA2L
AgAAAAABWVo=
EOF
$ sha1sum a.tpxz
d9e6a30ba77216bd7613ba46bf5586819a5758a6  a.tpxz

$ pixz -l a.tpxz
a.txt
b.txt
a.txt

$ # here you can see the offsets are in order: 0x0, 0x400, 0x800
$ unxz < a.tpxz | (head -c 10240 >/dev/null; hexdump -C) # only show file index
00000000  a6 4c 32 2e d6 14 ae db  61 2e 74 78 74 00 00 00  |.L2.....a.txt...|
00000010  00 00 00 00 00 00 62 2e  74 78 74 00 00 04 00 00  |......b.txt.....|
00000020  00 00 00 00 61 2e 74 78  74 00 00 08 00 00 00 00  |....a.txt.......|
00000030  00 00 00 00 28 00 00 00  00 00 00                 |....(......|
0000003b

It's not really a question, but just to make sure that I did not miss anything.
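To illustrate, here is a sketch that builds a file index from a tar using Python's tarfile module, emitting one record per member in archive order with no deduplication. The function name `build_file_index` is hypothetical, and encoding names as UTF-8 is an assumption. How pixz treats members with extended headers (pax, GNU long names) is not covered in this thread, so this only handles plain members.

```python
import io
import struct
import tarfile

INDEX_MAGIC = b"\xa6L2.\xd6\x14\xae\xdb"


def build_file_index(tar_bytes):
    """Build a file index for an uncompressed tar, one record per member."""
    out = bytearray(INDEX_MAGIC)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            # member.offset is the position of the member's tar header block.
            out += member.name.encode("utf-8") + b"\x00"
            out += struct.pack("<Q", member.offset)
    # Footer: null byte, then the offset where the index magic will start,
    # assuming the index is appended right after the tar data.
    out += b"\x00" + struct.pack("<Q", len(tar_bytes))
    return bytes(out)


# Build a small tar with a repeated filename, each member one byte long.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name in ("a.txt", "b.txt", "a.txt"):
        info = tarfile.TarInfo(name)
        info.size = 1
        tar.addfile(info, io.BytesIO(b"x"))

index = build_file_index(buf.getvalue())
```

Both a.txt entries end up in the index, each with its own offset.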

6a. Xz block size

It seems that pixz creates an xz block every 0x1000000 bytes of input. Is that always the case? How would pixz behave reading xz blocks of various sizes?

6b. Xz file index block size

Also, if there are many files in the tar, the pixz file index could be greater than 0x1000000 bytes: would it be stored in more than one block?

6c. Synchronizing xz block size with tar files

When adding a tar file (including tar header, etc.), if there is not enough space left in the current xz block, we could move to a new xz block before adding that tar file.

If we want to extract only that file later, this would allow us to read only one block instead of two (or to be precise one less block in total).

What are your thoughts on that optimization? Would such a tpxz archive be fully compatible with pixz (i.e. would pixz read that kind of archive happily)?
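A minimal sketch of the idea, to make the proposal concrete. The names `tar_footprint` and `plan_blocks` are hypothetical and this is not how pixz chooses block boundaries. It assumes plain tar members (one 512-byte header plus data padded to 512 bytes) and gives a member larger than the block size an oversized block of its own.

```python
def tar_footprint(data_size):
    """Bytes a plain tar member occupies: 512-byte header plus padded data."""
    return 512 + (data_size + 511) // 512 * 512


def plan_blocks(member_sizes, block_size):
    """Group consecutive member footprints into xz blocks.

    A new block is started whenever the next member would not fit in the
    space left, so no member straddles two blocks unless it is larger than
    block_size on its own.
    """
    blocks, current, used = [], [], 0
    for size in member_sizes:
        if current and used + size > block_size:
            blocks.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        blocks.append(current)
    return blocks


plan = plan_blocks([600, 300, 200, 1500, 100], block_size=1000)
```

The trade-off is that blocks become smaller than the target on average, which may cost some compression ratio.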

vasi commented 3 years ago

Wow, so many questions! I'll try to answer them all:

0

tpxz just follows how other compressed tar files are named, e.g. .tar.gz == .tgz, .tar.bz2 == .tbz2. So if pixz creates .tar.pxz files, it can also call them .tpxz
No idea how popular it is, I don't know how I'd even find out
I'm not aware of any other readers or writers of this format
There's one more advantage of pixz over xz -T, it does parallel _de_compression. But yes, the indexed format is also useful!

1

Nice diagram, thanks! It looks correct
The number in the footer offset isn't really used, as far as I can remember. I'm not sure what I put it there for in the first place!

2

Correct, there's no padding before the file index
To find the file index, we just use the last XZ block. We don't actually use the footer offset, unless I'm misremembering

3

The file index needs its own block so that we can find it easily, by just looking for the last XZ block! (see above) You're right that we could theoretically use the footer offset instead to find it, and then just append the index to the tar data. But I wanted to keep it simple.
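For readers without liblzma's block-level API, here is a pure-Python sketch of "go to the last XZ block". It is not what pixz does internally. It assumes a single-stream .xz file with no stream padding, and it relies on the xz container layout (12-byte stream header and footer, index listing each block's unpadded size). Feeding the stream header followed by only the last block to `lzma.LZMADecompressor` is a workaround, not a documented use of that class. The sample is the a.tpxz archive from question 5.

```python
import base64
import lzma
import struct

XZ_HEADER_SIZE = 12  # the stream header and stream footer are 12 bytes each

# a.tpxz, copied from the base64 listing in question 5.
SAMPLE = base64.b64decode(
    "/Td6WFoAAAFpIt42A8B5gFAhARYAAAAAm+b72eAn/wBxXQAwi4qHxA7yl6T4dT3Fqe3Zl6mYJcaN"
    "Zum5yTtyoaU5BF/n144YGFDDP0Cb9DMh0wr9fnpMBYHquIISbxNF/67v2icGv7eRoa7Pyl2xY1uQ"
    "vRZgKlQ7XtUBsswq41/oV9JRQe9wRnv8PXhJ2s1J1EsAAAAAAACbWW+eAgAhARYAAAB0L+Wj4AA6"
    "ACBdAFMTAkLm8KfZ0WA6AnzzrMxoWqk8HuB5cDMJi0pKC7XAAA4u0rgAAo0BgFA4O6Itgko+MA2L"
    "AgAAAAABWVo="
)


def read_varint(buf, pos):
    """Decode one xz variable-length integer; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def last_block_span(xz_bytes):
    """Return (start, end) file offsets of the last block, padding included."""
    footer = xz_bytes[-XZ_HEADER_SIZE:]
    if footer[-2:] != b"YZ":
        raise ValueError("not an xz stream footer")
    (backward_size,) = struct.unpack("<I", footer[4:8])
    index_size = (backward_size + 1) * 4
    index_start = len(xz_bytes) - XZ_HEADER_SIZE - index_size
    index = xz_bytes[index_start:index_start + index_size]
    if index[0] != 0:
        raise ValueError("missing xz index indicator")
    count, pos = read_varint(index, 1)
    if count == 0:
        raise ValueError("stream contains no blocks")
    start = end = XZ_HEADER_SIZE  # the first block follows the stream header
    for _ in range(count):
        unpadded_size, pos = read_varint(index, pos)
        _uncompressed_size, pos = read_varint(index, pos)
        start = end
        end = start + (unpadded_size + 3) // 4 * 4  # blocks are 4-byte aligned
    return start, end


def read_last_block(xz_bytes):
    """Decompress only the last block of the stream."""
    start, end = last_block_span(xz_bytes)
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    # The original stream header carries the check type the block uses.
    return decompressor.decompress(xz_bytes[:XZ_HEADER_SIZE] + xz_bytes[start:end])


file_index = read_last_block(SAMPLE)
```

Only the xz index and the last block are touched, so the tar data in earlier blocks is never decompressed.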

4

Sure, you technically could save a byte by dropping the null byte

5

Yup, we don't do anything special to file names

6

Pixz's block sizes are controllable through a variety of parameters. The -1, -2, etc compression levels will affect it, as will -e or -f. See the manpage for details. Pixz can handle different block sizes just fine.
The file index is always stored in just one block. Since we don't parallelize it at all, and it's structured pretty sequentially, there's not much point to splitting it.
Pixz is happy reading blocks of different sizes, so you could certainly attempt to optimize where block boundaries exist. I'm not sure it makes much of a difference to performance in most use-cases, but maybe if you have something very specific in mind

Rogdham commented 3 years ago

Thank you so much for your detailed reply, you're awesome! :heart_eyes:

So the only thing still unclear is the offset in the file index footer, i.e. the very last 8 bytes. Could I rely on it to find the beginning of the file index? Doing so would spare readers from needing to know about the xz format at all, which could make the implementation easier.

I understand that when creating a tpxz file, the file index must be alone in the last xz block, but my point is about reading a tpxz file.

calestyo commented 3 years ago

What kinda fits to this:

It would be nice if the pixz manpage could clarify whether or not the files it creates (including those with indexing) are expected to be compatible with the "standard" xz-utils.

Rogdham commented 3 years ago

It would be nice if the pixz manpage could clarify whether or not the files it creates (including those with indexing) are expected to be compatible with the "standard" xz-utils.

@calestyo: yes, they are compatible with "standard" xz-utils:

in non-tar mode (no pixz indexing): the created files are regular valid xz files, that can be decompressed with unxz
in tar mode (with pixz indexing): the created files are compatible with tar xJ (i.e. GNU tar using xz -d to decompress, other implementations of tar are expected to work as well); the only change is that the generated tar file (within the xz compression) has some data after the tar's end-of-archive, which is expected to be ignored by tar readers
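A quick way to check the "data after end-of-archive is ignored" behaviour with Python's own tarfile module. This is a sketch only: the trailing bytes are a stand-in for a file index, and `lzma.compress` writes everything into a single xz block, whereas pixz places the index in its own block.

```python
import io
import lzma
import tarfile

# Build a small uncompressed tar in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name in ("a.txt", "b.txt", "a.txt"):
        info = tarfile.TarInfo(name)
        info.size = 1
        tar.addfile(info, io.BytesIO(b"x"))
tar_bytes = buf.getvalue()

# Stand-in for a file index: magic, footer null byte, footer offset.
trailing = b"\xa6L2.\xd6\x14\xae\xdb" + b"\x00" + len(tar_bytes).to_bytes(8, "little")
archive = lzma.compress(tar_bytes + trailing)

# A regular tar reader stops at the end-of-archive marker and never
# looks at the bytes that follow it.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:xz") as tar:
    names = tar.getnames()
```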

vasi commented 3 years ago

Yes, and note that even in tar mode, the files are still 100% compatible with xz/liblzma.

@Rogdham

So the only thing still unclear is the offset in the file index footer, i.e. the very last 8 bytes. Could I rely on it to find the beginning of the file index? Doing so would spare readers from needing to know about the xz format at all, which could make the implementation easier.

Hm, how will you seek to the last 8 bytes of the file-index without understanding the xz file format? You'll need liblzma (or equivalent) to do that, and once you have liblzma it's not really any harder to ask "please go to the last block in the file".

Plus there are other downsides:

Since you're checking the footer before the magic, if the file doesn't have a pixz-file-index, you'll see totally invalid data when you try to read the footer. But you'll have no way to know it's invalid, so maybe you end up seeking to a random point in the file. Dealing with that sounds unpleasant.
liblzma (or whatever else you're using to seek in an xz file) will have to read & decompress the entire file-index just to get to the footer, since it's all in one block. It's pretty silly to decompress all that data just to get 8 bytes, then throw it out, and read/decompress it all over again.

Rogdham commented 3 years ago

Hm, how will you seek to the last 8 bytes of the file-index without understanding the xz file format? You'll need liblzma (or equivalent) to do that, and once you have liblzma it's not really any harder to ask "please go to the last block in the file".

Well, for example, the native lzma module in Python can decompress xz files into streams without exposing any information about xz block boundaries. I will not be able to use it anyway, because seeking to xz block 2 would mean decompressing block 1 for nothing with that implementation, but it was just an example.

It's pretty silly to decompress all that data just to get 8 bytes, then throw it out, and read/decompress it all over again.

Very good point! For some reason I completely missed that, my bad.

So I'm convinced! I will definitely rely on the pixz file index starting at the last xz block instead of using the offset in the file index footer.

vasi / pixz