pear / Archive_Tar

http://pear.php.net/package/Archive_Tar

Only Partial extraction of large tarball #50

Open marcelklehr opened 2 weeks ago

marcelklehr commented 2 weeks ago

File is available here: download.nextcloud.com/server/apps/text2image_stablediffusion/stable-diffusion-xl-base-1.0

$tarManager = new TAR($archivePath);
$tarManager->extract(__DIR__);

Result:

ls apps/text2image_stablediffusion/models/stable-diffusion-xl
unet  vae_decoder  vae_encoder

Plus some warnings:

Uninitialized string offset 2 at /nextcloud/server/3rdparty/pear/archive_tar/Archive/Tar.php#1790

Expected behavior

Should extract all folders

mcdruid commented 2 weeks ago

Thanks for the report.

I've not had a chance to try to reproduce this yet, but just noting that the models.tar.gz file is 7.6G.

My first guess would be that PHP is hitting memory exhaustion.

How much memory is allocated / available?

Are there any error messages other than the Uninitialized string offset 2 at ... ?

What PHP version are we talking about here?

mcdruid commented 2 weeks ago

Was able to reproduce this with a quick test in a PHP shell (php -a):

php > include('/path/to/Archive_Tar/vendor/autoload.php');
php > include('/path/to/Archive_Tar/Archive/Tar.php');
php > $tarManager = new Archive_Tar('/tmp/models.tar.gz');
php > $tarManager->extract(__DIR__);
PHP Warning:  Uninitialized string offset 2 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 3 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 4 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 5 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 6 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 7 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 8 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 9 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 10 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning:  Uninitialized string offset 11 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
php >

I was left with:

$ du -sh stable-diffusion-xl/
320M    stable-diffusion-xl/

$ ls stable-diffusion-xl/
unet  vae_decoder  vae_encoder

...which seems like it's definitely not the full content of the archive.

That's on:

$ php -v
PHP 8.1.2-1ubuntu2.18 (cli) (built: Jun 14 2024 15:52:55) (NTS)

...with:

php > print ini_get('memory_limit');
-1

On a machine with 32G of memory, FWIW.

Extracting the same archive with the standard tar utility:

$ tar zxvf models.tar.gz 
stable-diffusion-xl/
stable-diffusion-xl/vae_encoder/
stable-diffusion-xl/vae_encoder/model.onnx
stable-diffusion-xl/vae_encoder/config.json
stable-diffusion-xl/vae_decoder/
stable-diffusion-xl/vae_decoder/model.onnx
stable-diffusion-xl/vae_decoder/config.json
stable-diffusion-xl/unet/
stable-diffusion-xl/unet/model.onnx_data
stable-diffusion-xl/unet/model.onnx
stable-diffusion-xl/unet/config.json
stable-diffusion-xl/model_index.json
stable-diffusion-xl/tokenizer_2/
stable-diffusion-xl/tokenizer_2/vocab.json
stable-diffusion-xl/tokenizer_2/special_tokens_map.json
stable-diffusion-xl/tokenizer_2/tokenizer_config.json
stable-diffusion-xl/tokenizer_2/merges.txt
stable-diffusion-xl/tokenizer/
stable-diffusion-xl/tokenizer/vocab.json
stable-diffusion-xl/tokenizer/special_tokens_map.json
stable-diffusion-xl/tokenizer/tokenizer_config.json
stable-diffusion-xl/tokenizer/merges.txt
stable-diffusion-xl/text_encoder/
stable-diffusion-xl/text_encoder/model.onnx
stable-diffusion-xl/text_encoder/config.json
stable-diffusion-xl/scheduler/
stable-diffusion-xl/scheduler/scheduler_config.json
stable-diffusion-xl/text_encoder_2/
stable-diffusion-xl/text_encoder_2/model.onnx_data
stable-diffusion-xl/text_encoder_2/model.onnx
stable-diffusion-xl/text_encoder_2/config.json

$ du -sh stable-diffusion-xl/
13G     stable-diffusion-xl/

So yup, looks like a bug.

mcdruid commented 2 weeks ago

I've only been able to have a quick look at this so far, but I'm wondering if it might be some bad data inside the tarball - which presumably the standard tar utility is able to handle.

The problem seems to happen when Archive_Tar tries to process stable-diffusion-xl/unet/model.onnx_data inside the archive.

(I am not an expert on this code and I may get some terminology and/or interpretation wrong...)

It tries to extract the metadata from the record:

$v_data = unpack($this->_fmt, $v_binary_data);

...and seems to get a bad/corrupt value for size:

size = "�"

[screenshot: debugger showing the corrupted size value]

Not surprisingly, it looks like things go bad from there.

I'm not sure if this really is a problem in the data, or a manifestation of Archive_Tar's inability to process the data properly.

I think it could be the former, but you could argue that it's both if other tools are able to successfully extract the archive.

mcdruid commented 2 weeks ago

After doing a few experiments (e.g. re-creating the archive with tar), it looks like the data is okay; it's just that Archive_Tar cannot handle the very large file within the archive.

stable-diffusion-xl/unet/model.onnx_data seems to be close to 10G uncompressed.

I suspect it wasn't very common to work with files of that size in PHP when most of the code in this project was written.

I'll have a closer look at what happens when Archive_Tar tries to process this within the archive, but it looks like for some reason it's not able to parse the size of the file properly.
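
To make the failure mode concrete, here's a sketch of the general scheme (mine, not Archive_Tar's actual code): a tar stream is a sequence of 512-byte records, and each entry is one header record followed by ceil(size / 512) data records, so misparsing a single size field desynchronises every entry that follows it.

// Walk a gzipped tar listing entry names and sizes - a sketch of the
// general scheme only, not Archive_Tar's actual code.
$fp = gzopen('/tmp/models.tar.gz', 'rb');
while (($header = gzread($fp, 512)) !== false && strlen($header) === 512) {
    if (trim($header, "\0") === '') {
        break; // zero-filled records mark the end of the archive
    }
    $name = unpack('Z100name', $header)['name'];
    // Size field is at offset 124, 12 bytes; this classic octal read is
    // exactly what breaks for base-256 encoded (> 8G) sizes.
    $size = octdec(trim(substr($header, 124, 12)));
    printf("%s (%d bytes)\n", $name, $size);
    // Skip the data records; a wrong $size seeks to the wrong place and
    // every subsequent "header" is garbage.
    gzseek($fp, gztell($fp) + (int) ceil($size / 512) * 512);
}
gzclose($fp);

That would be consistent with everything after model.onnx_data going missing from the extraction.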

mcdruid commented 2 weeks ago

https://en.wikipedia.org/wiki/Tar_(computing)

... although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, in 2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.

Looks like that's the limitation that's being hit when Archive_Tar processes this archive.

So I think this is not so much a bug as a feature request to implement the extension to the tar format whereby files > 8GB are supported.

At present, they are not.
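
For reference, the largest size 11 octal digits can express is one byte short of 8GiB:

php > print octdec('77777777777');
8589934591

...whereas model.onnx_data is close to 10G uncompressed.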

marcelklehr commented 2 weeks ago

Mmh, what about this PR? https://github.com/pear/Archive_Tar/pull/15

Thank you for investigating this!

mcdruid commented 2 weeks ago

Yeah that looks like it's implementing support for the right thing, but sadly it doesn't seem to be working properly.

More background info on tar formats:

https://www.gnu.org/software/tar/manual/html_node/Formats.html#Formats

As far as I can see, Archive_Tar does support ustar but possibly not posix (or at least the implementation doesn't work fully).

mcdruid commented 2 weeks ago

https://mort.coffee/home/tar/ looks like a pretty good write up (I think I recognise the author :) ).

I'm not sure yet whether the data that gets passed to the new \Archive_Tar::_tarRecToSize method has been incorrectly extracted from the header / archive, or whether there's a problem within the processing in the method itself.

As noted earlier, the extracted size certainly looks broken as represented in my IDE when I step through, but that may just be a quirk of the way the octal values are being displayed.

FWIW the $v_binary_data param that's being passed to \Archive_Tar::_readHeader looks like:

stable-diffusion-xl/unet/model.onnx_data000066400017500001750�d!�14505263367020003 0ustar  marcelmarcel
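
Reading that against the field widths in the unpack format (and assuming the NUL padding bytes were simply dropped in the paste), my rough interpretation is:

name      stable-diffusion-xl/unet/model.onnx_data   (100 bytes)
mode      0000664        (8 bytes, octal)
uid       0001750        (8 bytes, octal - 0o1750 = 1000)
gid       0001750        (8 bytes, octal)
size      �d!�           (12 bytes - raw binary, not octal digits)
mtime     14505263367    (12 bytes, octal)
checksum  020003         (8 bytes)
typeflag  0
magic     ustar
uname     marcel         (32 bytes)
gname     marcel         (32 bytes)

i.e. size is the only numeric field here that isn't a readable octal string.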

I've run out of time to look at this for now, but will try to come back to it ASAP.

mcdruid commented 2 weeks ago

As far as I can see, this problem goes away if we revert to using the a format to unpack the value for size from the header:

         if (version_compare(PHP_VERSION, "5.5.0-dev") < 0) {
             $this->_fmt = "a100filename/a8mode/a8uid/a8gid/a12size/a12mtime/" .
                 "a8checksum/a1typeflag/a100link/a6magic/a2version/" .
                 "a32uname/a32gname/a8devmajor/a8devminor/a131prefix";
         } else {
-            $this->_fmt = "Z100filename/Z8mode/Z8uid/Z8gid/Z12size/Z12mtime/" .
+            $this->_fmt = "Z100filename/Z8mode/Z8uid/Z8gid/a12size/Z12mtime/" .
                 "Z8checksum/Z1typeflag/Z100link/Z6magic/Z2version/" .
                 "Z32uname/Z32gname/Z8devmajor/Z8devminor/Z131prefix";
         }

I'm not yet exactly sure why - it seems that Z behaves slightly differently in a way that corrupts the size data when the field has been formatted differently to accommodate a > 8GB file.
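
For what it's worth, the documented difference is that when unpacking, a returns the field verbatim while Z discards the first NUL byte and everything after it - and a base-256 encoded size field is mostly NUL bytes. A quick demo (mine; assuming a 10GiB file for the sake of the example):

php > // 12-byte size field for a 10GiB file, base-256 encoded:
php > // a 0x80 flag byte, then the value in big-endian binary.
php > $field = "\x80\x00\x00\x00\x00\x00\x00\x02\x80\x00\x00\x00";
php > var_dump(strlen(unpack('a12size', $field)['size']));
int(12)
php > var_dump(strlen(unpack('Z12size', $field)['size']));
int(1)

If that's what's happening here, it would also account for the Uninitialized string offset warnings: _tarRecToSize walks the full 12-byte field but receives a much shorter string.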

https://mort.coffee/home/tar/ says:

If a file is over 8GiB, it will set the high bit of the first character in file_size, and the rest of the string is parsed as base 256 (i.e it's treated as a 95-bit integer, big endian).

...and that does seem to be what the \Archive_Tar::_tarRecToSize method tries to do, but it looks like the data's not extracted correctly to allow that to work... AFAICS.
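
For illustration, the scheme described above boils down to something like this (a sketch of the algorithm, not the actual _tarRecToSize code):

// Decode a 12-byte tar size field - a sketch of the scheme described
// above, not the actual \Archive_Tar::_tarRecToSize implementation.
function decodeTarSize($field)
{
    if (ord($field[0]) & 0x80) {
        // High bit set: mask it off and read the rest as big-endian base 256.
        $size = ord($field[0]) & 0x7f;
        for ($i = 1; $i < strlen($field); $i++) {
            $size = ($size * 256) + ord($field[$i]);
        }
        return $size;
    }
    // Otherwise it's the classic NUL/space-padded octal string.
    return octdec(trim($field));
}

Fed the intact 12-byte $field from the demo above, it returns 10737418240 (10GiB); fed the 1-character string that Z12size produces, it can only return 0.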

I'd need to understand all the details better before deciding how to approach fixing this properly.

mcdruid commented 2 weeks ago

I've reproduced the bug and fix/workaround (changing Z12size to a12size in the unpack format template) with a test archive containing a (sparse) 10G file.

The change to Z for more recent PHP versions was committed early in 2013: https://github.com/pear/Archive_Tar/commit/de7aade3fbd69f365969c0f34a89d177f94c451a

Some of the relevant docs seem to have disappeared, but it looks like we're interested in:

https://bugs.php.net/bug.php?id=63738

https://web.archive.org/web/20201127071942/https://www.php.net/manual/en/migration55.incompatible.php#migration55.incompatible.pack

The latter provides an example of how to maintain backwards-compatibility which closely resembles the change that was committed to Archive_Tar introducing the Z format.

https://github.com/pear/Archive_Tar/commit/184af112669f0257e3ca02b31b36e32c02c0a7b9 is the change we've looked at that implemented the parsing of > 8G sizes within tar headers; that was committed late in 2015.

You'd assume that last commit worked on archives containing large files at some point, in conjunction with the newer unpack format.

I don't know if there have been more changes in PHP's behaviour since then which mean we're seeing this bug today, or whether there's another explanation.

I'm not certain the change from a to Z was actually necessary, but I'm not confident enough that it wasn't to revert it outright.

Some testing on different PHP versions etc. may be enlightening.

@marcelklehr are you able to confirm whether changing the format to use a for the size resolves the issue for you?

mcdruid commented 2 weeks ago

I've submitted a PR that adds a test to check if the size of a large (> 8G) file can be read successfully.
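
The shape of the test is roughly this (a sketch only - not the PR's actual code; the fixture path and name are made up):

// Roughly the shape of such a test - a sketch, not the PR's actual code.
// Assumes a fixture tarball whose first entry is a sparse 10G file.
use PHPUnit\Framework\TestCase;

class LargeFileSizeTest extends TestCase
{
    public function testLargeFileSizeIsReadCorrectly()
    {
        $tar = new Archive_Tar(__DIR__ . '/fixtures/large.tar');
        $list = $tar->listContent();
        // 10GiB - beyond the 8GiB - 1 limit of an 11-digit octal field.
        $this->assertEquals(10737418240, $list[0]['size']);
    }
}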

Changing the unpack format back to a12size instead of Z12size fixes the problem in all PHP versions other than PHP 5.4, and doesn't make any of the other tests fail.

I'd be quite happy to drop support for anything older than PHP 5.6 personally.

@ashnazg are you able to review this issue and the PR please?

If you're happy with this direction, perhaps we could drop support for anything before PHP 5.6 and look at only having one format for the unpack.

There's a broader question of whether we ever needed to change from a to Z - we could test rolling the whole thing back, but it's probably less risky to just revert the one field that seems to be causing problems.

mcdruid commented 2 weeks ago

Perhaps the change to support the larger sizes was worked on under a pre-5.5 PHP, so was never affected by the Z unpack format.

marcelklehr commented 1 week ago

Thank you again for digging into this!!

are you able to confirm whether changing the format to use a for the size resolves the issue for you?

Yes, this resolves the issue for me!