marcelklehr opened 2 weeks ago
Thanks for the report.
I've not had a chance to try to reproduce this yet, but just noting that the models.tar.gz file is 7.6G.
My first guess would be that PHP is hitting memory exhaustion.
How much memory is allocated / available?
Are there any error messages other than the Uninitialized string offset 2 at ...?
What PHP version are we talking about here?
I was able to reproduce this with a quick test in a PHP shell (php -a):
php > include('/path/to/Archive_Tar/vendor/autoload.php');
php > include('/path/to/Archive_Tar/Archive/Tar.php');
php > $tarManager = new Archive_Tar('/tmp/models.tar.gz');
php > $tarManager->extract(__DIR__);
PHP Warning: Uninitialized string offset 2 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 3 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 4 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 5 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 6 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 7 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 8 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 9 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 10 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
PHP Warning: Uninitialized string offset 11 in /path/to/Archive_Tar/Archive/Tar.php on line 1790
php >
I was left with:
$ du -sh stable-diffusion-xl/
320M stable-diffusion-xl/
$ ls stable-diffusion-xl/
unet vae_decoder vae_encoder
...which seems like it's definitely not the full content of the archive.
That's on:
$ php -v
PHP 8.1.2-1ubuntu2.18 (cli) (built: Jun 14 2024 15:52:55) (NTS)
...with:
php > print ini_get('memory_limit');
-1
On a machine with 32G of memory, FWIW.
Extracting the same archive with the standard tar utility:
$ tar zxvf models.tar.gz
stable-diffusion-xl/
stable-diffusion-xl/vae_encoder/
stable-diffusion-xl/vae_encoder/model.onnx
stable-diffusion-xl/vae_encoder/config.json
stable-diffusion-xl/vae_decoder/
stable-diffusion-xl/vae_decoder/model.onnx
stable-diffusion-xl/vae_decoder/config.json
stable-diffusion-xl/unet/
stable-diffusion-xl/unet/model.onnx_data
stable-diffusion-xl/unet/model.onnx
stable-diffusion-xl/unet/config.json
stable-diffusion-xl/model_index.json
stable-diffusion-xl/tokenizer_2/
stable-diffusion-xl/tokenizer_2/vocab.json
stable-diffusion-xl/tokenizer_2/special_tokens_map.json
stable-diffusion-xl/tokenizer_2/tokenizer_config.json
stable-diffusion-xl/tokenizer_2/merges.txt
stable-diffusion-xl/tokenizer/
stable-diffusion-xl/tokenizer/vocab.json
stable-diffusion-xl/tokenizer/special_tokens_map.json
stable-diffusion-xl/tokenizer/tokenizer_config.json
stable-diffusion-xl/tokenizer/merges.txt
stable-diffusion-xl/text_encoder/
stable-diffusion-xl/text_encoder/model.onnx
stable-diffusion-xl/text_encoder/config.json
stable-diffusion-xl/scheduler/
stable-diffusion-xl/scheduler/scheduler_config.json
stable-diffusion-xl/text_encoder_2/
stable-diffusion-xl/text_encoder_2/model.onnx_data
stable-diffusion-xl/text_encoder_2/model.onnx
stable-diffusion-xl/text_encoder_2/config.json
$ du -sh stable-diffusion-xl/
13G stable-diffusion-xl/
So yup, looks like a bug.
I've only been able to have a quick look at this so far, but wondering if it might be some bad data inside the tarball - which presumably the standard tar utility is able to handle.
The problem seems to happen when Archive_Tar tries to process stable-diffusion-xl/unet/model.onnx_data
inside the archive.
(I am not an expert on this code and I may get some terminology and/or interpretation wrong.)
It tries to extract the metadata from the record:
$v_data = unpack($this->_fmt, $v_binary_data);
...and seems to get a bad/corrupt value for size:
size = "�"
Not surprisingly, it looks like things go bad from there.
I'm not sure if this really is a problem in the data, or the manifestation of Archive_Tar's inability to process the data properly.
I think it could be the former, but you could argue that it's both if other tools are able to successfully extract the archive.
After doing a few experiments (e.g. re-creating the archive with tar) it looks like maybe the data is okay, it's just that Archive_Tar cannot handle the very large file within the archive.
stable-diffusion-xl/unet/model.onnx_data seems to be close to 10G uncompressed.
I suspect it wasn't very common to work with files of that size in PHP when most of the code in this project was written.
I'll have a closer look at what happens when Archive_Tar tries to process this within the archive, but it looks like for some reason it's not able to parse the size of the file properly.
https://en.wikipedia.org/wiki/Tar_(computing)
... although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, in 2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.
Looks like that's the limitation that's being hit when Archive_Tar processes this archive.
So I think this is not so much a bug as a feature request to implement the extension to the tar format whereby files > 8G are supported.
At present, they are not.
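For reference, the parsing needed here can be sketched roughly as follows. This is a hypothetical helper, not Archive_Tar's actual code; it assumes PHP 7+ on a 64-bit platform, and covers the classic octal encoding plus the star/GNU base-256 extension described above:

```php
<?php
// Hypothetical helper: decode a tar header size field that is either
// classic octal (digits padded with NUL/space) or base-256 (high bit
// of the first byte set, remaining bytes big-endian).
function decodeTarSize(string $field): int
{
    if (ord($field[0]) & 0x80) {
        // base-256: clear the marker bit, then fold in the remaining
        // bytes big-endian (negative sizes not handled in this sketch)
        $size = ord($field[0]) & 0x7F;
        for ($i = 1, $n = strlen($field); $i < $n; $i++) {
            $size = ($size << 8) | ord($field[$i]);
        }
        return $size;
    }
    // classic encoding: octal digits padded with spaces/NULs
    return (int) octdec(trim($field, " \0"));
}

// 1 MiB fits comfortably in 11 octal digits...
echo decodeTarSize("00004000000\0"), "\n";  // 1048576
// ...but 10 GiB does not, so it needs the base-256 form
echo decodeTarSize("\x80\x00\x00\x00\x00\x00\x00\x02\x80\x00\x00\x00"), "\n";  // 10737418240
```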
Mmh, what about this PR? https://github.com/pear/Archive_Tar/pull/15 Thank you for investigating this!
Yeah that looks like it's implementing support for the right thing, but sadly it doesn't seem to be working properly.
More background info on tar formats:
https://www.gnu.org/software/tar/manual/html_node/Formats.html#Formats
As far as I can see, Archive_Tar does support ustar but possibly not posix (or at least the implementation doesn't work fully).
https://mort.coffee/home/tar/ looks like a pretty good write up (I think I recognise the author :) ).
I'm not sure yet whether the data that gets passed to the new \Archive_Tar::_tarRecToSize method has been incorrectly extracted from the header / archive, or whether there's a problem within the processing in the method itself.
As noted earlier, the extracted size certainly looks broken as represented in my IDE when I step through, but that may just be a quirk of the way the octal values are being displayed.
FWIW the $v_binary_data param that's being passed to \Archive_Tar::_readHeader looks like:
stable-diffusion-xl/unet/model.onnx_data 0000664 0001750 0001750 � d!� 14505263367 020003 0 ustar marcel marcel
I've run out of time to look at this for now, but will try to come back to it ASAP.
As far as I can see, this problem goes away if we revert to using the a format to unpack the value for size from the header:
if (version_compare(PHP_VERSION, "5.5.0-dev") < 0) {
$this->_fmt = "a100filename/a8mode/a8uid/a8gid/a12size/a12mtime/" .
"a8checksum/a1typeflag/a100link/a6magic/a2version/" .
"a32uname/a32gname/a8devmajor/a8devminor/a131prefix";
} else {
- $this->_fmt = "Z100filename/Z8mode/Z8uid/Z8gid/Z12size/Z12mtime/" .
+ $this->_fmt = "Z100filename/Z8mode/Z8uid/Z8gid/a12size/Z12mtime/" .
"Z8checksum/Z1typeflag/Z100link/Z6magic/Z2version/" .
"Z32uname/Z32gname/Z8devmajor/Z8devminor/Z131prefix";
}
I'm not yet exactly sure why - it seems that Z behaves slightly differently in a way that corrupts the size data when it's been formatted differently to accommodate a > 8G file.
https://mort.coffee/home/tar/ says:
If a file is over 8GiB, it will set the high bit of the first character in file_size, and the rest of the string is parsed as base 256 (i.e it's treated as a 95-bit integer, big endian).
...and that does seem to be what the \Archive_Tar::_tarRecToSize method tries to do, but it looks like the data's not extracted correctly to allow that to work, AFAICS.
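The truncation can be demonstrated in isolation. A base-256 size field legitimately contains NUL bytes, and PHP's Z unpack code stops at the first NUL while a returns the full fixed-width field. A minimal sketch of that difference (assumes PHP 5.5+, where both codes exist):

```php
<?php
// A 10 GiB size encoded base-256 in 12 bytes: high bit of the first byte
// flags the encoding, the remaining bytes hold the value big-endian.
$size  = 10737418240;
$field = '';
for ($i = 0; $i < 12; $i++) {
    $field = chr($size & 0xFF) . $field;
    $size >>= 8;
}
$field[0] = chr(ord($field[0]) | 0x80); // set the base-256 marker bit

// 'Z' stops at the first NUL byte; 'a' keeps the whole 12-byte field.
$withZ = unpack('Z12size', $field)['size'];
$withA = unpack('a12size', $field)['size'];

var_dump(strlen($withZ)); // int(1)  -- everything after the marker byte is lost
var_dump(strlen($withA)); // int(12) -- the field survives intact
```

Which would explain why extraction works with a12size in the format string but not with Z12size.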
I'd need to understand all the details better before deciding how to approach fixing this properly.
I've reproduced the bug and fix/workaround (changing Z12size to a12size in the unpack format template) with a test archive containing a (sparse) 10G file.
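In case it's useful to others, here's one way such a test archive can be built cheaply (assumes GNU truncate/tar; the filenames are made up):

```shell
# Create a 10 GiB sparse file (apparent size 10G, almost nothing on disk),
# then archive it; gzip keeps the tarball small since the content is all zeros.
truncate -s 10G big_file
tar -czf big.tar.gz big_file

# The size field starts at offset 124 of the 512-byte header; for a > 8G
# member, GNU tar switches to base-256 and sets the high bit (0x80) there.
zcat big.tar.gz | od -An -tx1 -j124 -N1
```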
The change to Z for more recent PHP versions was committed early in 2013: https://github.com/pear/Archive_Tar/commit/de7aade3fbd69f365969c0f34a89d177f94c451a
Some of the relevant docs seem to have disappeared, but it looks like we're interested in:
https://bugs.php.net/bug.php?id=63738
That bug report provides an example of how to maintain backwards-compatibility which closely resembles the change that was committed to Archive_Tar introducing the Z format.
https://github.com/pear/Archive_Tar/commit/184af112669f0257e3ca02b31b36e32c02c0a7b9 .. is the change we've looked at that implemented the parsing of > 8G sizes within tar headers; that was committed late in 2015.
You'd assume that last commit worked on archives containing large files at some point, in conjunction with the newer unpack format.
I don't know if there have been more changes in PHP's behaviour since then which mean we're seeing this bug today, or whether there's another explanation.
I'm not certain the change from a to Z was actually necessary, but I'm not confident enough that it wasn't to revert that change wholesale.
Some testing on different PHP versions etc. may be enlightening.
@marcelklehr are you able to confirm whether changing the format to use a for the size resolves the issue for you?
I've submitted a PR that adds a test to check if the size of a large (> 8G) file can be read successfully.
Changing the unpack format back to a12size instead of Z12size fixes the problem in all PHP versions other than PHP 5.4, and doesn't make any of the other tests fail.
I'd be quite happy to drop support for anything older than PHP 5.6 personally.
@ashnazg are you able to review this issue and the PR please?
If you're happy with this direction, perhaps we could drop support for anything before PHP 5.6 and look at only having one format for the unpack.
There's a broader question of whether we ever needed to change from a to Z - we can try testing rolling the whole thing back, but it's probably less risky to just revert the one field we know seems to be causing problems.
Perhaps the change to support the larger sizes was worked on in PHP 5.5 or thereabouts, so was never affected by the Z unpack format.
Thank you again for digging into this!!
are you able to confirm whether changing the format to use a for the size resolves the issue for you?
Yes, this resolves the issue for me!
File is available here: download.nextcloud.com/server/apps/text2image_stablediffusion/stable-diffusion-xl-base-1.0
Expected behavior: should extract all folders.