Open sebastianbergmann opened 11 months ago
CC @drupol
Interesting idea. I'll have a look at it.
I was hoping to not having to touch phpab anymore ;-)
Once again, my fault ... Sorry đ
No worries. It's a very reasonable request :)
I just tried to implement this - and failed.
The current implementation uses iterators to add the files into the phar. Should be fairly straight forward to hook into this.
The docs on phar::buildFromIterator
state an SplFileInfo
should be returned. So I implemented my own wrapping IteratorIterator
that is (verifiably) constructing my custom SplFileInfo
for each entry.
But none of my overwritten methods are ever called.
I do not currently see any means to control the meta data in the phar at build time. The cli tool mentioned does a binary patch of the raw data. It's not actually using the phar API.
I'm not sure what to do here atm..
Okay, reading the php-src (https://github.com/php/php-src/blob/master/ext/phar/phar_object.c) helped shedding some light.
If I understand the inline docs for ::buildFromIterator
correctly, ext/phar
is always, no matter what, using the info from the actual file being added. I have no idea why there is support for SplFileInfo
at all, as nothing is actually used from it.
There are basically two ways - despite the many different APIs - to add things into the phar: By having ext/phar
get pretty much everything from the actual source file, or by using addFromString
- which does not provide any means to set timestamps.
So, looks like for the time being, there is nothing I can do here - at least not by using the ext/phar
-API: The time stamp has to be either "correct" before the file gets added into the phar or we have to patch the created phar file.
I could, in theory, include Jordi's library and introduce this as a post-processing step. But that's not much different from what you do now already.
Thank you for looking into this.
Maybe @nielsdos, who has done a bit of work on ext/phar
recently, can help? It would be great if starting with PHP 8.4 it would be possible to set a timestamp for all files that are added to a PHAR using ext/phar
.
@sebastianbergmann how should the API look like? Something like setTimestamp(string $filename, int|DateTimeInterface $dateTime)
? That seems doable.
Thank you for your reply, Niels. To be honest, I have not really thought about the API for this yet. The first thing that came to mind was an optional DateTimeImmutable
argument for Phar::buildFromIterator()
to set the timestamp for all files that are added to that timestamp. Or a Phar::setTimestampForAllFiles(DateTimeImmuable)
method.
What do you think, Arne?
CC @seldaek as this could, eventually, hopefully, at some point in the future, make Seld\PharUtils\Timestamps
superfluous.
I'm not yet convinced that having a specific API for timestamps on ext/phar
would be the best way to approach this, as that's rather limiting. Technically, there can be a lot of properties on a file or directory additionally to "only" the timestamp (which one actually? mtime? atime? ...?).
The most obvious thing for me to control meta data was to implement my own SplFileInfo
using my own Iterator
. It's rather unexpected that none of the methods of my object instance are called. Maybe that could be changed?
If we want to enhance the ext/phar
API, we could consider adding an (optional) SplFileInfo
to addFromString
to control meta data as there currently is no way at all to specify properties for the respective entry.
Hello,
Glad to see this conversation happening, as the outcome of this is a definitive improvement for reproducible PHARs.
Adding @theofidry which might be interested to participate.
What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.
The most obvious thing for me to control meta data was to implement my own
SplFileInfo
using my ownIterator
. It's rather unexpected that none of the methods of my object instance are called. Maybe that could be changed?
I think we probably should make this work. If you still have the reproduction test case please share it. Also, this is imo orthogonal to improving the API for Phar to allow overriding metadata.
If we want to enhance the
ext/phar
API, we could consider adding an (optional)SplFileInfo
toaddFromString
to control meta data as there currently is no way at all to specify properties for the respective entry.
Not a fan of this particular idea. SplFileInfo is tied to an actual file normally, so that seems like abusing the class for the purpose of holding metadata.
Hi!
What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.
I do not know if there is other metadata that needs to be manipulated.
From a reproducibility perspective the only thing I'm missing is rather clarification of what can cause a PHAR give different signatures.
We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different.
So to me it would be great to know exactly what defines the signature and, provided built the PHAR from the exact same source and the timestamp and algorithm are set, what could still make it that the built PHARs are not identical.
I assume the system filesystem could be a factor, or LC_COLLATE
, or the compression algorithm that may not be deterministic? Would be great to have more information there.
A bit unrelated so I don't want to expend to much on it here, but I'm trying to compile some improvements I would like to see in the PHAR extension in https://github.com/box-project/box/issues/1154 (it is entirely unpolished and probably missing some stuff still).
Back to the topic though, I would be more than content with Phar::setTimestamps(DateTimeImmutable): void
, but it needs to respect/account for the signing algorithm. The problem is already solved in user-land with https://github.com/Seldaek/phar-utils (which has already been mentioned). The entire code in Box for it currently looks like this:
// src/Box.php
public function sign(
SigningAlgorithm $signingAlgorithm,
?DateTimeImmutable $timestamp = null,
): void {
if (null === $timestamp) {
$this->phar->setSignatureAlgorithm($signingAlgorithm->value);
return;
}
// Ensure all the changes done to the PHAR are "committed" before we manipulate the file
$phar = $this->phar;
$phar->__destruct();
unset($this->phar);
$util = new \Seld\PharUtils\Timestamps($this->pharFilePath);
$util->updateTimestamps($timestamp);
$util->save(
$this->pharFilePath,
$signingAlgorithm->value,
);
$this->phar = new Phar($this->pharFilePath);
}
I do not know if there is other metadata that needs to be manipulated.
File and directory permissions can be manipulated.
We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different.
That would be also amazing if we could fix that at the same time.
What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.
I actually don't remember exactly what the phar file adding preserves in terms of metadata. In theory, it could be pretty much everything that belongs to a file.
The most obvious thing for me to control meta data was to implement my own
SplFileInfo
using my ownIterator
. It's rather unexpected that none of the methods of my object instance are called. Maybe that could be changed?I think we probably should make this work. If you still have the reproduction test case please share it. Also, this is imo orthogonal to improving the API for Phar to allow overriding metadata.
I might still have it as a PoC but I can also create a new one. It wasn't not that much code. :)
If we want to enhance the
ext/phar
API, we could consider adding an (optional)SplFileInfo
toaddFromString
to control meta data as there currently is no way at all to specify properties for the respective entry.Not a fan of this particular idea. SplFileInfo is tied to an actual file normally, so that seems like abusing the class for the purpose of holding metadata.
Point taken - but only if you want to add something that is not resembling a file within the phar. Otherwise it would make perfect sense to me. The reason I was suggesting this is we then have a symmetric API. It feels a bit odd to have SplFileInfo
when using iterators and something else when adding by string. But looking at the phar API, there are more methods that would need touching to accomplish this. So, maybe, it indeed is not a good idea ;)
The reason I'm not happy with the "generic" setTimestamp approach is that it would only cater the single use case of reproducibility - making the assumption that every entry in the archive should be having an identical timestamp and we want to explicitly ignore any other meta data that may be set within the phar for a file or directory.
So maybe, we need to spent some more time on thinking about what we actually want to accomplish first - not limiting us to the simplest solution that would "merely" make reproducibility work. And then what the best API is. And what to do with regards to BC. (Maybe the answer to the later turns out to be a wrapper "ReproduciblePhar" ;) )
@theofidry
From a reproducibility perspective the only thing I'm missing is rather clarification of what can cause a PHAR give different signatures.
We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different.
Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.
A bit unrelated so I don't want to expend to much on it here, but I'm trying to compile some improvements I would like to see in the PHAR extension in box-project/box#1154 (it is entirely unpolished and probably missing some stuff still).
Is this an RFC you're going to pursue yourself? And are you able to do the implementation yourself or do you need help?
@theseer
I might still have it as a PoC but I can also create a new one. It wasn't not that much code. :)
Well, feel free to share something and I'll take a look.
So maybe, we need to spent some more time on thinking about what we actually want to accomplish first - not limiting us to the simplest solution that would "merely" make reproducibility work. And then what the best API is. And what to do with regards to BC. (Maybe the answer to the later turns out to be a wrapper "ReproduciblePhar" ;) )
Right, if someone can come up with a proposal I can do the implementation.
Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.
Just to clarify: this is not possible at the moment right?
IIRC I tried to use addFile()
or addFromString()
at some point but either it was not enough or it was much slower than using buildFromDirectory()
.
Is this an RFC you're going to pursue yourself? And are you able to do the implementation yourself or do you need help?
0 chance I can implement any of them đ At the moment those are just very rough notes on things that came to my mind so they need to be fleshed out a bit more and would also be good to review what would be really helpful with the PHAR building process or PHAR manipulation. Happy to spend more time on it and create an RFC for it if you or someone can actually implement them.
Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.
Having deterministic ordering by default would for sure be nice. In Composer we do build the phar using a sorted iterator, and we do add everything via addFromString so yes it would also be nice to have a way to specify metadata in there (I guess full access to file modes, mtime, or whatever ends up actually written in the archive), then we would not have to use phar-utils to overwrite timestamps after the fact.
I think my ideal API would be something like new params for addFromString (and possibly addFile too to override file properties), like addFromString($path, $contents, $mtime = null, $mode = null)
. That'd be easy enough to use.
If you're using buildFromDirectory or from iterator, then it's a bit more complex to override a specific file's info, but I think forcing someone to use the addFile/addFromString to do that would be reasonable.
Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.
Just to clarify: this is not possible at the moment right?
You can do this with your own code.
0 chance I can implement any of them đ At the moment those are just very rough notes on things that came to my mind so they need to be fleshed out a bit more and would also be good to review what would be really helpful with the PHAR building process or PHAR manipulation. Happy to spend more time on it and create an RFC for it if you or someone can actually implement them.
I can help with the implementation.
Having deterministic ordering by default would for sure be nice.
Operating systems generally don't make a guarantee about filesystem order. You'll have to do this yourself.
I think my ideal API would be something like new params for addFromString (and possibly addFile too to override file properties), like addFromString($path, $contents, $mtime = null, $mode = null). That'd be easy enough to use.
I disagree. If you have to add those parameters to different APIs, it's better to just provide separate APIs to override that metadata where the doesn't care about how a file was added.
I disagree. If you have to add those parameters to different APIs, it's better to just provide separate APIs to override that metadata where the doesn't care about how a file was added.
Good point, it's such a rarely used api tho i don't really mind how it looks i have to say. As long as it works and it's documented anything will be great :)
Operating systems generally don't make a guarantee about filesystem order. You'll have to do this yourself.
I'm aware i just think the phar could sort files at least when creating from iterator and maybe even sort whatever is added by filename but i can see that's probably too opinionated so I can also live with the current situation for sure.
You can do this with your own code.
@nielsdos I'll have to double check, but IIRC using this over using buildFromDir()
was like x10 slower. I'll bench it once I get back home
@theofidry I have done performance improvements to different extensions of PHP before, if you give me code I can check what the bottleneck is and try to improve it.
Cool, will try to check it ASAP
@nielsdos Here's a reproducer for the "no method called on SplFileInfo":
<?php declare(strict_types=1);
class MySplFileInfo extends SplFileInfo {
public function getPathname(): string {
echo "[Pathname]";
return '';
}
public function getMTime(): int {
echo "[MTime]";
return 0;
}
public function getCTime(): int {
echo "[CTime]";
return 0;
}
public function getATime(): int {
echo "[ATime]";
return 0;
}
}
class MyIterator extends RecursiveDirectoryIterator {
public function current(): SplFileInfo {
echo "[ Found: " . parent::current()->getPathname() . " ]\n";
return new MySplFileInfo( parent::current()->getPathname());
}
}
$workdir = uniqid('/tmp/');
mkdir($workdir . '/content', recursive: true);
file_put_contents($workdir . '/content/hello.txt', "Hello world.");
$phar = new \Phar($workdir . '/test.phar');
$phar->startBuffering();
$phar->buildFromIterator(
new RecursiveIteratorIterator(
new MyIterator($workdir . '/content', FilesystemIterator::SKIP_DOTS)
),
$workdir
);
$phar->stopBuffering();
$result = new \Phar($workdir . '/test.phar', 0, 'test.phar');
var_dump($result['content/hello.txt']);
This produces the following output on my system (PHP 8.3.6):
$ php iterator.php
[ Found: /tmp/6636b783648e1/content/hello.txt ]
object(PharFileInfo)#3 (2) {
["pathName":"SplFileInfo":private]=>
string(53) "phar:///tmp/6636b783648e1/test.phar/content/hello.txt"
["fileName":"SplFileInfo":private]=>
string(9) "hello.txt"
}
@theseer I've made a draft PR that fixes your problem here: https://github.com/php/php-src/pull/14143 It adds support for having custom SplFileInfo objects. Because Phars only have one timestamp, and it represents the modification time, only getMTime() will be called in your example and that will be the timestamp Phar uses for the file. It also supports overriding getPathname() which is likely only useful for directory iterators. If you can, please try out the PR and check if I have forgotten anything. But I think buildFromIterator() is the only API that actually handles SplFileInfo objects as input unless I'm mistaken.
Created the benchmark here. Will publish the results tomorrow off to bed.
I however realise I cannot simply switch to addFromString()
in Box as the content changes (e.g. I execute composer dump-autoload
) so using this method would require to do some additional FS calls. I guess I would have to use buildFromIterator()
instead (for deterministic results).
Here are the results:
with PHP version 8.3.1, xdebug â, opcache â
\PharBuildBench\BuildFromStringBench
bench...................................I9 - Mo20.496s (±3.04%)
\PharBuildBench\BuildFromDirBench
bench...................................I9 - Mo128.825ms (±8.87%)
\PharBuildBench\BuildFromStringWithBufferingBench
bench...................................I9 - Mo18.971s (±0.88%)
Yes composer phar build is slow too due to addFromString but imo that's normal and even stated in the docs:
Note: Phar::addFile(), Phar::addFromString() and Phar::offsetSet() save a new phar archive each time they are called. If performance is a concern, Phar::buildFromDirectory() or Phar::buildFromIterator() should be used instead.
The only way to speed this up would be a new api which buffers and requires explicit save() to commit changes to disk. But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle.
The only way to speed this up would be a new api which buffers and requires explicit save() to commit changes to disk. But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle.
I thought that was what the start buffering and stop buffering was about...
But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle.
I agree to a degree, feels like a jump from 120ms to 20s is... meh. Not critical, neither high priority for sure though yes.
It's indeed the fact that the phar is flushed that makes it slow. But the surprising detail is that it's not the writing of the file itself, it's the computation of the signature. The signature computation takes 95% of runtime according to perf profiling... Indeed it is strange though that addFromString does not respect buffering. Maybe it should, but there are probably people that depend on the current behaviour either knowingly or unknowingly, so even though it would be a simple change to make it respect buffering, I am reluctant to do so. Not sure what to do here.
Ok interesting.. is it not respecting buffering at all? In a quick test here it seems to me like it is not writing to the file until the end when buffering, but it is perhaps updating the signature on every file add? Perhaps that needs fixing so signature is only updated if getSignature is called, or if stopBuffering/__destruct is called?
Maybe it should, but there are probably people that depend on the current behaviour either knowingly or unknowingly
This sounds like really something that should be able to be fixed if it is really broken, sure it might break some odd use case but it's clearly not working as advertised and the users of phar-writing aren't all that many IMO.
Ok interesting.. is it not respecting buffering at all?
Nope, it writes the Phar file, just not to the file path you provided.
In a quick test here it seems to me like it is not writing to the file until the end when buffering
This is actually wrong, a Phar file with signature and contents is written in the /tmp
directory when buffering is enabled and using addFromString. Why? I'm not sure why phar_flush still does this but I think it's because some operations internally expect to work on a file. And I think the reason addFromString does not respect buffering is simply an oversight.
Maybe it should, but there are probably people that depend on the current behaviour either knowingly or unknowingly
This sounds like really something that should be able to be fixed if it is really broken, sure it might break some odd use > case but it's clearly not working as advertised and the users of phar-writing aren't all that many IMO.
Perhaps, this would need discussion on the ML and in the worst case an RFC.
It's actually a trivial code change to fix this, it's just adding a check to see if EDIT: debatable
The problem is also that, in typical PHP fashion, the number of helpful comments in ext/phar can be counted on one hand and the test coverage is not amazing. Therefore, understanding why something was made the way it was, is often hard to understand.donotflush
is set.
The donotflush
was implemented in https://github.com/php/php-src/commit/8de7bd61bf5dcbb51df99c3bd5d3c830e216bdf1 and at that time it did actually have the behaviour of not writing a phar at all.
Actually, it turns out almost any operation can create the temporary file. The fact the temporary file is written is actually related to handling compressed files within the phar. This entire performance issue is related to https://github.com/php/php-src/issues/13055
The only thing that was missing for the build of PHPUnit's PHAR to become reproducible was to set the timestamps of the files in the PHAR to a consistent date.
I have implemented that in https://github.com/sebastianbergmann/phpunit/commit/4627215b9cf045999d1afcdfbfe7299f62658217 using seld/phar-utils.
However, I do not like that the PHAR is now changed after it has been created using PHPAB.
It would be nice if PHPAB would do this while it creates the PHAR. What do you think, Arne?