Allow adding files being appended to by another process.

shimile commented 1 year ago

I am using libzip5 through PHP's zip PECL extension, to create a zip file collecting hundreds, probably more than a thousand files, into a ZIP file.

Last night, our CI collected libzip5-1.10.0, introduced into remi-repo; Since then, the code is failing zip close with error 'Unexpected length of data'.

Unfortunately, I cannot give a reproducer code, as a mini-reproducer, passes. It is probably in one of the inputs of one of the 1000+ files that is triggering the bug - but - the message I get has no path for me to figure out which of the files is that. I do not know if the lack of the path causing the error is an omission of the library or the wrapping PHP extension I'm using.

I've examined the changes made to libzip between my previously installed version and the newly installed one, and found:

--- libzip-1.9.2/lib/zip.h      2022-06-28 17:17:12.000000000 +0300
+++ libzip-1.10.0/lib/zip.h     2023-06-23 17:27:16.000000000 +0300
@@ -135,6 +148,8 @@
 #define ZIP_ER_TELL 30            /* S Tell error */
 #define ZIP_ER_COMPRESSED_DATA 31 /* N Compressed data invalid */
 #define ZIP_ER_CANCELLED 32       /* N Operation cancelled */
+#define ZIP_ER_DATA_LENGTH 33     /* N Unexpected length of data */

And:

--- libzip-1.9.2/lib/zip_source_crc.c   2022-06-28 17:17:12.000000000 +0300
+++ libzip-1.10.0/lib/zip_source_crc.c  2023-06-23 17:27:16.000000000 +0300
@@ -141,6 +141,10 @@
         st = (zip_stat_t *)data;

         if (ctx->crc_complete) {
+            if ((st->valid & ZIP_STAT_SIZE) && st->size != ctx->size) {
+                zip_error_set(&ctx->error, ZIP_ER_DATA_LENGTH, 0);
+                return -1;
+            }

So, if the failure is in the new code (assuming I'm barking at the correct tree), it could have not happened before.

At first, I opened an issue on the PHP PECL extension GitHub (even though I knew I'm looking at an issue in the lib itself, given the above code change), thinking that maybe they can provide me with a way to know which file is causing the error (which I am not even sure I fully understand - is that a CRC checksum issue?), so I can maybe provide a minimal reproducer, but one of the leads of the projects sent me here. See: https://github.com/pierrejoye/php_zip/issues/34

I'll note, that if this new code testing for an issue (in CRC?) is correct with something being wrong, I cannot say that in the real world I've experienced any issue: The ZIP files produced until this upgrade all seem to have worked perfectly, and no corruptions in any of the ~1000 files in the produced archive were encountered in the 1+ years this code has been running.

Happy to try out things as much as I can...

dillof commented 1 year ago

Where do you get the file data from?

The error means that the source added to the zip archive reports a file size that does not correspond to the amount of data it provides.

shimile commented 1 year ago

Where do you get the file data from?

The error means that the source added to the zip archive reports a file size that does not correspond to the amount of data it provides.

So, in other words, libzip is now unable to compress a log file that is being appended to while the zip is being created?

If so, surely there is a flag to revert to the behavior of (probably?) all other archivers in the universe?

dillof commented 1 year ago

I could really do without the attitude. libzip is a hobby project we put a lot of time and effort into, both developing and supporting its users.

Ensuring data integrity is a high priority for us. We had another project where files were read incompletely due to a software bug, and this now catches that and makes sure files are not silently truncated.

Your use case was caught as a side effect of it. I'm sure we'll find a solution.

shimile commented 1 year ago

Sorry about that.

I do appreciate data integrity, but failing to produce the zip at all is also kind of a data integrity problem. It's actually not my first similar issue with libzip: I also had a hard time trying to add directories that had files removed from them inflight (due to log rotation that by murphy law always happens at the time I do logs collection into ZIP). Apparently 'add file' doesn't really fopen() the file (?), just later on 'close', if the file is removed after add and before close - the zip is also gone with a cryptic 'file not found' (which?). Since I found no solution to this, I had to orchestrate a somewhat crazy solution - each file I want to add to the ZIP, I first create a hardlink to it in an ancestor directory so it can NEVER be removed, and then after zip close I remove all the hardlinks, with the hope I don't have system crash in the middle that leaves orphaned links to old log files.

Really, failure to opening a file should produce a warning and maybe even return an error, but why kill the archive? Likewise for size change. That's what tar does, it outputs something like "file changed as we read it", something like that.

And there should be a flag for people who want to be strict and fail on every issue.

dillof commented 1 year ago

The archive isn't killed, it just isn't closed. You can make modifications and try again.

You're right that we should provide more detailed error information so you can recover. In your case, remove the file from the archive and close it again.

If you were using libzip directly (without the intervening PHP layer), you could open the file yourself and pass the FILE pointer to zip_source_fliep(). If libzip opened all files when you added them, you could easily run out of file descriptors on system with a low limit (libzip is also used in embedded systems).

It is hard to find a solution that works well in all cases, and operating on changing files is definitely an edge case, so the defaults may not be best for you. If we find solutions that are generally useful and fit into the design of libzip, we're more than happy to accommodate you.

shimile commented 1 year ago

I am sure it is used in embedded systems too, maybe even for the same purpose I am using it (among other things) - to compress log files :)

This can also be a flag that you can enable, I don't think that having a flag that makes the lib easy to use in more use cases (or I even dare say: "out of the box") should conflict with the design of any software. Look at the user comments in https://www.php.net/manual/en/ziparchive.close.php - surely I am not the first to try to do such things and fail...

So, to sum it up, do I understand correctly that at least for the time being, this new behavior is not going to be reverted? That would mean I now have to fork libzip, remove the new code and then start to manually track upstream for changes...

0-wiz-0 commented 1 year ago

You don't have to fork libzip. If you have a source that provides a size but can't be trusted, layer another source on it that passes through all commands and just acts on ZIP_SOURCE_STAT to invalidate the size member.

shimile commented 1 year ago

You don't have to fork libzip. If you have a source that provides a size but can't be trusted, layer another source on it that passes through all commands and just acts on ZIP_SOURCE_STAT to invalidate the size member.

I don't think I can do that, as I do not interface with libzip directly. So, that would require me to change and recompile the wrapper; and if I'm already changing and recompiling something, it is a cleaner solution to simply remove the newly introduced bug (from my POV), than re-factoring another code to avoid hitting the new bug...

By the way, I wouldn't call this "provides a size but can't be trusted". I trust the size reported very much - I did not see an evidence that the filesystem is broken. It's just that the new so-called feature assumes that a size of a file cannot change once you opened it. It may be true in Windows(TM), which I think blocks other readers to a file while it is open for a write by another process, but I think[] it has never been true on nix, since its' inception, ~50 years ago. As such, code that runs on *nix shouldn't assume that (or at least, should provide an option to not assume that), even if in some other project there was some other bug.

[*] it is older than me, so I don't know for sure, but I don't see why it shouldn't, it's the Windows behavior that doesn't make sense in that regard

0-wiz-0 commented 1 year ago

Perhaps the easiest change is to pass a length to zip_source_file()(not -1) when you call it - the length of the file at the time you add it. That will probably fix the problem you're seeing.

shimile commented 1 year ago

I actually tried that, as it was suggested by an author in the php wrapper (https://github.com/pierrejoye/php_zip/issues/34#issuecomment-1611442386), however, that did not seem to have any effect. Maybe I missed something, maybe the changing size is not even the issue! who knows... without the error message saying which file caused the failure, I can't even try and isolate it. All I know, is that when I downgrade back, everything starts to work 100%.

Possibly, the code just checks file size before and after the file is opened, and fails if it changed. The amount of bytes asked to be read wouldn't effect the result, if that is the case. Mind you, even if the whole size was automagically calculated by libzip - it should have worked - I am pretty sure my files cannot actually truncate while in motion. So even specifying the real file size wouldn't ever cause a case where you try to read a byte beyond the filesize max value.

dillof commented 1 year ago

I looked at the code, and zip_source_file should ignore data that is added to the file after the source has been created. Obviously, it has a bug, I'll look into it.

zip_source_file() is not equivalent to fopen(). Rather, it creates a source that represents the file as it exists at that time.

This has nothing to do with Unix vs. Windows (we're also Unix developers), but with data integrity. I know you don't seem to care if data is lost, but we do. The reason we added this change was that we actually lost data while adding it to a zip archive, it was not just some hypothetical scenario. And not losing data is one of the main design goals of libzip, so this is non-negotiable.

shimile commented 1 year ago

I do care for data integrity just like you. I just don't understand how adding more data (or even reading until the filesize at the time of open! so no data is added), means that data is lost. If you would check if the file was truncated, i.e. became smaller than it was comparing to the time it was added, then I would agree more with you (but would still permit this with a flag for whomever may be interested). But we're NOT talking about truncating now.

dillof commented 1 year ago

We will fix the bug so zip_source_file doesn't return more data than it reports as its size (i.e. the file size at the time the source was created). Also, we'll add another special value to the length parameter (-2, but we'll add a symbolic name for better readability) that instructs zip_source_file to ignore the file length (i.e. not report it) and read until EOF is reached.

I really don't appreciate your tone of voice and your implications that our point of view is invalid and violating the spirit of Unix that is older than all of us. Like most thing in engineering, this is a question of priorities and tradeoffs, and we obviously judge the factors and their implications on libzip as a whole differently in this case.

In your specific case, it is clear what you want to happen, and I understand that. However, libzip is used in a lot of different situations with a lot of different data sources, and any assumption about what inconsistencies mean will be wrong for some of them. That is why we treat any inconsistency as a potential for data loss. zip_close doesn't know where the data is coming from, and while additional data might be okay when reading from a file, it might indicate data corruption e.g. when read from the network or a compressed stream. Adding a special case to the already very complicated logic of zip_close has its own cost and probably breaks the abstraction sources provide.

Your arguments are quite absolute and confrontational. They don't seem to leave room for other situations or priorities. That makes your bug report much harder to deal with than it has to be. Quite frankly, if it weren't for improving libzip, I would have left it unanswered. libzip is my hobby, and I don't need this kind of stress.

remicollet commented 1 year ago

If you were using libzip directly (without the intervening PHP layer), you could open the file yourself and pass the FILE pointer to zip_source_fliep(). If libzip opened all files when you added them, you could easily run out of file descriptors on system with a low limit (libzip is also used in embedded systems).

FYI: New zip extension will have a flag (OPEN_FILE_NOW) to use zip_source_filep instead of zip_source_file

shimile commented 1 year ago

Your arguments are quite absolute and confrontational. They don't seem to leave room for other situations or priorities. That makes your bug report much harder to deal with than it has to be. Quite frankly, if it weren't for improving libzip, I would have left it unanswered. libzip is my hobby, and I don't need this kind of stress.

I think that by mentioning the word 'flag' about 5 times throughout this bug report, implies that I do see other situations, and think about others too; I'm all for choice! That is the point I have been trying to make all along. The change as it has been done so far, leaves no choice.

I understand and appreciate that libzip is your hobby - and I know how it is, I have my own side-projects as well! However it is still open source (thanks for that!) and free to use even commercially... that means that people rely on your code - of course you don't owe us nothing - but if you're like me, and I think you are, I'm sure you'ld want your code to be used by as many people as possible.

Again I'll repeat: I DO NOT think your point of view is invalid! I see your use case too! My only claim is that your use case is definitely not the only use case, which is why I gave decades-old archivers (such as tar[*]) as example.

[*] Case in point: When tar encounters a file that changed 'while it is being read', it emits: 'error: file changed as we read it' - and sets a non-zero exit code, to indicate there was an issue in backup. However, some people may be OK with that (i.e. if they back up logs) - and that's why tar has a flag: --ignore-failed-read, to make this a non-error. And what I was asking in my very first comment, was something similar... because zipping live logs is, I think, also a quite common use-case. True, tar-gzing or tar-xzing them is probably MORE common, but some people have to generate archives better-suited for Windows people, and libzip, being quite powerful, will sometimes be used for that, too.

dillof commented 1 year ago

Again I'll repeat: I DO NOT think your point of view is invalid! I see your use case too! My only claim is that your use case is definitely not the only use case, which is why I gave decades-old archivers (such as tar[*]) as example.

Well, you certainly seemed to imply it (emphasis added):

If so, surely there is a flag to revert to the behavior of (probably?) all other archivers in the universe?

the new so-called feature

makes the lib easy to use in more use cases (or I even dare say: "out of the box") should conflict with the design of any software

You're clearly angry about this change (which I get) and are letting this come through quite clearly, which I really don't appreciate. While I don't agree with all of them, I have no problem with the factual points you raise. My problem is with how your words come across: aggressive and hostile.

I'm not implying that's how you meant them, but it is how they make me feel when reading them. If you did not mean them that way, please take a moment to step back, read them again, and see how they might seem that way, so you might avoid this issue in the future.

shimile commented 1 year ago

Sorry that this how you received it, it wasn't my intention.

Finally, may I suggest that this new behavior would be documented in the NEWS file (which I checked before opening the bug). It would have saved me debugging time, and may save debugging time for others in the future.

dillof commented 1 year ago

Finally, may I suggest that this new behavior would be documented in the NEWS file (which I checked before opening the bug). It would have saved me debugging time, and may save debugging time for others in the future.

Good point. We didn't foresee this impact of the change, otherwise we would have mentioned it. We'll add an entry in the next release, together with the new special length value.

dillof commented 1 year ago

I've now implemented the special value ZIP_LENGTH_UNCHECKED (-2) for length. If it is passed to zip_source_file(), the file is treated as if its size is unknown. Please let me know if this solves your issue.

However, if the file grows between creating the source and reading the file, the additional data should be ignored and no error should be raised. I don't quite understand why you ran into this issue, unless a file was truncated.

remicollet commented 1 year ago

However, if the file grows between creating the source and reading the file, the additional data should be ignored and no error should be raised. I don't quite understand why you ran into this issue, unless a file was truncated.

Notice that I was unable to reproduce this issue, see test in https://github.com/pierrejoye/php_zip/issues/34#issuecomment-1611482121 so I suspect something else...

remicollet commented 12 months ago

For your information, an easy way to reproduce is adding files from /proc on Linux

I've added a new test in the pecl extension for this to detect any regression https://github.com/pierrejoye/php_zip/commit/78cdb65c64f87af867029c709b27e9c0cd6d0e8a

And indeed, without the LENGTH_UNCHECKED option it fails with "Unexpected length of data"

(perhaps may be also test here, but I'm not familiar with your test system)

nih-at / libzip

Allow adding files being appended to by another process. #390