zip_entries_delete fails for in memory stream zip

prot0man commented 8 months ago

It looks like zip_entries_delete fails to delete files for in memory stream zips because pState->m_pFile is always NULL for them. Some example code to trigger the issue is:

    int result = 0;
    const char outfile_path[] = "myoutfile";
    char file1_name[] = "file1";
    char file1_data[] = "sometestdataforfile1";
    char file2_name[] = "file2";
    char file2_data[] = "sometestdataforfile2";
    char file3_name[] = "file3";
    char file3_data[] = "sometestdataforfile3";
    char *del_entries[] = {file2_name};
    uint8_t *zdata = NULL;
    size_t zdata_len = 0;
    struct zip_t *pzip = NULL;
    ssize_t num_del_entries = 0;

    pzip = zip_stream_open(NULL, 0, ZIP_DEFAULT_COMPRESSION_LEVEL, 'w');
    if(!pzip) {
        printf("Failed to init in-memory zip\n");
        result = TZIP_ERROR;
        goto cleanup;
    }

    zip_entry_open(pzip, file1_name);
    zip_entry_write(pzip, file1_data, strlen(file1_data));
    zip_entry_close(pzip);

    zip_entry_open(pzip, file2_name);
    zip_entry_write(pzip, file2_data, strlen(file2_data));
    zip_entry_close(pzip);

    zip_entry_open(pzip, file3_name);
    zip_entry_write(pzip, file3_data, strlen(file3_data));
    zip_entry_close(pzip);

    num_del_entries = zip_entries_delete(pzip, del_entries, 1);
    if(num_del_entries < 0) {
        printf("Failed to delete entry\n");
        result = TZIP_ERROR;
        goto cleanup;
    }

    zip_stream_copy(pzip, (void **)&zdata, &zdata_len);
    if(!zdata) {
        printf("Failed to get compressed data");
        result = TZIP_ERROR;
        goto cleanup;
    }

    result = write_to_disk(outfile_path, zdata, zdata_len);
    if(result != TZIP_OK) {
        printf("Failed to write to disk: %d\n", result);
        return result;
    }
    printf("Wrote zip to %s\n", outfile_path);

cleanup:
    free(zdata);

    if(pzip) {
        zip_stream_close(pzip);
    }

    return result;

prot0man commented 8 months ago

I've created an MR into my local fork of this repo to illustrate some changes that may help fix the issue at https://github.com/prot0man/zip/pull/1/commits/228756548f1daf37a436b2269b4521f339ad915f

Though those changes result in zip_entries_delete successfully deleting the file, I get CRC warnings when I unzip:

Archive:  myoutfile
  inflating: file1
file3:  mismatching "local" filename (file2),
         continuing with "central" filename version
  inflating: file3                    bad CRC a0c3e8fc  (should be d7c4d86a)

prot0man commented 8 months ago

Actually, with the changes in my fork, it doesn't actually successfully delete the correct file (file2) based on the hexdump of the resulting zip:

proto@D:~/repos/zip/src/scratch$ xxd myoutfile
00000000: 504b 0304 1400 0808 0800 db64 2c58 0000  PK.........d,X..
00000010: 0000 0000 0000 0000 0000 0500 0400 6669  ..............fi
00000020: 6c65 3101 0000 0001 1400 ebff 736f 6d65  le1.........some
00000030: 7465 7374 6461 7461 666f 7266 696c 6531  testdataforfile1
00000040: 504b 0708 46b9 ca39 1900 0000 0000 0000  PK..F..9........
00000050: 1400 0000 0000 0000 504b 0304 1400 0808  ........PK......
00000060: 0800 db64 2c58 0000 0000 0000 0000 0000  ...d,X..........
00000070: 0000 0500 0400 6669 6c65 3201 0000 0001  ......file2.....
00000080: 1400 ebff 736f 6d65 7465 7374 6461 7461  ....sometestdata
00000090: 666f 7266 696c 6532 504b 0708 fce8 c3a0  forfile2PK......
000000a0: 1900 0000 0000 0000 1400 0000 0000 0000  ................
000000b0: 504b 0102 0000 1400 0808 0800 db64 2c58  PK...........d,X
000000c0: 46b9 ca39 1900 0000 1400 0000 0500 0400  F..9............
000000d0: 0000 0000 0000 0000 0000 0000 0000 6669  ..............fi
000000e0: 6c65 3101 0000 0050 4b01 0200 0014 0008  le1....PK.......
000000f0: 0808 00db 642c 586a d8c4 d719 0000 0014  ....d,Xj........
00000100: 0000 0005 0004 0000 0000 0000 0000 0000  ................
00000110: 0058 0000 0066 696c 6533 0100 0000 504b  .X...file3....PK
00000120: 0506 0000 0000 0200 0200 6e00 0000 b000  ..........n.....
00000130: 0000 0000                                ....

kuba-- commented 8 months ago

Unfortunately, deleting entries has some limitations. I means, you cannot open the archive for writing w and delete entries. You have to close it, first (zip_close) and then open for deleting (there is also, just for consistency a special d mode for deleting).

In other words, the workflow should be like this:

zip_open
{
    zip_entry_open
    zip_entry_write
    zip_entry_close
}
zip_close

zip_open("foo.zip", 0, 'd');
{
    zip_entries_delete

    // you can also delete by index, instead of by name
    // zip_entries_deletebyindex
}
zip_close

look at readme example.

prot0man commented 8 months ago

So for my use case, I need to create an in memory zip and potentially remove entries if my archive ever exceeds a size. With my modifications, it almost works on the in memory zip, but the resulting zip clearly has vestigial evidence of the removed entry.

It's important to note that with my example code, I'm able to create an in memory zip and remove an entry (after the changes I have in the linked branch) such that the resulting zip uncompresses to the expected file1 and file3 . If that's API abuse, it may be worth adding some checks for zip_entries_delete that assert that we have the correct mode. I'm going to attempt to see if I can find a clean work-around to enable this use case in my branch though.

I've spent the better part of the day digging in to zip_entries_delete_mark trying to find where the issue might be, but it sounds like what I'm trying to accomplish just shouldn't actually work by design.

As a work-around, I guess I could create a shadow zip, attempt to compress the file in it, and then determine whether the file would put me over the size limit before adding it, but doing that hurts my soul.

On Fri, Jan 12, 2024, 3:39 PM Kuba Podgórski @.***> wrote:

Unfortunately, deleting entries has some limitations. I means, you cannot open the archive for writing w and delete entries. You have to close it, first (zip_close) and then open for deleting (there is also, just for consistency a special d mode for deleting).

In other words, the workflow should be like this:

zip_open { zip_entry_open zip_entry_write zip_entry_close } zip_close

zip_open("foo.zip", 0, 'd'); { zip_entries_delete
// you can also delete by index, instead of by name
// zip_entries_deletebyindex
} zip_close

look at readme example.

— Reply to this email directly, view it on GitHub https://github.com/kuba--/zip/issues/330#issuecomment-1889920897, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGZBJBF657YBZ2WZQ4TRG3YOGNQLAVCNFSM6AAAAABBYM5VEOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBZHEZDAOBZG4 . You are receiving this because you authored the thread.Message ID: @.***>

kuba-- commented 8 months ago

It's all about calculating index and offsets of entries. It's all has to be written to the central directory, at the end.

prot0man commented 8 months ago

Yeah, I was following through zip_central_dir_delete trying to see if there problem was there. I at least verified that after the call to zip_central_dir_move in the first while loop, the end result of the zip (pState->m_central_dir.m_p) didn't contain any references to the deleted file2 entry (via hexdump'ing that buffer). It looks like some other piece of memory is still referencing the deleted entry though.

kuba-- commented 8 months ago

Generally deleting is kind of experimental feature, so I would be very careful and for sure do not mix it with writing.

prot0man commented 8 months ago

It's actually still not clear to me why in zip_entries_delete_mark, the following is done:

  while (i < entry_num) {
    while ((i < entry_num) && (entry_mark[i].type == MZ_KEEP)) {
      ...
    }
    while ((i < entry_num) && (entry_mark[i].type == MZ_DELETE)) {
      ...
    }
    while ((i < entry_num) && (entry_mark[i].type == MZ_MOVE)) {
      ..
    }
    ...
}

Instead of:

  while (i < entry_num) {
    if(entry_mark[i].type == MZ_KEEP) {
      ...
    } else if(entry_mark[i].type == MZ_DELETE) {
      ...
    } else if (entry_mark[i].type == MZ_MOVE) {
      ..
    }
    ...
}

Is it some sort of optimization because doing those actions in bulk via the while loop is more efficient?

After I resolved some of my technical debt with the zip file format, I realized the problem I'm running into is that the central_dir is not in sync with the file headers in some way (probably because the seeks that are done when we're dealing with a file stream need to be mirrored for zip on the heap). I suspect I will be forced to treat the mode 'w' as delete as well or else I'd have to do a disgusting work around that:

Closes the zip I'm actively writing to and gets the zip data
Opens zip with zip data from previous step as delete
Removes entries
Closes zip and gets the zip data
Re-opens zip with zip data from previous step as write
Continue adding to zip until we've reached our size limit

I understand that separating the operations for delete and write simplify the API because keeping everything up-to-date for both in-memory zips and file-backed zips can get complex, but are there any other compelling reasons for this design? I'll continue pushing forward in my branch and hopefully come up with a solution that doesn't require a lot of flaming hoops to jump through.

prot0man commented 8 months ago

I wonder if the issue is that in zip_close, you make a call to mz_zip_writer_finalize_archive, but in zip_stream_close, there is no analogous call to mz_zip_writer_finalize_heap_archive.

prot0man commented 8 months ago

Nevermind, mz_zip_writer_finalize_heap_archive is just basically a wrapper on mz_zip_writer_finalize_archive that additionally copies the data into the passed buffer.

prot0man commented 8 months ago

Alright, finally got things working in #332

I plan on doing some more thorough testing of zip_entries_delete when I get time to see if it handles the edge cases I think of.

kuba-- commented 8 months ago

Thanks! Already merged, so I'm closing the issue.

If you want to add more comments, do not hesitate to reopen it, or create a new one.

kuba-- / zip

zip_entries_delete fails for in memory stream zip #330