libarchive / libarchive

Multi-format archive and compression library
http://www.libarchive.org
Other
3.02k stars 765 forks source link

Support reading from multiple data objects (multivolume reading) #277

Closed kwrobot closed 9 years ago

kwrobot commented 9 years ago

Original issue 166 created by Google Code user mcitadel on 2011-08-13T18:39:46.000Z:

RAR archives can be split into multiple files (to provide multivolume support). Each file contains the RAR signature header, a main archive header, and the optional EOF header. The data blocks are split arbitrarily between each file in a multivolume set of files. Currently, libarchive doesn't handle reading from multiple files.

This patch would introduce reading from multiple files by way of reading from multiple client objects. What would happen is that there is a chain of client objects, each with the callbacks and data necessary to open, read, skip, and close each object it's reading from (such as different files). Data is read from each of these clients as one large stream. I plan on implementing multivolume reading support of RAR files once general reading from multiple streams is accepted and committed to trunk.

I introduced a new callback (switch callback) that can be used to switch from reading of one client to the next or previous client. I needed some way to determine whether a file should be closed because it's going to open the next file, or if it's being closed because libarchive is done reading from the file set. The latter would mean that I also need to free all memory allocated for all data objects of each client.

I've introduced some test cases already for reading from these multiple clients. The test files are simply some reused test rar files that have been splitted using the 'split' program. There's also a test case for supplying custom callbacks and multiple client objects. This custom callbacks test case is essentially the way I see of using libarchive to read from multiple files with custom callbacks. I plan on using libarchive in this way in another application (XBMC).

This patch also updates test_fuzz so it can read from multiple files. Currently, the multiple files used in test_fuzz would have the same result as "test_read_format_rar.rar" would. Once I have RARv3 multivolume reading support implemented, this would provide a better test for test_fuzz.
kwrobot commented 9 years ago

Comment #1 originally posted by Google Code user mcitadel on 2011-08-13T18:57:24.000Z:

<empty>

kwrobot commented 9 years ago

Comment #2 originally posted by Google Code user mcitadel on 2011-08-19T03:35:26.000Z:

I deleted the current patch as I found there were some memory leaks with the changes I was proposing. I'll post another patch soon.
kwrobot commented 9 years ago

Comment #3 originally posted by Google Code user mcitadel on 2011-08-19T19:45:08.000Z:

Here's a new patch with the changes I'm proposing.

This is different from the last patch. Instead of using a doubly linked list of client objects, an array of void pointers inside a struct is used to maintain multiple data objects that will be used to chain multiple streams together. This should be an improvement as far as performance is concerned. Reading from multiple sources, such as multiple files in a multivolume file set, is done by giving libarchive multiple data objects. You can see the changes in archive_read_open_filename.c to see how reading from multiple file will work.

For the cases with archives that have been split using the 'split' program, all format readers will not require any change to read these files.

All existing functions and callbacks will perform the same tasks. The new 'switch' callback performs the operation of closing one file, and opening the next, without freeing the data objects given from a client (same proposed functionality as in the previous patch). There are now functions to set a data object at some index, add an object at an index, and append/prepend a data object in the array of data objects. I've not added functions to retrieve data objects at some index. Perhaps there should be, would welcome discussion on that.

The appropriate test cases are implemented including modifications to test_fuzz for testing multivolume files. All tests pass. Also, this timevthere are no memory leaks or any other memory related breakage.
kwrobot commented 9 years ago

Comment #4 originally posted by Google Code user mcitadel on 2011-08-19T19:45:38.000Z:

<empty> See attachment: multiple_data_objects.patch

kwrobot commented 9 years ago

Comment #5 originally posted by Google Code user mcitadel on 2011-08-19T19:48:43.000Z:

Here's another patch that implements support for reading multivolume RAR files that have been split using RAR's mechanism for splitting RAR files. This patch applies after applying the 'multiple_data_objects.patch'

Test cases are also implemented and all pass.

See attachment: multivolume_rar_support.patch

kwrobot commented 9 years ago

Comment #6 originally posted by Google Code user mcitadel on 2011-09-05T18:28:27.000Z:

I created a branch to make reviewing much easier I hope.
kwrobot commented 9 years ago

Comment #7 originally posted by amejia1 on 2011-09-06T01:18:18.000Z:

Just a general ping.

Were there any other questions or concerns about these changes?
kwrobot commented 9 years ago

Comment #8 originally posted by Google Code user mcitadel on 2011-09-06T01:19:20.000Z:

Oops, sorry. That last message is from me.
kwrobot commented 9 years ago

Comment #9 originally posted by Google Code user mcitadel on 2011-09-22T17:44:16.000Z:

Ping
kwrobot commented 9 years ago

Comment #10 originally posted by kientzle on 2011-09-23T05:37:45.000Z:

It feels like you're doing a lot more work than necessary to handle this particular case.

If you are just interested in handling files created by 'split', then you can do that by adding archive_read_open_filenames (which you've done), but you don't need to make any changes to archive_read.c at all.  Archive_read_open_filenames can keep a list of filenames and when it hits end-of-file on one, open the next file as necessary.  That doesn't require any changes to archive_read.c nor to the RAR reader, I don't think.
kwrobot commented 9 years ago

Comment #12 originally posted by Google Code user mcitadel on 2011-09-23T16:23:05.000Z:

Wow, I see the last comment got mangled.

So what I was trying to say, I want to make it possible for downstreams of libarchive to pass in their multiple IO handles (such as multiple FILE handles). Such is the case with XBMC which needs to be able to supply it's own callbacks and data objects.
kwrobot commented 9 years ago

Comment #13 originally posted by Google Code user mcitadel on 2011-10-15T16:16:14.000Z:

Ping.

Any other questions/concerns about this issue?
kwrobot commented 9 years ago

Comment #14 originally posted by kientzle on 2011-10-16T02:14:12.000Z:

My only concern is that your approach is overkill for what you want to do.

Since XBMC needs a custom reader anyway, that reader can itself keep a queue of data objects and start reading the next when one is complete.  There's nothing here so far that requires changes to archive_read.c.
kwrobot commented 9 years ago

Comment #15 originally posted by kientzle on 2011-11-20T23:53:34.000Z:

I'm not sure I understand this part:

In particular, *avail <= 0 seems an odd condition:  I understand that you're in the end-of-file case, so you might need to retry to span to the next volume.

But *avail < 0 is usually the error case.  Shouldn't ARCHIVE_FATAL be handled differently here?

+static const void *
+rar_read_ahead(struct archive_read *a, size_t min, ssize_t *avail)
+{
+  struct rar *rar = (struct rar *)(a->format->data);
+  const void *h = __archive_read_ahead(a, min, avail);
+  int ret;
+  if (avail && *avail > rar->bytes_remaining)
+    *avail = rar->bytes_remaining;
+  if (avail && *avail <= 0 && rar->main_flags & MHD_VOLUME &&
+    rar->file_flags & FHD_SPLIT_AFTER)
+  {
+    ret = archive_read_format_rar_read_header(a, a->entry);
+    if (ret == (ARCHIVE_EOF))
+      ret = archive_read_format_rar_read_header(a, a->entry);
+    if (ret != (ARCHIVE_OK))
+      return NULL;
+    return rar_read_ahead(a, min, avail);
+  }
+  return h;
+}
kwrobot commented 9 years ago

Comment #16 originally posted by Google Code user mcitadel on 2011-12-23T01:31:36.000Z:

Check the latest commits to the multivolume branch. I just fixed your last concern so that the error case is properly handled.

Also, some change in trunk that occurred between the summer and now has caused me to need a fix where I have to ensure the first data node is opened in archive_read_open1 after all bidding is done. It looks like I needed to make the changes in archive_read.c anyway. I'm not sure if I can implement multivolume support without touching archive_read.c.
kwrobot commented 9 years ago

Comment #17 originally posted by amejia1 on 2012-01-16T04:19:06.000Z:

<empty>

kwrobot commented 9 years ago

Comment #19 originally posted by amejia1 on 2012-02-25T17:37:41.000Z:

Changes for this is now in the form of a pull request. See https://github.com/libarchive/libarchive/pull/2
kwrobot commented 9 years ago

Comment #20 originally posted by amejia1 on 2012-03-28T21:36:46.000Z:

Pull request has been merged.