pombreda / libarchive

Automatically exported from code.google.com/p/libarchive
Other
0 stars 0 forks source link

Support reading from multiple data objects (multivolume reading) #166

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
RAR archives can be split into multiple files (to provide multivolume support). 
Each file contains the RAR signature header, a main archive header, and the 
optional EOF header. The data blocks are split arbitrarily between each file in 
a multivolume set of files. Currently, libarchive doesn't handle reading from 
multiple files.

This patch would introduce reading from multiple files by way of reading from 
multiple client objects. What would happen is that there is a chain of client 
objects, each with the callbacks and data necessary to open, read, skip, and 
close each object it's reading from (such as different files). Data is read 
from each of these clients as one large stream. I plan on implementing 
multivolume reading support of RAR files once general reading from multiple 
streams is accepted and committed to trunk.

I introduced a new callback (switch callback) that can be used to switch from 
reading of one client to the next or previous client. I needed some way to 
determine whether a file should be closed because it's going to open the next 
file, or if it's being closed because libarchive is done reading from the file 
set. The latter would mean that I also need to free all memory allocated for 
all data objects of each client.

I've introduced some test cases already for reading from these multiple 
clients. The test files are simply some reused test rar files that have been 
splitted using the 'split' program. There's also a test case for supplying 
custom callbacks and multiple client objects. This custom callbacks test case 
is essentially the way I see of using libarchive to read from multiple files 
with custom callbacks. I plan on using libarchive in this way in another 
application (XBMC).

This patch also updates test_fuzz so it can read from multiple files. 
Currently, the multiple files used in test_fuzz would have the same result as 
"test_read_format_rar.rar" would. Once I have RARv3 multivolume reading support 
implemented, this would provide a better test for test_fuzz.

Original issue reported on code.google.com by mcitadel@gmail.com on 13 Aug 2011 at 6:39

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by mcitadel@gmail.com on 13 Aug 2011 at 6:57

GoogleCodeExporter commented 9 years ago
I deleted the current patch as I found there were some memory leaks with the 
changes I was proposing. I'll post another patch soon.

Original comment by mcitadel@gmail.com on 19 Aug 2011 at 3:35

GoogleCodeExporter commented 9 years ago
Here's a new patch with the changes I'm proposing.

This is different from the last patch. Instead of using a doubly linked list of 
client objects, an array of void pointers inside a struct is used to maintain 
multiple data objects that will be used to chain multiple streams together. 
This should be an improvement as far as performance is concerned. Reading from 
multiple sources, such as multiple files in a multivolume file set, is done by 
giving libarchive multiple data objects. You can see the changes in 
archive_read_open_filename.c to see how reading from multiple file will work.

For the cases with archives that have been split using the 'split' program, all 
format readers will not require any change to read these files.

All existing functions and callbacks will perform the same tasks. The new 
'switch' callback performs the operation of closing one file, and opening the 
next, without freeing the data objects given from a client (same proposed 
functionality as in the previous patch). There are now functions to set a data 
object at some index, add an object at an index, and append/prepend a data 
object in the array of data objects. I've not added functions to retrieve data 
objects at some index. Perhaps there should be, would welcome discussion on 
that.

The appropriate test cases are implemented including modifications to test_fuzz 
for testing multivolume files. All tests pass. Also, this timevthere are no 
memory leaks or any other memory related breakage.

Original comment by mcitadel@gmail.com on 19 Aug 2011 at 7:45

GoogleCodeExporter commented 9 years ago

Original comment by mcitadel@gmail.com on 19 Aug 2011 at 7:45

Attachments:

GoogleCodeExporter commented 9 years ago
Here's another patch that implements support for reading multivolume RAR files 
that have been split using RAR's mechanism for splitting RAR files. This patch 
applies after applying the 'multiple_data_objects.patch'

Test cases are also implemented and all pass.

Original comment by mcitadel@gmail.com on 19 Aug 2011 at 7:48

Attachments:

GoogleCodeExporter commented 9 years ago
I created a branch to make reviewing much easier I hope.

Original comment by mcitadel@gmail.com on 5 Sep 2011 at 6:28

GoogleCodeExporter commented 9 years ago
Just a general ping.

Were there any other questions or concerns about these changes?

Original comment by andres.m...@gmail.com on 6 Sep 2011 at 1:18

GoogleCodeExporter commented 9 years ago
Oops, sorry. That last message is from me.

Original comment by mcitadel@gmail.com on 6 Sep 2011 at 1:19

GoogleCodeExporter commented 9 years ago
Ping

Original comment by mcitadel@gmail.com on 22 Sep 2011 at 5:44

GoogleCodeExporter commented 9 years ago
It feels like you're doing a lot more work than necessary to handle this 
particular case.

If you are just interested in handling files created by 'split', then you can 
do that by adding archive_read_open_filenames (which you've done), but you 
don't need to make any changes to archive_read.c at all.  
Archive_read_open_filenames can keep a list of filenames and when it hits 
end-of-file on one, open the next file as necessary.  That doesn't require any 
changes to archive_read.c nor to the RAR reader, I don't think.

Original comment by kientzle@gmail.com on 23 Sep 2011 at 5:37

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Wow, I see the last comment got mangled.

So what I was trying to say, I want to make it possible for downstreams of 
libarchive to pass in their multiple IO handles (such as multiple FILE 
handles). Such is the case with XBMC which needs to be able to supply it's own 
callbacks and data objects.

Original comment by mcitadel@gmail.com on 23 Sep 2011 at 4:23

GoogleCodeExporter commented 9 years ago
Ping.

Any other questions/concerns about this issue?

Original comment by mcitadel@gmail.com on 15 Oct 2011 at 4:16

GoogleCodeExporter commented 9 years ago
My only concern is that your approach is overkill for what you want to do.

Since XBMC needs a custom reader anyway, that reader can itself keep a queue of 
data objects and start reading the next when one is complete.  There's nothing 
here so far that requires changes to archive_read.c.

Original comment by kientzle@gmail.com on 16 Oct 2011 at 2:14

GoogleCodeExporter commented 9 years ago
I'm not sure I understand this part:

In particular, *avail <= 0 seems an odd condition:  I understand that you're in 
the end-of-file case, so you might need to retry to span to the next volume.

But *avail < 0 is usually the error case.  Shouldn't ARCHIVE_FATAL be handled 
differently here?

+static const void *
+rar_read_ahead(struct archive_read *a, size_t min, ssize_t *avail)
+{
+  struct rar *rar = (struct rar *)(a->format->data);
+  const void *h = __archive_read_ahead(a, min, avail);
+  int ret;
+  if (avail && *avail > rar->bytes_remaining)
+    *avail = rar->bytes_remaining;
+  if (avail && *avail <= 0 && rar->main_flags & MHD_VOLUME &&
+    rar->file_flags & FHD_SPLIT_AFTER)
+  {
+    ret = archive_read_format_rar_read_header(a, a->entry);
+    if (ret == (ARCHIVE_EOF))
+      ret = archive_read_format_rar_read_header(a, a->entry);
+    if (ret != (ARCHIVE_OK))
+      return NULL;
+    return rar_read_ahead(a, min, avail);
+  }
+  return h;
+}

Original comment by kientzle@gmail.com on 20 Nov 2011 at 11:53

GoogleCodeExporter commented 9 years ago
Check the latest commits to the multivolume branch. I just fixed your last 
concern so that the error case is properly handled.

Also, some change in trunk that occurred between the summer and now has caused 
me to need a fix where I have to ensure the first data node is opened in 
archive_read_open1 after all bidding is done. It looks like I needed to make 
the changes in archive_read.c anyway. I'm not sure if I can implement 
multivolume support without touching archive_read.c.

Original comment by mcitadel@gmail.com on 23 Dec 2011 at 1:31

GoogleCodeExporter commented 9 years ago

Original comment by amejia004@gmail.com on 16 Jan 2012 at 4:19

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Changes for this is now in the form of a pull request. See 
https://github.com/libarchive/libarchive/pull/2

Original comment by amejia004@gmail.com on 25 Feb 2012 at 5:37

GoogleCodeExporter commented 9 years ago
Pull request has been merged.

Original comment by amejia004@gmail.com on 28 Mar 2012 at 9:36