add ability to stream source and not requre the entire source in memory

GoogleCodeExporter commented 9 years ago

Currently the entire source is required to be in memory meaning it can not 
handle large files or stream from a network.

Original issue reported on code.google.com by lordje...@gmail.com on 21 Oct 2008 at 10:48

GoogleCodeExporter commented 9 years ago

Original comment by openvcd...@gmail.com on 24 Oct 2008 at 12:27

Changed state: Accepted
Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Original comment by openvcd...@gmail.com on 6 Aug 2010 at 10:56

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

This would be a very useful feature for our use.  Some of our files are very 
large, around 500 MB, and the device where we're decoding is memory 
constrained.  Being able to decode on an incremental basis without exhausting 
memory is the way to go.

Original comment by john.b.s...@gmail.com on 29 Oct 2010 at 4:14

GoogleCodeExporter commented 9 years ago

The source file is less of a problem, assuming your target platform supports 
memory mapped files. Passing a read-only mmapped region as the source 
dictionary works well in practice. Streaming over a network is always going to 
be difficult due to the random-access nature of the dictionary lookups. You'd 
have to reengineer the diff format to be ordered by source address, not target 
address to allow source streaming.

The bigger memory issue I've encountered (on iOS devices) is that the entire 
decoded *target* file is kept in memory at once, making the streaming output 
interface (via implementing OutputString) somewhat farcical. There doesn't seem 
to be an interface for custom allocation of the target memory, which would 
allow passing in a read-write memory mapped file. With such a setup, the OS 
could evict the regions which aren't used for back-references within the target 
file.

I'll look into patching the code to allow for such a mode of operation.

Original comment by phil%phi...@gtempaccount.com on 29 Nov 2010 at 4:09

GoogleCodeExporter commented 9 years ago

Very good news.  We're not planning on streaming over a network.  We'd get the 
dictionary file to the device via a delivery channel, but would then want to be 
able to apply the diff and read the source file in manageable chunks.  And of 
course, writing out the target file as it is created rather than keeping it all 
in memory until the end would make this feasible on memory limited devices.

Thanks!

Original comment by john.b.s...@gmail.com on 29 Nov 2010 at 5:41

GoogleCodeExporter commented 9 years ago

> The bigger memory issue I've encountered (on iOS devices)
> is that the entire decoded *target* file is kept in memory
> at once, making the streaming output interface
> (via implementing OutputString) somewhat farcical.
> There doesn't seem to be an interface for custom allocation
> of the target memory, which would allow passing in a
> read-write memory mapped file. With such a setup,
> the OS could evict the regions which aren't used for
> back-references within the target file.
>
> I'll look into patching the code to allow for such a mode
> of operation.

Please take a look at issues 9 and 19.  As long as you specify the option 
--allow_vcd_target=false, or call VCStreamingDecoder::SetAllowVcdTarget(false), 
only the current target window will be kept in memory, not the whole target 
file.  Using this option (perhaps in combination with --max_target_file_size) 
will hopefully mitigate the problem you've described.

Your suggestion of using a memory-mapped file for output is intriguing.  Issue 
12 includes a related suggestion of passing a memory buffer to the decoder, 
which should fill it and report whether the input has been exhausted or not.

Original comment by openvcd...@gmail.com on 30 Nov 2010 at 12:46

GoogleCodeExporter commented 9 years ago

I encountered the allow_vcd_target option while digging through the source to 
diagnose this problem, but it wasn't clear to me whether it was safe to use. 
Thanks for clearing this up.

I have however meanwhile hacked together a change to the decoder which allows 
supplying your own target buffer manager instead of using std::string for 
decoded_target_. This could be useful even where you can hold the whole file in 
memory, as resizing the buffer can be implemented without a copy using mmap()ed 
anonymous memory. (or indirectly via realloc()) Using std::string with the 
default allocator also causes some severe memory fragmentation...

In my case, in iOS, I implement a buffer which mmap()s the output file, with 
reserve() resizing the file with ftruncate() and remapping. The streaming 
output is just a no-op dummy. This works well, although the API is very hacky 
at the moment (passes all tests, however). I'll try to tidy it up and post a 
patch.

In practice, I think I'll use both approaches, as I don't see any immediate 
disadvantage to the mmap() solution, and VCD_TARGET acts as a hint that a 
mapped region is no longer required.

Original comment by phil%phi...@gtempaccount.com on 30 Nov 2010 at 11:07

GoogleCodeExporter commented 9 years ago

Sounds great.  I encourage you to contribute your changes if you think they 
would be useful to others.  Please take a look at the development guide:
http://code.google.com/p/open-vcdiff/wiki/OpenVcdiffDevGuide

Original comment by openvcd...@gmail.com on 30 Nov 2010 at 5:02

tarsh / open-vcdiff

add ability to stream source and not requre the entire source in memory #17