jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org
1.11k stars 184 forks source link

Create incremental patches #108

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I have a program that creates reports based on the file system each night. 
These files need to be archived over a long period. A typical file will 
have roughly 50MB.

I have started using xdelta3 to reduce the size needed for storage. Which 
means there will be one main file (50MB) and then each night I "compress" 
them with xdelta3 by creating a patch to the original file. These diffs 
start out small, but since more and more changes are being made to the 
file system the diffs grow each night.

My question now: Being on windows, is it possible to create "incremental" 
rather than "differential" patches using xdelta3? (similar to how backup 
software works)

----
Example:
Situation now:
day 1: report.rpt
day 2: report.diff1 -> report.rpt
day 3: report.diff2 -> report.rpt
day 4: report.diff3 -> report.rpt

Situation wanted:
day 1: report.rpt
day 2: report.diff1 -> report.rpt
day 3: report.diff2 -> report.diff1 -> report.rpt
day 4: report.diff3 -> report.diff2 -> report.diff1 -> report.rpt
----

I can recreate the full files of each day by doing xdelta3 -d -s 
report.rpt report.diff2 reportday3.rpt.

If I used the incremental method i would need to be doing something like 
xdelta3 -d -s report.rpt report.diff1 report.diff2 reportday3.rpt OR 
xdelta3 -d -s report.rpt report.diff1 | xdelta -d report.diff2 (?)

Is this currently possible and if yes, how?

Original issue reported on code.google.com by ralfo...@googlemail.com on 22 Feb 2010 at 3:41

GoogleCodeExporter commented 9 years ago
If I understand correctly, you're "now" situation is using what we call 
"forward" deltas. 
The "incremental" solution you're looking for would be to use "reverse" deltas, 
in 
which you always store a recent copy of the data along with a delta per day 
going 
backwards in time (i.e., reverse), where each daily delta produces the full 
data for the 
day before.

The challenge is to efficiently compute several days ago, without applying 
deltas one 
at a time.  Is that right?  That's what the "xdelta3 merge" command is for.  
I'd say it's 
somewhat experimental.  I know there are users, but there is at least one 
unconfirmed bug report.

Original comment by josh.mac...@gmail.com on 23 Feb 2010 at 4:45

GoogleCodeExporter commented 9 years ago
Hello Josh,

Thanks for replying.

I knew this was kind of tough to explain. I'll try to be more specific:

If i wasn't using xdelta3 at all, the following files would get produced:

day 1: report1.rpt (50MB)
day 2: report2.rpt (50MB)
day 3: report3.rpt (50MB)
day 4: report4.rpt (50MB)

So in order to reduce the amount of data that is stored on disk I use 
report1.rpt as 
a base file and "compress" all reports day 2+ by replacing them with a diff 
file 
(using xdelta) based on report1.rpt.

So:

day 1: report1.rpt (50MB)
day 2: report1.rpt.diff1 (2MB) based on report1.rpt.
day 3: report1.rpt.diff2 (2.2MB) based on report1.rpt
day 4: report1.rpt.diff3 (2.4MB) based on report1.rpt

Where:
xdelta3 -d -s report1.rpt.diff1 report1.rpt will give original report2.rpt
xdelta3 -d -s report1.rpt.diff2 report1.rpt will give original report3.rpt
xdelta3 -d -s report1.rpt.diff3 report1.rpt will give original report4.rpt

Alright. So now after day 1 (50MB) each night the differences between the 
current 
file and the day 1 file will be stored. This saves a lot of data but still 
there 
could (potentially) be even more saving by only saving the differences to the 
respective day before.

day 1: report1.rpt (50MB)
day 2: report1.rpt.inc1 (2MB) based on report1.rpt.
day 3: report1.rpt.inc2 (0.2MB) based on report1.rpt.inc1 based on report1.rpt
day 4: report1.rpt.inc3 (0.4MB) based on report1.rpt.inc2 based on 
report1.rpt.inc1 
based on report1.rpt

So in conclusion I'm not 100% sure if I understand your forward vs. backward 
approach. In my understanding its both the same, just that the "diff" is only 1 
patch where "inc(remental)" is chaining multiple patches together.

How to produce the patches?
Right now (diff) I'm simply using the following "compression" syntax:

day 1: nothing
day 2: xdelta3 -e -9 -s report1.rpt report2.rpt report1.rpt.diff1; del 
report2.rpt
day 3: xdelta3 -e -9 -s report1.rpt report3.rpt report1.rpt.diff2; del 
report3.rpt
day 4: xdelta3 -e -9 -s report1.rpt report4.rpt report1.rpt.diff3; del 
report4.rpt

For incremental I guess this could work:

day 1: nothing
day 2: xdelta3 -e -9 -s report1.rpt report2.rpt report1.rpt.inc1; del 
report2.rpt
day 3: xdelta3 -d -s report1.rpt report1.rpt.inc1 | xdelta3 -e -9 report3.rpt 
report1.rpt.inc2; del report3.rpt
day 4: xdelta3 -d -s report1.rpt report1.rpt.inc1 | xdelta3 -d report1.rpt.inc2 
| 
xdelta3 -e -9 report4.rpt report1.rpt.inc3; del report4.rpt

Would that work?

I was unable to test this so far since running xdelta3 with stdin as input 
gives me 
the following error on windows (3.0y): Not enough memory for this command (OS 
error 
8).

Original comment by ralfo...@googlemail.com on 23 Feb 2010 at 8:41

GoogleCodeExporter commented 9 years ago
The first scenario you explain is called "forward deltas", it has the advantage 
that 
every version can be restored by reading only two files.

The second scenario you explain seems more difficult than necessary.  If 
instead of 
storing report1.rpt and attempted to store deltas as you explained, why not 
store 
report2.rpt on the second day along with a reverse delta (report2 -> report1)?  
Then 
on day three you save report3.rpt and store the reverse delta (report 3 -> 
report2)?  
Then you can use the "xdelta3 merge" command to merge report3->report2-
>report1 and restore report1.rpt.

Still, what you propose should work and does work on UNIX.  I need to get my 
act 
(w.r.t streaming source and inputs) together on Windows, sorry I haven't made 
it 
work there yet.

Original comment by josh.mac...@gmail.com on 28 Feb 2010 at 8:10

GoogleCodeExporter commented 9 years ago
Hi Josh,

what you are proposing:
------
If instead of 
storing report1.rpt and attempted to store deltas as you explained, why not 
store 
report2.rpt on the second day along with a reverse delta (report2 -> report1)?  
Then 
on day three you save report3.rpt and store the reverse delta (report 3 -> 
report2)?  
------

If I understand your approach that would mean I would delay deleting the 
original 
"full" file of each day by one day? (Just that I understand your take)

Original comment by ralfo...@googlemail.com on 3 Mar 2010 at 3:47

GoogleCodeExporter commented 9 years ago
Given the recommendation to do reverse deltas, it would be helpful to have some 
way for xdelta to do the copy and diff generation simultaneously. I'm using 
xdelta to back up multi-gigabyte Hyper-V virtual hard disk files, and so it 
takes about 10 minutes to copy the file to the backup location, plus another 10 
minutes to generate the diff. I'm guessing the total time could be cut nearly 
in half if the copy and the diff could be done in one pass.

Original comment by mjhan...@gmail.com on 8 Dec 2011 at 4:26

GoogleCodeExporter commented 9 years ago
I am working on a Linux computer and am also interested in incremental patches. 
I currently store "report1.rpt" on disk and apply incremental patches to get 
back my current report. However, I'm having an issue with xdelta being unable 
to calculate dynamically rebuilt reports over pipes. Example :

cat report3.rpt | xdelta3 -s <(cat report1.rpt | xdelta3 -Rd report1.rpt.inc1 | 
xdelta3 -Rd report1.rpt.inc2) report1.rpt.inc3

Any idea why this isn't working?

Original comment by m...@markelee.com on 3 Oct 2014 at 6:13