juninho12 / freearc

Automatically exported from code.google.com/p/freearc

Store identical files only once #303

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What should be implemented?
Store identical files only once in the archive, and have multiple entries in 
the header refer to the same block of data.
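
A minimal sketch of that layout, assuming a toy in-memory "archive" (the names `build_archive`, `extract`, `blocks` and `header` are hypothetical and have nothing to do with FreeArc's real format): each unique content is stored once, and every header entry just points at a block index.

```python
def build_archive(files):
    """files: dict of name -> bytes. Returns (blocks, header)."""
    blocks = []      # unique file contents, each stored only once
    index_of = {}    # content -> its position in blocks
    header = []      # (name, block_index) entries, one per file
    for name, data in files.items():
        if data not in index_of:
            # First time this exact content is seen: store it.
            index_of[data] = len(blocks)
            blocks.append(data)
        # Identical files end up with the same block index.
        header.append((name, index_of[data]))
    return blocks, header


def extract(blocks, header):
    """Rebuild the name -> bytes mapping from the header entries."""
    return {name: blocks[i] for name, i in header}
```

Here three files with two distinct contents would give two stored blocks and three header entries, and `extract` still returns all three files.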

Why will it be useful?
It will dramatically decrease the size of the output archive if there are many 
identical files in the archive.

===========================================
I haven't had much time to look into FreeArc's source code (it is written in a 
language that is totally new to me, so it takes longer than a familiar one 
would). But I have done similar things with M$-CAB before. From what I've read, 
FreeArc seems to have a 'data blocks + header' structure, so this should be 
possible.

Basically, we only care about files that are binary-identical, not about the 
same text stored in different encodings. To be binary-identical, 2 files have 
to be the same size. And as soon as one byte differs, you don't need to 
continue comparing.
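
The check described above could be sketched roughly like this. This is only an illustration, not FreeArc code; `same_bytes`, `group_identical` and `CHUNK` are made-up names, and it assumes the files are regular files on disk.

```python
import os
from collections import defaultdict

CHUNK = 64 * 1024  # bytes read per comparison step (arbitrary choice)


def same_bytes(path_a, path_b):
    """True if both files have exactly the same content."""
    # Different sizes can never be binary-identical, so skip reading.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                return False   # first difference found, stop comparing
            if not a:
                return True    # both files ended without a difference


def group_identical(paths):
    """Group paths into lists of files with identical content."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    groups = []
    for candidates in by_size.values():
        buckets = []  # each bucket holds paths with the same content
        for p in candidates:
            for bucket in buckets:
                if same_bytes(bucket[0], p):
                    bucket.append(p)
                    break
            else:
                buckets.append([p])
        groups.extend(buckets)
    return groups
```

Only files of equal size are ever read, so in a typical file set most files are ruled out without touching their contents.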

Original issue reported on code.google.com by YumeYao on 28 May 2012 at 3:09

GoogleCodeExporter commented 9 years ago
I just learned that FreeArc has a filter called REP. But it seems to work 
file-wide, not block-wide.

If it could be applied block-wide, identical files would be handled just as if 
they were stored only once.

Original comment by YumeYao on 28 May 2012 at 10:20

GoogleCodeExporter commented 9 years ago
REP is blockwise (as is any other compression algorithm), but sometimes 
FreeArc splits files into solid blocks that are too small, so it can't find the 
similarity. So, overall, your idea should be implemented and isn't new at all. 
Unfortunately, my priority now is bugfixing and the GUI, so it will be a long 
time before I get to implementing it.

Original comment by bulat.zi...@gmail.com on 28 May 2012 at 4:55

GoogleCodeExporter commented 9 years ago
I see. I just did some tests. REP worked fine on regular files, with the 
exception of sound (wave) files.

For example, if I have 2 identical *.wav files, REP won't work.

If there is only one set of identical *.wav files, I can add arguments to the 
REP filter to raise the minimum match size so that REP won't find matching 
blocks within a single file (the resulting data should still be OK to compress 
with TTA). <---- Haven't figured out how to do it yet, though.

But if there are many sets of identical *.wav files, limiting the minimum size 
may not work properly, and that is where identifying identical files shows its 
value.

Original comment by YumeYao on 28 May 2012 at 5:33

GoogleCodeExporter commented 9 years ago
FreeArc always treats each *.wav file as its own solid block for TTA 
compression, so REP can't do its job there.

So it turns out this enhancement is still needed.

Original comment by YumeYao on 29 May 2012 at 4:33