Closed GoogleCodeExporter closed 8 years ago
warcdump -- speed & memory use:
Quick tests comparing the speed of warcdump on a 1.1GB warc file:
Command "time cat five.warc > /dev/null"
Result: takes about 20 seconds / 2.9K memory (measured by top).
Command "time ../warc-tools-read-only/app/warcdump -f five.warc > /dev/null"
Result: takes about 2.6 seconds / 1.8K memory (measured by top).
Memory use appears to be constant regardless of file size.
arc2warc:
Quick tests on arc2warc show a constant amount of memory used even for larger ARC
files (a 1GB file converted in about 1:11 uses the same memory throughout, less than
2K in top). The same goes for a 5.5GB file.
Original comment by gordon.p...@gmail.com
on 7 Nov 2008 at 3:32
Tried to convert a 5.5GB ARC file to WARC with the command:
../warc-tools-read-only/app/arc2warc -a six.arc.gz -f siz.warc
but the command halted with this error:
> debug: lib/private/wfile.c :1971:"couldn't add record to the warc file, maximum size reached"
Original comment by gordon.p...@gmail.com
on 7 Nov 2008 at 3:39
Younes -- Could you let us know what the maximum size is for this command, and
whether it applies to the other tools?
Original comment by gordon.p...@gmail.com
on 7 Nov 2008 at 3:40
Reopened, discussing with Younes as there is a maximum WARC size set in the code.
Younes wrote:
Actually, we're using a warc_u32_t (i.e. a 32-bit unsigned int, maximum
4,294,967,295, so about 4 GB of length) to handle the WARC file.
We thought that was a good strategy to make you think and avoid having big
WARCs. Something between 100 MB and 600 MB (or even 1 GB) is a good choice
in my opinion (it minimizes the risk of data loss, gives pretty fast data
copying ...).
This is what I.A, Hanzo and others use in general.
In the file "app/arc2warc.c", you can find a 32-bit integer constant called:
#define WARC_MAX_SIZE 1629145600
This is your actual limit. You can increase this value up to a maximum of
4 GB and try again:
#define WARC_MAX_SIZE 4294967296
Original comment by gordon.p...@gmail.com
on 12 Nov 2008 at 12:07
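A note on the numbers quoted above: a 32-bit unsigned integer can hold at most 4,294,967,295, so the suggested value 4294967296 (2^32) is one past that maximum and would wrap to 0 if stored in a 32-bit field. The sketch below illustrates this with standard C types. The helper names fits_in_u32 and truncate_to_u32 are hypothetical and are not part of warc-tools, and uint32_t is only assumed to behave like warc_u32_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from warc-tools): true when a byte count can be
 * stored in a 32-bit unsigned field without wrapping. */
bool fits_in_u32(uint64_t nbytes)
{
    return nbytes <= UINT32_MAX; /* UINT32_MAX is 4294967295 */
}

/* Hypothetical helper: the value a 32-bit unsigned field would hold after
 * conversion. Unsigned conversion is well defined in C: the result is the
 * input modulo 2^32. */
uint32_t truncate_to_u32(uint64_t nbytes)
{
    return (uint32_t)nbytes;
}
```

Under these assumptions, fits_in_u32(1629145600) and fits_in_u32(4294967295) are true, fits_in_u32(4294967296) is false, and truncate_to_u32(4294967296) is 0.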
Original comment by gordon.p...@gmail.com
on 12 Nov 2008 at 12:08
From Younes's new release message today:
* Support for large WARC files up to 18 exabytes in size.
* Support for large WARC records up to 18 exabytes in size each.
Original comment by gordon.p...@gmail.com
on 17 Nov 2008 at 11:00
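The 18 exabyte figure is consistent with sizes being held in a 64-bit unsigned integer, although the release message does not say so explicitly; treat that as an assumption. A minimal sketch of the arithmetic, using a hypothetical helper whole_exabytes that is not part of warc-tools:

```c
#include <stdint.h>

/* Assumption: sizes are stored as 64-bit unsigned integers. The release
 * message only states the 18 exabyte limit, not the underlying type. */

/* Hypothetical helper (not from warc-tools): number of whole decimal
 * exabytes (10^18 bytes) contained in a byte count. */
uint64_t whole_exabytes(uint64_t nbytes)
{
    const uint64_t bytes_per_exabyte = 1000000000000000000ULL; /* 10^18 */
    return nbytes / bytes_per_exabyte;
}
```

With these definitions, whole_exabytes(UINT64_MAX) evaluates to 18, since UINT64_MAX is 18,446,744,073,709,551,615.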
Repeated test on 5.5GB ARC:
../warc-tools-read-only/app/arc2warc -a six.arc.gz -f six.warc
This works fine, output as follows:
-rw-r--r-- 1 paynter paynter 6495613350 2008-11-18 12:14 six.warc
Original comment by gordon.p...@gmail.com
on 18 Nov 2008 at 12:40
Original issue reported on code.google.com by
gordon.p...@gmail.com
on 27 Jul 2008 at 10:23