hsiehsh168168 / warc-tools

Automatically exported from code.google.com/p/warc-tools

warcvalidator seems slow #103

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Use WARC Tools r245 (warcvalidator.c: r211 | voidptrptr | 2008-11-07)
2. Run: time warcvalidator -f 100MB.warc.gz

What is the expected output? What do you see instead?

expected time comparable to zcat (around 30sec), but took over 15min instead.

Please use labels and text to provide additional information.

from warc-tools list post [Sep 14 2009, 8:46 am]:

hi WARC Tools,

warcvalidator seems to take an inordinate amount of time
to validate our warc files, which ultimately do turn out
to be valid.

  WARC Tools r245
  warcvalidator.c: r211 | voidptrptr | 2008-11-07

on an unloaded crawler with dual 2.6GHz cpus and 4GB memory,
it took about 5 hours to process a 1GB WARC (the new standard
size) and 15 minutes to validate a 100MB WARC. gzip takes about
20 seconds to unpack a 1GB WARC.

  OS: Ubuntu 5.10 "Breezy Badger"
  kernel: Linux 2.6.16.1 #1 SMP May11 2006 x86_64 GNU/Linux
  cpu: 2 x AMD 64GB 2605.873MHz
  mem: 4015252k total

during processing, the cpus were mostly idle, about 90% of
memory was in use, and disk activity was low. in top, with a
sample delay of 0.1 seconds, i could see that warcvalidator
appears briefly and then goes away for a few seconds, then
appears briefly again.

  time zcat WARC_1GB > /dev/null
  real 0m20.781s
  user 0m19.620s
  sys  0m1.150s

  time warcvalidator -f WARC_100MB
  real  15m10.865s
  user  0m8.510s
  sys   0m8.770s

  time warcvalidator -f WARC_1GB
  real    298m58.218s (4.98 hrs)
  user    1m21.760s
  sys     1m35.540s
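
As a rough check on the timings above, the following sketch converts the `time` output to seconds and computes what share of wall-clock time each run actually spent on the CPU (user + sys divided by real). The helper names `to_seconds` and `cpu_fraction` and the `RUNS` table are illustrative only and are not part of WARC Tools. The figures are copied from the runs above.

```python
import re

# Matches durations as printed by the shell's `time`, e.g. "298m58.218s".
_DURATION = re.compile(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s")


def to_seconds(text):
    """Convert a `time`-style duration such as '15m10.865s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def cpu_fraction(real, user, sys_time):
    """Share of wall-clock time spent on the CPU: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(sys_time)) / to_seconds(real)


# (real, user, sys) exactly as reported in this issue.
RUNS = {
    "zcat WARC_1GB": ("0m20.781s", "0m19.620s", "0m1.150s"),
    "warcvalidator -f WARC_100MB": ("15m10.865s", "0m8.510s", "0m8.770s"),
    "warcvalidator -f WARC_1GB": ("298m58.218s", "1m21.760s", "1m35.540s"),
}

for name, (real, user, sys_time) in RUNS.items():
    share = cpu_fraction(real, user, sys_time)
    print(f"{name}: {to_seconds(real):.1f}s wall, {share:.1%} on CPU")
```

On these numbers zcat is almost entirely CPU-bound, while warcvalidator spends roughly 2% (100MB) and 1% (1GB) of its wall-clock time on the CPU. That is consistent with the mostly idle cpus seen in top and suggests the process is waiting on something rather than computing. The timings alone do not show what it is waiting on.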

any idea what warcvalidator is doing?

http://groups.google.com/group/warc-tools/browse_thread/thread/583fc0f407deaec7/d24adbf3e57995b2#d24adbf3e57995b2

Original issue reported on code.google.com by st...@archive.org on 1 Apr 2010 at 9:53