internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Optimise WarcWriter.maybe_size_rollover() #136

Closed vbanos closed 5 years ago

vbanos commented 5 years ago

Every time we write WARC records to a file, we call maybe_size_rollover() to check whether the current WARC file size is over the rollover threshold. That check uses os.path.getsize, which does a disk stat.

We already know the current WARC file size from the WARC record offset (self.f.tell()). There is no need to call os.path.getsize; we can just reuse the offset info.

This way, we do one fewer disk stat every time we write to a WARC, which is a nice improvement.
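To illustrate the idea, here is a minimal sketch, not warcprox's actual code. The class SketchWriter, its file naming, and the strict `>` comparison against the threshold are all assumptions made for the example; only the use of self.f.tell() in place of os.path.getsize comes from the description above.

```python
import os
import tempfile


class SketchWriter:
    """Hypothetical stand-in for a WARC writer (not the real WarcWriter)."""

    def __init__(self, directory, rollover_size):
        self.directory = directory
        self.rollover_size = rollover_size
        self.serial = 0
        self.f = None
        self.path = None

    def _open(self):
        # File naming here is invented for the sketch.
        name = "sketch-%05d.warc" % self.serial
        self.path = os.path.join(self.directory, name)
        self.serial += 1
        self.f = open(self.path, "wb")

    def write(self, payload):
        if self.f is None:
            self._open()
        self.f.write(payload)
        self.f.flush()
        self.maybe_size_rollover()

    def maybe_size_rollover(self):
        # The write offset of a file we only append to is its size so far,
        # so no os.path.getsize(self.path) call (and no stat) is needed.
        if self.f is not None and self.f.tell() > self.rollover_size:
            self.close()

    def close(self):
        if self.f is not None:
            self.f.close()
            self.f = None


def run_demo():
    """Write five 40-byte payloads with a 100-byte threshold."""
    with tempfile.TemporaryDirectory() as d:
        w = SketchWriter(d, rollover_size=100)
        for _ in range(5):
            w.write(b"x" * 40)
        w.close()
        return [os.path.getsize(os.path.join(d, n)) for n in sorted(os.listdir(d))]
```

In this sketch the third write pushes the offset to 120 bytes, which triggers the rollover, so run_demo() ends up with one 120-byte file and one 80-byte file.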

vbanos commented 5 years ago

While debugging this change, I printed the values of self.f.tell() and os.path.getsize(self.path) and saw that they are close but not equal: in every sample below, TELL is exactly 10 bytes larger than SIZE. This may have something to do with the way the file offset and the stat call calculate file size. It doesn't affect the correctness of this change, as the difference is minimal.

SIZE 1966
TELL 1976
SIZE 2332
TELL 2342
SIZE 2995
TELL 3005
SIZE 3360
TELL 3370
SIZE 4025
TELL 4035
SIZE 4391
TELL 4401
...
...
SIZE 37605
TELL 37615
SIZE 38161
TELL 38171
SIZE 38668
TELL 38678
SIZE 47901
TELL 47911
SIZE 48422
TELL 48432
SIZE 49040
TELL 49050
SIZE 49536
TELL 49546
SIZE 75077
TELL 75087
SIZE 75589
TELL 75599
SIZE 145951
TELL 145961
SIZE 146466
TELL 146476
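
As a general illustration of how the two numbers can diverge, the sketch below shows one well-known mechanism in Python: tell() on a buffered file counts bytes that are still in the write buffer, while os.path.getsize only sees what has reached the file. Whether this is what produced the 10-byte gap above is an assumption; the thread does not establish the cause. The function name tell_vs_getsize and the file name are made up for the example.

```python
import os
import tempfile


def tell_vs_getsize():
    """Return (tell, getsize) pairs before and after flushing a small write."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.bin")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"0123456789")  # 10 bytes, small enough to stay buffered
            before = (f.tell(), os.path.getsize(path))
            f.flush()  # push the buffered bytes to the file
            after = (f.tell(), os.path.getsize(path))
    return before, after
```

Before the flush, tell() already reports 10 while the file on disk is still empty; after the flush, both report 10.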

nlevitt commented 5 years ago

Thanks!