scrapinghub / exporters

Exporters is an extensible export pipeline library that supports filtering, transformation, and several sources and destinations
BSD 3-Clause "New" or "Revised" License

fs_reader: speed up json lines reading #269

Closed immerrr closed 8 years ago

immerrr commented 8 years ago

GzipFile implements its buffering logic, including GzipFile.readline, in pure Python, which makes it quite slow. io.BufferedReader does the same work in C.
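
As a rough illustration of the idea for JSON lines input, here is a minimal sketch that wraps the GzipFile in io.BufferedReader before iterating over lines. The helper name `iter_json_lines` and the UTF-8 assumption are hypothetical and are not taken from the exporters code base; the speed-up may also vary with the Python version.

```python
import gzip
import io
import json


def iter_json_lines(path):
    """Yield one decoded JSON object per non-empty line of a gzipped file.

    Hypothetical helper for illustration only; not part of exporters.
    """
    with gzip.open(path, 'rb') as f:
        # Let io.BufferedReader handle buffering and line splitting
        # instead of calling GzipFile.readline for every line.
        with io.BufferedReader(f) as bf:
            for line in bf:
                line = line.strip()
                if line:
                    # Assumes the file is UTF-8 encoded.
                    yield json.loads(line.decode('utf-8'))
```

Usage would look like `for item in iter_json_lines('items.jl.gz'): ...`, where the file name is only a placeholder.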

Check out

immerrr commented 8 years ago

Also, this: https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/

eliasdorneles commented 8 years ago

Looks good, thanks @immerrr! Could you share your benchmarking results?

immerrr commented 8 years ago

The timings are as follows:

$ time python test_gzip.py ds_dump_US_1.jl.gz 
1691290714

real    0m16.099s
user    0m15.968s
sys     0m0.116s

$ time python test_gzip_buf.py ds_dump_US_1.jl.gz 
1691290714

real    0m12.040s
user    0m11.904s
sys     0m0.120s

With test_gzip.py being:

import gzip
import sys

total_bytes = 0
# Baseline: iterate over lines directly on the GzipFile.
with gzip.open(sys.argv[1], 'rb') as f:
    for line in f:
        total_bytes += len(line)
print(total_bytes)

and test_gzip_buf.py being:

import gzip
import sys
import io

total_bytes = 0
with gzip.open(sys.argv[1], 'rb') as f:
    # Wrap the GzipFile so that io.BufferedReader does the line splitting.
    with io.BufferedReader(f) as bf:
        for line in bf:
            total_bytes += len(line)
print(total_bytes)
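
For anyone without access to the original dump file, the comparison could be reproduced roughly as follows. This is only a sketch: the generated sample data, the function names, and the default line count are assumptions made for illustration, and the timings it prints will depend on the machine and the Python version.

```python
import gzip
import io
import os
import tempfile
import time


def count_bytes_plain(path):
    """Sum line lengths by iterating over the GzipFile directly."""
    total = 0
    with gzip.open(path, 'rb') as f:
        for line in f:
            total += len(line)
    return total


def count_bytes_buffered(path):
    """Sum line lengths by iterating over a BufferedReader wrapper."""
    total = 0
    with gzip.open(path, 'rb') as f:
        with io.BufferedReader(f) as bf:
            for line in bf:
                total += len(line)
    return total


def make_sample(path, n_lines=100000):
    """Write a synthetic gzipped JSON lines file (made-up records)."""
    with gzip.open(path, 'wb') as f:
        for i in range(n_lines):
            record = '{"id": %d, "name": "item-%d"}\n' % (i, i)
            f.write(record.encode('ascii'))


def timed(func, path):
    """Return (result, elapsed seconds) for a single call of func(path)."""
    start = time.time()
    result = func(path)
    return result, time.time() - start


if __name__ == '__main__':
    sample = os.path.join(tempfile.mkdtemp(), 'sample.jl.gz')
    make_sample(sample)
    for func in (count_bytes_plain, count_bytes_buffered):
        total, elapsed = timed(func, sample)
        print('%s: %d bytes in %.3fs' % (func.__name__, total, elapsed))
```

Both functions should report the same byte count; only the elapsed time is expected to differ.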

eliasdorneles commented 8 years ago

neat, thank you!