python / cpython


tarfile: ignore_zeros = True exceedingly slow on a sparse tar file #85020

Open edfaf2e4-7712-4830-8988-8302b536ec2d opened 4 years ago

edfaf2e4-7712-4830-8988-8302b536ec2d commented 4 years ago
BPO 40843

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['3.7']
title = 'tarfile: ignore_zeros = True exceedingly slow on a sparse tar file'
updated_at =
user = 'https://bugs.python.org/mxmlnkn'
```

bugs.python.org fields:

```python
activity =
actor = 'mxmlnkn'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation =
creator = 'mxmlnkn'
dependencies = []
files = []
hgrepos = []
issue_num = 40843
keywords = []
message_count = 1.0
messages = ['370611']
nosy_count = 1.0
nosy_names = ['mxmlnkn']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue40843'
versions = ['Python 3.7']
```

edfaf2e4-7712-4830-8988-8302b536ec2d commented 4 years ago

Consider this example replicating a real use case, in which I downloaded only the first ~1 GiB of the 1.191 TiB ImageNet archive, in sequential order, to preview it:

echo "foo" > bar
tar cf sparse.tar bar

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import tarfile
import time

t0 = time.time()
for tarInfo in tarfile.open( 'sparse.tar', 'r:', ignore_zeros = True ):
    pass
t1 = time.time()
print( f"Small TAR took {t1 - t0}s to iterate over" )

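# Open with 'r+b' so the existing archive is kept and only a sparse tail is
# appended; 'wb' would erase the archive before extending the file.
f = open( 'sparse.tar', 'r+b' )
f.truncate( 2*1024*1024*1024 )
f.close()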

t0 = time.time()
for tarInfo in tarfile.open( 'sparse.tar', 'r:', ignore_zeros = True ):
    pass
t1 = time.time()
print( f"Small TAR with sparse tail took {t1 - t0}s to iterate over" )

Output:

Small TAR took 0.00020813941955566406s to iterate over
Small TAR with sparse tail took 6.999570846557617s to iterate over

So tarfile iterates over sparse holes at only ~300 MiB/s (2 GiB in about 7 s). That sounds fast, but it is really slow for a 1.2 TiB file (more than an hour at that rate), especially considering that tarfile is doing basically >nothing< while skipping the hole.

There should be better options, such as using os.lseek with os.SEEK_DATA, where available, to skip those empty holes.
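As a rough illustration of that idea, here is a minimal sketch of how the next non-hole offset could be located. The helper name `next_data_offset` is hypothetical and not part of tarfile; the sketch assumes a platform and filesystem where `os.SEEK_DATA` works, and otherwise falls back to treating the whole file as data.

```python
import errno
import os


def next_data_offset(fd, offset):
    """Return the offset of the next data at or after `offset`.

    Hypothetical helper, not part of tarfile. Returns None when only a
    hole remains up to the end of the file. Note that os.lseek also
    moves the file position of `fd`.
    """
    seek_data = getattr(os, "SEEK_DATA", None)
    if seek_data is None:
        # Platform has no SEEK_DATA: assume everything is data.
        return offset
    try:
        return os.lseek(fd, offset, seek_data)
    except OSError as e:
        if e.errno == errno.ENXIO:
            # No data at or after `offset` (or offset is past the end).
            return None
        raise
```

Because tar headers sit on 512-byte boundaries, a caller would presumably round the returned offset down to a multiple of 512 (for example `pos // 512 * 512`) before resuming header parsing.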

An alternative would be an option to tell tarfile the maximum number of zeros it should skip. Personally, I only use the ignore_zeros option to be able to work with concatenated TARs, which in my case only have up to 19 empty 512-byte tar blocks to be skipped. Anything longer would indicate an invalid file. I'm aware that these maximum runs of zeros vary depending on the tar blocking factor, so the limit should be adjustable.
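To make the proposed limit concrete, the following is a small sketch of the check such an option might perform. `skip_zero_blocks` and its `max_zero_blocks` parameter are hypothetical names chosen for illustration; tarfile has no such option, and the default of 19 is just the value from my use case.

```python
BLOCK_SIZE = 512  # size of one tar block in bytes


def skip_zero_blocks(fileobj, max_zero_blocks=19):
    """Skip a run of all-zero tar blocks, up to an adjustable limit.

    Hypothetical helper, not part of tarfile. Leaves `fileobj`
    positioned at the first block that is not entirely zero (or at a
    short trailing block) and returns the number of blocks skipped.
    Raises ValueError if the run is longer than `max_zero_blocks`.
    """
    zero_block = bytes(BLOCK_SIZE)
    skipped = 0
    while True:
        pos = fileobj.tell()
        block = fileobj.read(BLOCK_SIZE)
        if block != zero_block:
            # Non-zero block, short block, or end of file: rewind and stop.
            fileobj.seek(pos)
            return skipped
        skipped += 1
        if skipped > max_zero_blocks:
            raise ValueError(
                f"more than {max_zero_blocks} consecutive zero blocks"
            )
```

With a limit like this, a 2 GiB sparse tail would be rejected after reading only a few KiB instead of being scanned block by block.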