using cache_to fails to work

vEpiphyte commented 4 years ago

This is an independent repro of a problem related to #60 that was not tested previously.

On current master, reading descriptors with cache_to set fails on the second run: the file cached by the first run does not pass validation against CollecTor's checksum.

Here is my repro code:

import os
import sys
import datetime

import stem.descriptor.collector as sd_collector

fdir = '/tmp/stem_bg60'  # I independently made this directory

def main():
    print(f'{fdir} contents')
    print(os.listdir(fdir))

    start = datetime.datetime(year=2020, month=8, day=1)
    end = datetime.datetime(year=2020, month=8, day=3)
    genr = sd_collector.get_server_descriptors(
        start=start,
        end=end,
        cache_to=fdir,
        timeout=600
    )

    recs = 0
    for rec in genr:
        recs = recs + 1

    print(f'got {recs} recs')

    print('running again with cache dir')

    print(f'{fdir} contents')
    print(os.listdir(fdir))

    genr = sd_collector.get_server_descriptors(
        start=start,
        end=end,
        cache_to=fdir,
        timeout=600
    )

    recs = 0
    for rec in genr:
        recs = recs + 1

    print(f'got {recs} recs')

    return 0

if __name__ == '__main__':
    sys.exit(main())

This fails as soon as it tries to use the cached data, with the following error:

(stem379) epiphyte@vertex05:~/git/stem$ python bg60.py 
/tmp/stem_bg60 contents
[]
got 24465 recs
running again with cache dir
/tmp/stem_bg60 contents
['server-descriptors-2020-08.tar']
Traceback (most recent call last):
  File "bg60.py", line 52, in <module>
    sys.exit(main())
  File "bg60.py", line 44, in main
    for rec in genr:
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 103, in get_server_descriptors
    for desc in get_instance().get_server_descriptors(start, end, cache_to, bridge, timeout, retries):
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 434, in get_server_descriptors
    for desc in f.read(cache_to, desc_type, start, end, timeout = timeout, retries = retries):
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 273, in read
    path = self.download(directory, True, timeout, retries)
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 335, in download
    raise OSError("%s already exists but mismatches CollecTor's checksum (expected: %s, actual: %s)" % (path, expected_hash, actual_hash))
OSError: /tmp/stem_bg60/server-descriptors-2020-08.tar already exists but mismatches CollecTor's checksum (expected: 5f5c62fa5691d520017ef107c1d6ea4f29af2e5aabf959373da31755c30d21d8, actual: 352b10fae3e221fb3287d8e1dfd754eb43f3058d94ee8940d090f34971b01f70)

Running it again fails right away:

(stem379) epiphyte@vertex05:~/git/stem$ python bg60.py 
/tmp/stem_bg60 contents
['server-descriptors-2020-08.tar']
Traceback (most recent call last):
  File "bg60.py", line 51, in <module>
    sys.exit(main())
  File "bg60.py", line 25, in main
    for rec in genr:
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 103, in get_server_descriptors
    for desc in get_instance().get_server_descriptors(start, end, cache_to, bridge, timeout, retries):
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 434, in get_server_descriptors
    for desc in f.read(cache_to, desc_type, start, end, timeout = timeout, retries = retries):
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 273, in read
    path = self.download(directory, True, timeout, retries)
  File "/home/epiphyte/git/stem/stem/descriptor/collector.py", line 335, in download
    raise OSError("%s already exists but mismatches CollecTor's checksum (expected: %s, actual: %s)" % (path, expected_hash, actual_hash))
OSError: /tmp/stem_bg60/server-descriptors-2020-08.tar already exists but mismatches CollecTor's checksum (expected: 5f5c62fa5691d520017ef107c1d6ea4f29af2e5aabf959373da31755c30d21d8, actual: 352b10fae3e221fb3287d8e1dfd754eb43f3058d94ee8940d090f34971b01f70)

Environment information (Ubuntu 18.04):

(stem379) epiphyte@vertex05:~/git/stem$ git rev-parse HEAD
ab835c1a2972a654af991d9690776b755a9450c1
(stem379) epiphyte@vertex05:~/git/stem$ python --version
Python 3.7.9
(stem379) epiphyte@vertex05:~/git/stem$ python -m pip freeze
appdirs==1.4.4
cffi==1.14.3
cryptography==3.1.1
distlib==0.3.1
filelock==3.0.12
importlib-metadata==1.7.0
mock==4.0.2
packaging==20.4
pluggy==0.13.1
py==1.9.0
pycodestyle==2.6.0
pycparser==2.20
pyflakes==2.2.0
pyparsing==2.4.7
six==1.15.0
toml==0.10.1
tox==3.20.0
virtualenv==20.0.31
zipp==3.2.0
(stem379) epiphyte@vertex05:~/git/stem$ uname -a
Linux vertex05 4.15.0-117-generic #118-Ubuntu SMP Fri Sep 4 20:02:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

atagar commented 4 years ago

I see the problem. We decompress server-descriptors-2020-08.tar.xz when it is first downloaded, then on later runs compare the decompressed tarball's checksum with the index, which records the checksum of the compressed file. Note the two hash values in the exception...

OSError: /tmp/stem_bg60/server-descriptors-2020-08.tar already exists but mismatches CollecTor's checksum (
  expected: 5f5c62fa5691d520017ef107c1d6ea4f29af2e5aabf959373da31755c30d21d8,
  actual: 352b10fae3e221fb3287d8e1dfd754eb43f3058d94ee8940d090f34971b01f70
)

... and how they compare with the following...

File's CollecTor index entry

{
  "path": "server-descriptors-2020-08.tar.xz",
  "size": 228750972,
  "last_modified": "2020-09-07 11:59",
  "types": ["server-descriptor 1.0"],
  "first_published": "2020-08-01 00:00",
  "last_published": "2020-08-31 23:59",
  "sha256": "X1xi+laR1SABfvEHwdbqTymvLlqr+Vk3PaMXVcMNIdg="
}

Index's checksum

>>> import base64, binascii
>>> index_checksum = 'X1xi+laR1SABfvEHwdbqTymvLlqr+Vk3PaMXVcMNIdg='
>>> binascii.hexlify(base64.b64decode(index_checksum)).decode('utf-8')
'5f5c62fa5691d520017ef107c1d6ea4f29af2e5aabf959373da31755c30d21d8'

Compressed file's checksum

>>> import hashlib
>>> with open('/home/atagar/Desktop/server-descriptors-2020-08.tar.xz', 'rb') as collector_file:
...   hashlib.sha256(collector_file.read()).hexdigest()
... 
'5f5c62fa5691d520017ef107c1d6ea4f29af2e5aabf959373da31755c30d21d8'

Decompressed file's checksum

>>> with open('/home/atagar/Desktop/server-descriptors-2020-08.tar', 'rb') as collector_file:
...   hashlib.sha256(collector_file.read()).hexdigest()
... 
'352b10fae3e221fb3287d8e1dfd754eb43f3058d94ee8940d090f34971b01f70'
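
The same mismatch can be reproduced without downloading an archive. The following is only a minimal sketch using made-up data: `payload` stands in for the tarball and `index_checksum` is built locally to mimic the base64 "sha256" field of the index entry. None of this is Stem's code.

```python
import base64
import hashlib
import lzma

# Made-up stand-in for the contents of a .tar file.
payload = b"example descriptor line\n" * 1000

# Stand-in for the .tar.xz that CollecTor serves.
compressed = lzma.compress(payload, format=lzma.FORMAT_XZ)

# Mimic the index entry: base64 of the SHA-256 digest of the *compressed* file.
index_checksum = base64.b64encode(hashlib.sha256(compressed).digest()).decode("ascii")
expected_hex = base64.b64decode(index_checksum).hex()

compressed_hex = hashlib.sha256(compressed).hexdigest()
decompressed_hex = hashlib.sha256(lzma.decompress(compressed)).hexdigest()

print("compressed matches index:  ", compressed_hex == expected_hex)
print("decompressed matches index:", decompressed_hex == expected_hex)
```

Only the compressed bytes hash to the index value, which is why a cache holding the decompressed .tar can never pass the check.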

We can fix this in a couple of ways...

1. Cache the compressed file. This will retain our integrity check and reduce disk usage, but greatly increase the time it takes to read cached files.
2. Simply skip the integrity check if the cached file has been decompressed.

I'm leaning toward the latter because a sluggish cache is rather unhelpful.
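
For illustration, here is a rough sketch of how the two approaches differ at the point where a cached file is checked. `check_cached` and `sha256_hex` are hypothetical helpers, not functions in Stem, and the files are small made-up stand-ins written to a temporary directory.

```python
import hashlib
import lzma
import os
import tempfile


def sha256_hex(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_cached(path, expected_hex):
    """Hypothetical check of a cached file against the index checksum."""
    if not path.endswith(".xz"):
        # Skip approach: the index hash covers the .xz file, so a
        # decompressed cache entry cannot be compared against it.
        return "skipped"
    # Compressed-cache approach: the cached bytes are what the index describes.
    actual_hex = sha256_hex(path)
    if actual_hex != expected_hex:
        raise OSError("%s mismatches the index checksum (expected: %s, actual: %s)"
                      % (path, expected_hex, actual_hex))
    return "verified"


with tempfile.TemporaryDirectory() as tmp:
    payload = b"stand-in for tarball bytes\n" * 100
    xz_path = os.path.join(tmp, "example.tar.xz")
    tar_path = os.path.join(tmp, "example.tar")
    with open(xz_path, "wb") as f:
        f.write(lzma.compress(payload))
    with open(tar_path, "wb") as f:
        f.write(payload)

    expected_hex = sha256_hex(xz_path)  # plays the role of the index value
    compressed_result = check_cached(xz_path, expected_hex)
    decompressed_result = check_cached(tar_path, expected_hex)

print(compressed_result, decompressed_result)
```

The trade-off is visible here: the compressed cache keeps the integrity check, while the decompressed cache reads faster but is taken on trust.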

Thanks for catching this! Would you care to fix this or shall I?

vEpiphyte commented 4 years ago

Unfortunately, I'm not familiar enough with the internals of Stem to fix this in a timely fashion. If you've got the time to fix it, that would be great! I don't have a strong preference about the two options presented. Everyone likes a fast cache though, which is what makes them useful :)

atagar commented 3 years ago

In the end I decided to opt for the former (cache compressed files). Fix pushed...

https://gitweb.torproject.org/stem.git/commit/?id=78ad708
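
The linked commit is the authoritative change. Purely as an illustration of what reading from a compressed cache can look like, the sketch below uses only the standard library; `iter_cached_members` is a hypothetical helper, and the archive is a tiny made-up one built in a temporary directory.

```python
import io
import os
import tarfile
import tempfile


def iter_cached_members(path):
    """Yield (name, content) for each regular file in a cached .tar.xz."""
    # Mode 'r:xz' decompresses while reading, so no .tar copy is written to disk.
    with tarfile.open(path, "r:xz") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "example-descriptors.tar.xz")
    data = b"example descriptor\n"

    # Build a one-file archive to stand in for a cached download.
    with tarfile.open(cache_path, "w:xz") as tar:
        info = tarfile.TarInfo("example/descriptor")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

    entries = list(iter_cached_members(cache_path))
    cached_files = os.listdir(tmp)

print(entries)
print(cached_files)
```

Every read pays the decompression cost again, which is the slowdown weighed earlier in the thread, but the file on disk stays byte-identical to what the index checksum describes.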
