palewire / django-calaccess-downloads-website

An open-source archive of campaign finance and lobbying disclosure data from the California Secretary of State’s CAL-ACCESS database
http://calaccess.californiacivicdata.org
MIT License

Raw Data not being updated? #170

Closed gordonje closed 7 years ago

gordonje commented 7 years ago

Just noticed that our website has not been updated since Sunday.

I'm looking into the server log, but it appears that new snapshots are not being released.

$ python manage.py shell
>>> import requests
>>> r = requests.head('http://campaignfinance.cdn.sos.ca.gov/dbwebexport.zip')
>>> r.headers['Last-Modified']
'Sun, 19 Mar 2017 11:20:54 GMT'

gordonje commented 7 years ago

Checked the server log. Since the last update on Sunday, there have been about 15 update attempts, most of which are throwing this traceback error:

Traceback (most recent call last):
  File "/apps/calaccess/repo/manage.py", line 35, in <module>
    execute_from_command_line(sys.argv)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/base.py", line 305, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/base.py", line 356, in execute
    output = self.handle(*args, **options)
  File "/apps/calaccess/repo/calaccess_website/management/commands/updatedownloadswebsite.py", line 35, in handle
    super(Command, self).handle(*args, **options)
  File "/apps/calaccess/local/lib/python2.7/site-packages/calaccess_raw/management/commands/updatecalaccessrawdata.py", line 124, in handle
    download_metadata = self.get_download_metadata()
  File "/apps/calaccess/local/lib/python2.7/site-packages/calaccess_raw/management/commands/__init__.py", line 47, in get_download_metadata
    last_modified = request.headers['last-modified']
  File "/apps/calaccess/local/lib/python2.7/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'last-modified'

These appear to be cases where the response to our HEAD request does not include a Last-Modified value.

In the remaining cases, the Last-Modified and Content-Length values were identical to what we had on Sunday.

Will keep this issue open until the regular updates start coming in.

Also might want to catch and log the status code of the head response. That would be a change in the raw-data app.
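A minimal sketch of what that logging could look like, assuming the check receives a response object with `.status_code` and `.headers` (such as the one returned by `requests.head`). The function name `read_head_metadata` is hypothetical and is not part of the raw-data app; it only illustrates logging the status code and avoiding the KeyError above.

```python
import logging

logger = logging.getLogger(__name__)


def read_head_metadata(response):
    """Return metadata from a HEAD response, or None if it is unusable.

    Hypothetical helper: `response` is anything exposing `.status_code`
    and `.headers`, e.g. the result of `requests.head(url)`.
    """
    # Always record the status code so failed checks can be diagnosed later.
    logger.info("HEAD request returned status %s", response.status_code)

    if response.status_code != 200:
        logger.warning("Skipping update: unexpected status %s", response.status_code)
        return None

    # .get() returns None instead of raising KeyError when the header is absent.
    last_modified = response.headers.get("Last-Modified")
    if last_modified is None:
        logger.warning("Skipping update: response has no Last-Modified header")
        return None

    return {
        "last_modified": last_modified,
        "content_length": response.headers.get("Content-Length"),
    }
```

With something like this, a 504 or a response without the header would show up in the log as a skipped check rather than a traceback.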

palewire commented 7 years ago

logger sounds like a great idea either way

gordonje commented 7 years ago

Still no update today. I'm now getting 504 (Gateway Time-out) errors.

gordonje commented 7 years ago

Seems like the SoS IT people have resolved the issue on their end. On Friday (24/Mar/2017 23:45:02 server time) our downloads-website EC2 instance logged a new version of CAL-ACCESS. I also just got an email from David Walker in the SoS office, stating that this has been resolved.

However, our website builds are still behind. The process is failing during the cleancalaccessrawfile command. Here's the traceback:

Traceback (most recent call last):
  File "/apps/calaccess/repo/manage.py", line 35, in <module>
    execute_from_command_line(sys.argv)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute
    output = self.handle(*args, **options)
  File "/apps/calaccess/repo/calaccess_website/management/commands/updatedownloadswebsite.py", line 35, in handle
    super(Command, self).handle(*args, **options)
  File "/apps/calaccess/local/lib/python2.7/site-packages/calaccess_raw/management/commands/updatecalaccessrawdata.py", line 308, in handle
    self.clean()
  File "/apps/calaccess/local/lib/python2.7/site-packages/calaccess_raw/management/commands/updatecalaccessrawdata.py", line 385, in clean
    keep_file=self.keep_files,
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 113, in call_command
    command = load_command_class(app_name, command_name)
  File "/apps/calaccess/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 40, in load_command_class
    module = import_module('%s.management.commands.%s' % (app_name, name))
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/apps/calaccess/local/lib/python2.7/site-packages/calaccess_raw/management/commands/cleancalaccessrawfile.py", line 15, in <module>
    from csvkit import reader, writer
  File "/apps/calaccess/local/lib/python2.7/site-packages/csvkit/__init__.py", line 15, in <module>
    import agate
  File "/apps/calaccess/local/lib/python2.7/site-packages/agate/__init__.py", line 5, in <module>
    from agate.aggregations import *
  File "/apps/calaccess/local/lib/python2.7/site-packages/agate/aggregations/__init__.py", line 20, in <module>
    from agate.aggregations.all import All  # noqa
  File "/apps/calaccess/local/lib/python2.7/site-packages/agate/aggregations/all.py", line 4, in <module>
    from agate.data_types import Boolean
  File "/apps/calaccess/local/lib/python2.7/site-packages/agate/data_types/__init__.py", line 14, in <module>
    from agate.data_types.date import Date  # noqa
  File "/apps/calaccess/local/lib/python2.7/site-packages/agate/data_types/date.py", line 5, in <module>
    import isodate

So we are importing csvkit, which is importing agate, which is importing isodate, which is not found. This is probably something I screwed up the last time I deployed the website, when updating the raw-data Django app to the latest version.
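One way to catch this kind of gap before a deploy is to check that each package in the chain can be found. The sketch below is only an illustration: the helper name `find_missing_modules` is hypothetical, and it uses Python 3's `importlib.util.find_spec`, which is not available on the Python 2.7 environment shown in the traceback.

```python
import importlib.util


def find_missing_modules(names):
    """Return the top-level module names that cannot be located."""
    # find_spec() returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# The import chain from the traceback above.
print(find_missing_modules(["csvkit", "agate", "isodate"]))
```

An empty list would mean all three packages can be located in the active environment.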

I tried running pip install isodate on the server, and got this error:

Downloading/unpacking isodate
  Downloading isodate-0.5.4.tar.gz
Cleaning up...
setuptools must be installed to install from a source distribution
Storing debug log for failure in /home/ccdc/.pip/pip.log

The traceback in the log file says:

Traceback (most recent call last):
  File "/apps/calaccess/local/lib/python2.7/site-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/apps/calaccess/local/lib/python2.7/site-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/apps/calaccess/local/lib/python2.7/site-packages/pip/req.py", line 1229, in prepare_files
    req_to_install.run_egg_info()
  File "/apps/calaccess/local/lib/python2.7/site-packages/pip/req.py", line 292, in run_egg_info
    logger.notify('Running setup.py (path:%s) egg_info for package %s' % (self.setup_py, self.name))
  File "/apps/calaccess/local/lib/python2.7/site-packages/pip/req.py", line 269, in setup_py
    "setuptools must be installed to install from a source "
InstallationError: setuptools must be installed to install from a source distribution

It's at this point I decide that it's time to upgrade pip:

$ pip install -U pip
Downloading/unpacking pip from https://pypi.python.org/packages/b6/ac/7015eb97dc749283ffdec1c3a88ddb8ae03b8fad0f0e611408f196358da3/pip-9.0.1-py2.py3-none-any.whl#md5=297dbd16ef53bcef0447d245815f5144
  Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Installing collected packages: pip
  Found existing installation: pip 1.5.4
    Uninstalling pip:
      Successfully uninstalled pip
Successfully installed pip
Cleaning up...

But then I get a different error when I try pip install isodate again:

Collecting isodate
/apps/calaccess/local/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/apps/calaccess/local/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading isodate-0.5.4.tar.gz
Could not import setuptools which is required to install from a source distribution.
Traceback (most recent call last):
  File "/apps/calaccess/local/lib/python2.7/site-packages/pip/req/req_install.py", line 387, in setup_py
    import setuptools  # noqa
  File "/apps/calaccess/local/lib/python2.7/site-packages/setuptools/__init__.py", line 12, in <module>
    import setuptools.version
  File "/apps/calaccess/local/lib/python2.7/site-packages/setuptools/version.py", line 1, in <module>
    import pkg_resources
  File "/apps/calaccess/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 72, in <module>
    import packaging.requirements
  File "/apps/calaccess/local/lib/python2.7/site-packages/packaging/requirements.py", line 9, in <module>
    from pyparsing import stringStart, stringEnd, originalTextFor, ParseException
ImportError: No module named pyparsing

/apps/calaccess/local/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning

We are currently running Python 2.7.6 on our server, so I followed the suggestion above and updated to the latest version, 2.7.13. These instructions seemed suitable enough, though they've led me to create a separate virtualenv. Will need to go back and incorporate some of these updates into our chef recipes.

gordonje commented 7 years ago

Still unpacking all of this. Here's another interesting tidbit: It appears as though 'Last-modified' in the header can be off by about a half minute:

In [1]: import requests

In [2]: url = 'http://campaignfinance.cdn.sos.ca.gov/dbwebexport.zip'

In [3]: r = requests.head(url)

In [4]: r.headers['Last-modified']
Out[4]: 'Tue, 28 Mar 2017 11:20:55 GMT'

In [5]: r = requests.head(url)

In [6]: r.headers['Last-modified']
Out[6]: 'Tue, 28 Mar 2017 11:20:28 GMT'

This is causing our download/update process to treat these as separate releases. We might need to replace the logic that compares the exact values of 'Last-modified' with a check that they are within a minute of each other (or thereabouts).
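A rough sketch of that tolerance check, assuming Python 3 (for `email.utils.parsedate_to_datetime`) and a one-minute window. The function name `is_same_release` and the default tolerance are illustrative choices, not the raw-data app's existing logic.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def is_same_release(last_modified_a, last_modified_b, tolerance=timedelta(minutes=1)):
    """Return True if two Last-Modified header values fall within `tolerance`."""
    # Parse the HTTP date strings into datetime objects.
    a = parsedate_to_datetime(last_modified_a)
    b = parsedate_to_datetime(last_modified_b)
    return abs(a - b) <= tolerance


# The two values seen above differ by 27 seconds.
print(is_same_release('Tue, 28 Mar 2017 11:20:55 GMT', 'Tue, 28 Mar 2017 11:20:28 GMT'))  # True
# A genuinely older release is still treated as different.
print(is_same_release('Tue, 28 Mar 2017 11:20:28 GMT', 'Sun, 19 Mar 2017 11:20:54 GMT'))  # False
```

Since releases seem to arrive about once a day, a window of a minute or so should be well clear of confusing two real releases.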

palewire commented 7 years ago

As of this morning, July 5, 2017, the CAL-ACCESS bulk download has not updated in five days. The last update was on June 30, 2017.

$ date
Wed Jul  5 12:07:58 PDT 2017
$ curl -I http://campaignfinance.cdn.sos.ca.gov/dbwebexport.zip
HTTP/1.1 200 OK
Server: Apache/2.2.3 (Red Hat)
Last-Modified: Fri, 30 Jun 2017 11:20:28 GMT
ETag: "2320c8-305b5d54-9ab7f700"
Accept-Ranges: bytes
Content-Length: 811294036
Content-Type: application/zip
Date: Wed, 05 Jul 2017 19:08:01 GMT
Connection: keep-alive

palewire commented 7 years ago

Despite assurances otherwise from the Secretary of State office, as of 7 AM this morning the raw data download still has not updated.

>>> import requests
>>> url = 'http://campaignfinance.cdn.sos.ca.gov/dbwebexport.zip'
>>> r = requests.head(url)
>>> r.headers['Last-modified']
'Fri, 30 Jun 2017 11:20:28 GMT'

palewire commented 7 years ago

Looks like this was fixed.