opendataphilly / opendataphilly-ckan

Port of OpenDataPhilly to CKAN
3 stars 2 forks source link

Assemble incident report for 12/3 downtime #92

Closed hectcastro closed 5 years ago

hectcastro commented 5 years ago

Since 2018-12-03 01:51:08 EST, the OpenDataPhilly website was flapping between up and down until approximately 2018-12-03 07:28:18 EST.

We should include time in this estimate to dig a bit more into the root cause by using the inbound requests that were failing and mapping them back to relevant exceptions (/var/log/apache2/ckan_default.error.log).

jeancochrane commented 5 years ago

Drafting the incident report here: https://docs.google.com/document/d/1HdfB2QQWE6riJxRjzxy06klZqU4FrSJZ0C18_3i86JM/edit?usp=sharing

jeancochrane commented 5 years ago

Did a little more digging on this by comparing the access and error logs.

I'm not entirely convinced that the bots were the root problem here. Going back to logs from about a week ago, you can see that we were getting consistently busy traffic from bots even back then, and all was well:

127.0.0.1 - - [25/Nov/2018:06:45:58 +0000] "GET /dataset?_organization_limit=0&groups=health-human-services-group&groups=planning-zoning-group&groups=public-safety-group&groups=real-estate-land-records-group&license_id=other-philadelphia&organization=city-of-philadelphia&res_format=api&res_format=geojson&res_format=html&res_format=shp HTTP/1.0" 500 7656 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:45:56 +0000] "GET /dataset?_groups_limit=0&_organization_limit=0&groups=transportation-group&res_format=CSV&res_format=shp&tags=DVRPC&tags=Freight&tags=Rail&tags=Transportation HTTP/1.0" 200 3768 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:00 +0000] "GET /dataset?_groups_limit=0&groups=health-human-services-group&groups=parks-recreation-group&groups=planning-zoning-group&license_id=other-philadelphia&organization=city-of-philadelphia&published=true&res_format=CSV&tags=injury HTTP/1.0" 200 3856 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:02 +0000] "GET /dataset?q=&res_format=geojson&_organization_limit=0&sort=metadata_modified+desc&_groups_limit=0&organization=city-of-philadelphia HTTP/1.0" 200 8289 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
127.0.0.1 - - [25/Nov/2018:06:46:03 +0000] "GET /dataset?_organization_limit=0&groups=transportation-group&license_id=other-philadelphia&res_format=CSV&res_format=api&tags=injury HTTP/1.0" 200 3738 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:04 +0000] "GET /dataset?_groups_limit=0&_organization_limit=0&groups=economy-group&groups=planning-zoning-group&groups=public-safety-group&license_id=other-philadelphia&published=true&res_format=CSV&res_format=HTML&res_format=geojson&res_format=shp HTTP/1.0" 200 3848 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:05 +0000] "GET /dataset?_res_format_limit=0&groups=boundaries&license_id=other-philadelphia&published=true&res_format=HTML&res_format=geojson&res_format=shp HTTP/1.0" 200 3759 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:06 +0000] "GET /dataset?_groups_limit=0&_organization_limit=0&_tags_limit=0&groups=health-human-services-group&license_id=notspecified&res_format=CSV&res_format=geojson&res_format=shp HTTP/1.0" 200 3742 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:06 +0000] "GET /dataset?_res_format_limit=0&_tags_limit=0&organization=delaware-valley-regional-planning-commission&published=true&res_format=geojson&res_format=shp&tags=Buildings&tags=Pennsylvania&tags=Structures HTTP/1.0" 200 3820 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [25/Nov/2018:06:46:07 +0000] "GET /dataset?license_id=other-philadelphia&res_format=api&res_format=shp&tags=Philadelphia%2BAirport HTTP/1.0" 200 3689 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"

I think the issue may have actually been a race condition that pushed the SQLAlchemy session into a degraded state, one that was fixed by restarting Apache. The logs seem fine until 06:28:19 UTC last night, when the server tries to respond to two requests that subsequently fail, one of which is from a bot and one of which appears to be from a person:

127.0.0.1 - - [03/Dec/2018:06:28:19 +0000] "GET /dataset?_organization_limit=0&_tags_limit=0&groups=economy-group&groups=planning-zoning-group&license_id=other-philadelphia&published=true&res_format=CSV&res_format=HTML&res_format=geojson&res_format=geoservice HTTP/1.0" 500 805 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"
127.0.0.1 - - [03/Dec/2018:06:28:19 +0000] "GET /group/education-group HTTP/1.0" 500 809 "https://www.opendataphilly.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"

Both of these requests raise different errors in the same code block of the ckanext-pages plugin, which we use to generate page templates. The human request complains that a table mapping is misconfigured in some way I don't completely understand:

[Mon Dec 03 06:28:21.960959 2018] [:error] [pid 17417:tid 140200774108928] [remote 127.0.0.1:56339] Error - <class 'sqlalchemy.exc.ArgumentError'>: When configuring property 'created' on Mapper|_Page|ckanext_pages, column 'ckanext_pages.created' is not represented in the mapper's table. Use the `column_property()` function to force this column to be mapped as a read-only attribute.

While the bot request complains that a primary table mapping is already defined, which shouldn't be possible:

[Mon Dec 03 06:28:21.964167 2018] [:error] [pid 17417:tid 140200765716224] [remote 127.0.0.1:46099] Error - <class 'sqlalchemy.exc.ArgumentError'>: Class '<class 'ckanext.pages.db._Page'>' already has a primary mapper defined. Use non_primary=True to create a non primary Mapper.  clear_mappers() will remove *all* current mappers from all classes.

Both of these errors occur when trying to execute the same block of code in ckanext-pages, one that tries to create a pages table when it's not available:

def _pages_list(context, data_dict):
    search = {}
    if db.pages_table is None:
        db.init_db(context['model'])

The call to db.init_db() is supposed to finish by initializing the table mapping:

    global pages_table
    pages_table = sa.Table('ckanext_pages', model.meta.metadata,
        sa.Column('id', types.UnicodeText, primary_key=True, default=make_uuid),
        sa.Column('title', types.UnicodeText, default=u''),
        sa.Column('name', types.UnicodeText, default=u''),
        sa.Column('content', types.UnicodeText, default=u''),
        sa.Column('lang', types.UnicodeText, default=u''),
        sa.Column('order', types.UnicodeText, default=u''),
        sa.Column('private',types.Boolean,default=True),
        sa.Column('group_id', types.UnicodeText, default=None),
        sa.Column('user_id', types.UnicodeText, default=u''),
        sa.Column('publish_date', types.DateTime),
        sa.Column('page_type', types.DateTime),
        sa.Column('created', types.DateTime, default=datetime.datetime.utcnow),
        sa.Column('modified', types.DateTime, default=datetime.datetime.utcnow),
        sa.Column('extras', types.UnicodeText, default=u'{}'),
        extend_existing=True
    )

    model.meta.mapper(
        Page,
        pages_table,
    )

In this context, the bot error smells like a race condition to me: the only way I can see that the application would reach this code block and then complain that a mapping already exists is if the same code block was being executed coterminously by another thread.

After these errors, some requests begin to throw AttributeErrors with an esoteric SQLAlchemy traceback coming from the same template generation code in ckanext-pages. I'll save you the traceback here, but the gist is that the plugin tries to run an all() query on a queryset of Pages, and SQLAlchemy complains that a query strategy that it expects to exist doesn't in fact exist. This is the error that starts to raise Panopta's alarms, and a weird aspect of it is that some requests go through just fine and some requests throw this error, which caused Panopta to report it as a sequence of short outages instead of one big failure.

I can think of two possible reasons for this behavior:

I'd need to dig into the SQLAlchemy and Apache/Nginx configurations before being able to assess these possibilities, but until then I'm curious how this diagnosis looks to others.

jeancochrane commented 5 years ago

More interesting data: You can see from the two error logs above that the errors originate in the same PID, meaning the same worker handled those requests. In the logs that follow, failing requests consistently come from the 17417 worker, while the 17416 worker is doing fine, e.g.:

[Mon Dec 03 06:28:22.148542 2018] [:error] [pid 17417:tid 140200782501632] [remote 127.0.0.1:51987] AttributeError: strategy
[Mon Dec 03 06:28:22.578903 2018] [:error] [pid 17416:tid 140200824465152] 2018-12-03 06:28:22,578 INFO  [ckan.config.environment] Loading static files from public

This seems to be compelling evidence that the occasional 200s weren't coming from cached pages, but were rather being handled by the healthy worker.

jeancochrane commented 5 years ago

Final draft of the incident report is available here: https://docs.google.com/document/d/1HdfB2QQWE6riJxRjzxy06klZqU4FrSJZ0C18_3i86JM/edit?usp=sharing

Three follow-up issues are now open to capture our corrective and preventative measures: #91, #93, and #94.