openhatch / oh-bugimporters

Bug importers for the OpenHatch project oh-mainline
https://oh-bugimporters.readthedocs.org/
GNU Affero General Public License v3.0

SCons: Bug crawling process still throws errors due to empty issue descriptions #108

Closed ehashman closed 9 years ago

ehashman commented 9 years ago

@dirkbaechle wrote:

@ehashman @paulproteus After some magic fix on your side (big thanks for this!), basic bug scraping seems to work again (see the latest logs at http://inside.openhatch.org/crawl-logs/), but still no SCons bugs show up on our project's page. Looking at the latest log, there is this message:

IntegrityError: (1048, "Column 'description' cannot be null")

at the end. This looks like a Django error to me, so I have no idea how to fix it. Checking the individual issue entries, there are indeed several where no description is given. So either the field should be allowed to be empty, or the importer could copy the "title" to the description as a fallback, and assign the string "empty" as a last resort. Either way, some action has to be taken to allow for empty descriptions (and probably some other fields as well), because counting on external projects to stick to certain guidelines for entering bugs is just hopeless. ;)

Regards,

Dirk

I'd lean towards believing this is a bug in the implementation of the SCons crawler (we usually don't put a restriction on fields being null unless that's important), so I'll take a look there. If we decide it doesn't make sense to have a restriction on this field, we can also rip that out. That's a super easy fix (it involves adding an optional parameter in a model somewhere), but it's not going to play nice with the migrations, so I'm going to look into the crawler first.
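For illustration, the fallback Dirk proposes (use the title when the description is missing, then the literal string "empty") could look roughly like the sketch below. The function name and its parameters are hypothetical and not part of oh-bugimporters; this only shows the idea in plain Python, not how the importer actually handles it.

```python
def description_with_fallback(description, title, placeholder="empty"):
    """Return a non-empty description for a scraped bug.

    Hypothetical helper: prefers the description, then the title,
    then a fixed placeholder string.
    """
    for candidate in (description, title):
        # Treat None and whitespace-only strings as missing.
        if candidate is not None and candidate.strip():
            return candidate
    return placeholder


print(description_with_fallback(None, "Build fails on Windows"))  # Build fails on Windows
print(description_with_fallback("   ", None))                     # empty
```

A helper like this would keep the database constraint in place and only change what the crawler hands to the model.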

dirkbaechle commented 9 years ago

Thanks for having a look, although I think I checked the latest logs for this and couldn't find any SCons issues with an empty description. And since we're the first to use the tigris bug importer, chances are high that we're currently the only ones.

But two pairs of eyes always see more than one. ;)

dirkbaechle commented 9 years ago

I wrote a small script (see below) to find projects that have non-closed bugs listed in the scraper log but no entries (existing_bug_urls) in the OpenHatch database itself. The currently affected projects are:

aida, Postorius (GNU Mailman Web UI), BleachBit, solum, zero-k, Evennia, Miranda Bug Tracker, SCons, GNU Mailman, ConnId, py2c, rietveld, Spyder

This is based on the latest log from inside.openhatch.org:

python scan_ohbugs.py scrapy.2015-01-04.ZfPH.log

Script (scan_ohbugs.py):

import json
import os
import re
import sys
from urllib.parse import urlencode
from urllib.request import urlopen

# Patterns for fields in the scraper log. The optional "u" prefix
# covers Python 2 unicode reprs such as u'SCons'.
re_project = re.compile(r"'_project_name': u?'([^']+)'")
re_tracker = re.compile(r"'_tracker_name': u?'([^']+)'")
re_status = re.compile(r"'status': u?'([^']+)'")

# Maps project name -> tracker name for every project that has
# at least one non-closed bug in the log.
has_bugs = {}

# Most recently seen project and tracker names while scanning the log.
cproject = None
ctracker = None

print("Parsing %s..." % sys.argv[1])
with open(sys.argv[1], "r") as f:
    for line in f:
        m = re_project.search(line)
        if m:
            cproject = m.group(1)
        m = re_tracker.search(line)
        if m:
            ctracker = m.group(1)
        m = re_status.search(line)
        if m and m.group(1).lower() != "closed":
            # Skip status lines seen before any project name.
            if cproject is not None and cproject not in has_bugs:
                has_bugs[cproject] = ctracker

if not os.path.exists('openhatch.json'):
    # Download the current database in a single request
    print("Downloading OpenHatch data to openhatch.json...")
    query_args = {'format': 'json', 'limit': '0'}
    url = 'http://openhatch.org/+api/v1/customs/tracker_model/?' + urlencode(query_args)
    with urlopen(url) as response, open('openhatch.json', "wb") as j:
        j.write(response.read())

# Collect the names of trackers that have no bug URLs in the database
print("Reading openhatch.json...")
tns = set()
with open('openhatch.json', 'r', encoding='utf-8') as fin:
    od = json.load(fin)
    for o in od['objects']:
        if not o['existing_bug_urls']:
            tns.add(o['tracker_name'])

# Report projects that have open bugs in the log but none in the database
for project, tracker in has_bugs.items():
    if project in tns or tracker in tns:
        print(" %s : %s" % (project, tracker))

dirkbaechle commented 9 years ago

Related to #58

ehashman commented 9 years ago

Per openhatch/oh-mainline#1515, this looks resolved!