mysociety / theyworkforyou

Keeping tabs on the UK's parliaments and assemblies
http://www.theyworkforyou.com/

write a scraper and parser for written answers for the Scottish Parliament #217

Open mhl opened 10 years ago

mhl commented 10 years ago

Although we now have debates in the main chamber of the Scottish Parliament up on TheyWorkForYou again, another feature that was lost in the redesign of the parliament website was the written questions and answers.

Questions in the Scottish Parliament have a unique code, which can be used directly in a URL - for example, S4W-1234 can be found here:

http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx?SearchType=Advance&ReferenceNumbers=S4W-1234

They can also be searched for by date range on this page:

http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx?SearchType=Advance
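As a rough sketch of building the per-question URL above (Python 3; `question_url` is a made-up helper name, and it uses only the `SearchType` and `ReferenceNumbers` parameters visible in that URL, since the date-range parameters aren't given here):

```python
from urllib.parse import urlencode

# Search page taken from the URLs quoted above.
BASE_URL = "http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx"


def question_url(reference):
    """Build the search URL for one question reference, e.g. 'S4W-1234'."""
    # A list of pairs keeps the parameters in the same order as the example URL.
    query = urlencode([("SearchType", "Advance"), ("ReferenceNumbers", reference)])
    return BASE_URL + "?" + query


print(question_url("S4W-1234"))
```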

For reference, the XML generated by the old written answers parser is here:

http://ukparse.kforge.net/svn/parldata/scrapedxml/sp-written/

nrhorner commented 10 years ago

Hi,

Just to let you know, I'm having a go at this. I've not written a scraper before so it might take me a day or two. How do I submit the script?

Neil

nrhorner commented 10 years ago

I've had a bash at it. This is an example of the output I get at the moment:

<?xml version="1.0" encoding="utf-8"?>
<publicwhip>
    <ques id="S4W-17949" speakerid="" speakername=" Jackson Carlaw, West Scotland, Scottish Conservative and Unionist Party" url="">To ask 
the Scottish Government, further to the answer to question S4W-17358 by 
John Swinney on 3 October 2013, how many of the Government Car Service 
vehicles listed have been (a) purchased outright and (b) leased.</ques>
    <reply id="S4W-17949" speakerid="" speakername=""></reply>
    <ques id="S4W-17946" speakerid="" speakername=" Anne McTaggart, Glasgow, Scottish Labour" url="">To ask the Scottish 
Government how many cases of female genital mutilation have been 
reported by (a) midwives and (b) other health professionals, broken down
 by (i) country of origin of patient and (ii) NHS board.</ques>
    <reply id="S4W-17946" speakerid="" speakername=""></reply>
</publicwhip>

A few things I'm not sure about:


Do I just change the date? What about the nospeaker attribute?

This is the code I have up to now. Let me know if you'd like me to carry on with this or not. I'm not sure if it's any use to you.

import urllib
import urllib2
import re
from bs4 import BeautifulSoup
from xml.dom.minidom import Document
import pprint
import datetime

def scrape():

    # Parse a locally saved copy of a search results page
    the_page = open("scrapetest.html", "r")
    soup = BeautifulSoup(the_page, "html5lib")

    parsedQuestions = []

    # Each matching table row holds one question and, if answered, its reply
    for question in soup.findAll('tr', {'id' : re.compile('MAQA_Search_gvResults.*')}):
        q = Question()

        h = question.find('div', {'id' : re.compile('.*pnlQuestionHeader')})

        if h:
            q.header = h.find('span', {'id' : re.compile('MAQA_Search_gvResults.*')}).string

        # Only pick the header apart if one was actually found
        if q.header:
            dateLodged = re.search('Date Lodged: (.*)', q.header)
            if dateLodged:
                q.dateLodged = dateLodged.group(1)
                #print "date : " + q.dateLodged

            quesID = re.search('Question  (.*?):', q.header)
            if quesID:
                q.quesID = quesID.group(1)

            speakerName = re.search(':(.*), Date Lodged', q.header)
            if speakerName:
                # strip() drops the space that follows the colon
                q.speakerName = speakerName.group(1).strip()

        title = question.find('span', {'id' : re.compile('.*lblQuestionTitle')})
        if title:
            q.title = unicode(title.p.string)

        replyText = question.find('span', {'id' : re.compile('.*lblAnswerText')})
        if replyText:
            q.replyText = unicode(replyText.p.string)

        answeredBy = question.find('span', {'id' : re.compile('.*lblAnswerByMSP')})
        if answeredBy:
            q.answeredBy = unicode(answeredBy.string)

        answerDate = question.find('span', {'id' : re.compile('.*lblAnswerDate')})
        if answerDate:
            q.answerDate = unicode(answerDate.string)

        qStatus = question.find('span', {'id' : re.compile('.*lblQuestionStatus')})
        if qStatus:
            q.qStatus = unicode(qStatus.string)

        parsedQuestions.append(q)

    printXml(parsedQuestions)

def printXml(parsedQuestions):

    # Write one XML file per lodging date
    for date, qList in groupByDate(parsedQuestions):

        doc = Document()
        base = doc.createElement('publicwhip')
        doc.appendChild(base)

        for q in qList:

            #The question
            ques = doc.createElement('ques')
            ques.setAttribute("id", q.quesID)
            ques.setAttribute("speakerid", "")
            ques.setAttribute("speakername", q.speakerName)
            ques.setAttribute("url", "")
            base.appendChild(ques)
            qText = doc.createTextNode(q.title)
            ques.appendChild(qText)

            #the reply
            reply = doc.createElement('reply')
            reply.setAttribute("id", q.quesID)
            reply.setAttribute("speakerid", "")
            reply.setAttribute("speakername", q.answeredBy)
            base.appendChild(reply)
            replyText = doc.createTextNode(q.replyText)
            reply.appendChild(replyText)

        # The page gives dates as dd/mm/yyyy; file names use ISO 8601
        date = datetime.datetime.strptime(date, '%d/%m/%Y')
        newDate = date.strftime('%Y-%m-%d')
        f = open(newDate + '.xml', 'w')
        f.write( doc.toprettyxml(indent="    ", encoding="utf-8") )
        f.close()

def groupByDate(questions):

    # Map each "Date Lodged" string to the questions lodged on that date
    groupedQs = {}

    for q in questions:
        if q.dateLodged in groupedQs:
            groupedQs[q.dateLodged].append(q)
        else:
            groupedQs[q.dateLodged] = [q]
    return groupedQs.iteritems()

class Question:

    # Defaults for any field the scraper fails to find
    quesID = ""
    dateLodged = ""
    header = ""
    speakerName = ""
    title = ""
    motionText = ""
    answeredBy = ""
    replyText = ""
    answerDate = ""
    qStatus = ""

if __name__ == "__main__":
    scrape()
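One possible tidy-up, offered only as a sketch: the three separate header searches could be folded into a single pattern with named groups. This assumes the header always has the shape "Question <ID>: <speaker>, Date Lodged: dd/mm/yyyy"; `HEADER_RE` and `parse_header` are illustrative names, and the block is written for Python 3 rather than the Python 2 used above.

```python
import re
from datetime import datetime

# Assumed header shape: "Question <ID>: <speaker>, Date Lodged: dd/mm/yyyy"
HEADER_RE = re.compile(
    r"Question\s+(?P<id>[^:]+):\s*(?P<speaker>.*?),\s*"
    r"Date Lodged:\s*(?P<date>\d{1,2}/\d{1,2}/\d{4})"
)


def parse_header(header):
    """Split a question header into ID, speaker and ISO 8601 lodging date.

    Returns None if the header does not match the assumed shape.
    """
    match = HEADER_RE.search(header)
    if match is None:
        return None
    lodged = datetime.strptime(match.group("date"), "%d/%m/%Y")
    return {
        "id": match.group("id").strip(),
        "speaker": match.group("speaker").strip(),
        "date_lodged": lodged.strftime("%Y-%m-%d"),
    }


example = ("Question S4W-01234: Willie Coffey, Kilmarnock and Irvine Valley, "
           "Scottish National Party, Date Lodged: 24/06/2011")
print(parse_header(example))
```

Returning None for an unrecognised header lets the caller skip that row instead of crashing part-way through a page.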

nrhorner commented 10 years ago

Just to let you know, I'm making progress on this. I'm off to Wales now, and will try to get it finished when I get back at the start of next week.

mhl commented 10 years ago

@nrhorner: Thanks for this contribution, and sorry for my delay in replying about those points - TheyWorkForYou is a largely spare-time project for me at the moment, even though I'm working for @mysociety :smiley:

How do I get the speaker ID?

You can see an example of resolving names in pyscraper/sp/parse-official-reports-new.py. That's probably over-the-top for what you need for the written answers, though - essentially you can, with the current directory being pyscraper/sp/, just import memberList:

from resolvemembernames import memberList

... and get a list of possible speaker IDs in response to passing in a name and a date:

>>> memberList.match_whole_speaker('Alexander, Ms Wendy (Paisley North) (Lab)', '2010-05-12')  
[u'uk.org.publicwhip/member/80281']
>>> memberList.match_whole_speaker('Wendy Alexander', '2010-05-12')  
[u'uk.org.publicwhip/member/80281']
>>> memberList.match_whole_speaker('Joe Bloggs', '2013-11-04')
[]
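A minimal sketch of turning that list into a speakerid attribute value, assuming (my choice, not something the existing parsers are confirmed to do) that both "no match" and "more than one match" should leave the attribute blank. `resolve_speaker_id` and `fake_match` are hypothetical names; in the scraper you would pass `memberList.match_whole_speaker` as `match_fn`. Written for Python 3.

```python
def resolve_speaker_id(match_fn, name, date):
    """Return the single matching member ID, or "" if none or ambiguous.

    `match_fn` is any callable taking (name, ISO date) and returning a list
    of candidate IDs, which is how match_whole_speaker behaves in the
    examples above.
    """
    matches = match_fn(name, date)
    if len(matches) == 1:
        return matches[0]
    return ""


def fake_match(name, date):
    """Stand-in for memberList.match_whole_speaker, for demonstration only."""
    known = {"Wendy Alexander": ["uk.org.publicwhip/member/80281"]}
    return known.get(name, [])


print(resolve_speaker_id(fake_match, "Wendy Alexander", "2010-05-12"))
print(repr(resolve_speaker_id(fake_match, "Joe Bloggs", "2013-11-04")))
```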

Your next question was:

How to properly format ques and reply IDs. Do I just change the date and keep the url the same?

The url attribute should be the URL that this information was scraped from - i.e. its original source. Question IDs are of the form:

uk.org.publicwhip/spwa/2011-03-17.S3W-40266.q0

... where the ISO 8601 date is the date that the question was lodged (extracted from the scraped page), the S3W-40266 bit is the Scottish Parliament's ID for that question, and the q0 indicates that it's the first question for that ID. (Sometimes you get multiple questions rolled together and given one answer, in which case the questions' suffixes go q0, q1, etc. Replies to a question similarly have the suffix r0, r1, etc.)
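To make the ID scheme concrete, here is a small sketch in Python 3. The helper names (`iso_date`, `question_id`, `reply_id`) are invented for illustration, the input date is assumed to be in the dd/mm/yyyy form shown on the search results page, and using the lodging date in reply IDs as well is an assumption on my part.

```python
from datetime import datetime

ID_PREFIX = "uk.org.publicwhip/spwa/"


def iso_date(lodged):
    """Convert a scraped 'dd/mm/yyyy' date into ISO 8601 'yyyy-mm-dd'."""
    return datetime.strptime(lodged, "%d/%m/%Y").strftime("%Y-%m-%d")


def question_id(lodged, sp_id, index=0):
    """ID for the index-th question sharing one Scottish Parliament ID."""
    return "%s%s.%s.q%d" % (ID_PREFIX, iso_date(lodged), sp_id, index)


def reply_id(lodged, sp_id, index=0):
    """ID for the index-th reply (assumes replies use the lodging date too)."""
    return "%s%s.%s.r%d" % (ID_PREFIX, iso_date(lodged), sp_id, index)


print(question_id("17/03/2011", "S3W-40266"))
print(reply_id("17/03/2011", "S3W-40266"))
```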

The <major-heading> element is typically for a date-specific title, e.g. "Written Answers Thursday 17 March 2011", and has an ID of the form: 2011-03-17.0.mh (the mh suffix stands for "major heading", and the number beforehand should increment for later major headings from that source; typically there's only one, though, I think).

The <minor-heading> element is for the subject of the question or questions, and is specific to a Scottish Parliament ID, e.g. uk.org.publicwhip/spwa/2011-03-17.S3W-40456.h - that's again the ISO 8601 date that the question was lodged, followed by the Scottish Parliament ID, followed by an h suffix (for "heading").

Not sure how to format the major and minor headings as in the below snippet [...] Do I just change the date. What about the nospeaker attribute?

Hopefully this is explained above. The major and minor headings for written questions / answers should always have nospeaker="True", but the <ques> and <reply> elements should have a speakername attribute and if possible a speakerid.
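Putting those pieces together, a hedged sketch of one question and reply with its headings, using the standard library's xml.etree.ElementTree in Python 3. `build_written_answer` and all the sample values are invented for illustration; the major heading ID is written exactly as quoted above, and whether it also needs the uk.org.publicwhip/spwa/ prefix is not settled in this thread.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ID_PREFIX = "uk.org.publicwhip/spwa/"


def build_written_answer(date, sp_id, subject, asker, question,
                         answerer, answer, url):
    """Return a <publicwhip> element for one question and its reply.

    `date` is the ISO 8601 date the question was lodged, e.g. "2011-03-17".
    """
    day = datetime.strptime(date, "%Y-%m-%d")
    root = ET.Element("publicwhip")

    # Major heading: date-specific title, never attributed to a speaker.
    major = ET.SubElement(root, "major-heading",
                          {"id": "%s.0.mh" % date, "nospeaker": "True"})
    major.text = "Written Answers %s %d %s" % (
        day.strftime("%A"), day.day, day.strftime("%B %Y"))

    # Minor heading: subject of the question, also with no speaker.
    minor = ET.SubElement(root, "minor-heading",
                          {"id": "%s%s.%s.h" % (ID_PREFIX, date, sp_id),
                           "nospeaker": "True"})
    minor.text = subject

    # speakerid is left blank here; fill it in once the name is resolved.
    ques = ET.SubElement(root, "ques",
                         {"id": "%s%s.%s.q0" % (ID_PREFIX, date, sp_id),
                          "speakerid": "", "speakername": asker, "url": url})
    ques.text = question

    reply = ET.SubElement(root, "reply",
                          {"id": "%s%s.%s.r0" % (ID_PREFIX, date, sp_id),
                           "speakerid": "", "speakername": answerer})
    reply.text = answer
    return root


# All values below are placeholders, not real parliamentary data.
root = build_written_answer(
    "2011-03-17", "S3W-40266", "Example subject", "Example Member",
    "To ask the Scottish Executive an example question.",
    "Example Minister", "An example answer.",
    "http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx"
    "?SearchType=Advance&ReferenceNumbers=S3W-40266")
print(ET.tostring(root, encoding="unicode"))
```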

This is all based on the parlparse XML schema documentation here:

... see the written answers section.

I hope that's of some use - please let me know if that's unclear, it's been quite a while since I worked on the earlier version!

nrhorner commented 10 years ago

Great! Thanks for that. I'll get on with it this week.

nrhorner commented 10 years ago

@mhl: I'm making progress on this. Just a few things I need to know.

You mentioned that questions and answers sometimes get rolled-up into one. I can't find an example of this, so not sure how I would parse it.

Not sure where to get the minor heading subject text. For example here is a question:

Question S4W-01234: Willie Coffey, Kilmarnock and Irvine Valley, Scottish National Party, Date Lodged: 24/06/2011

To ask the Scottish Executive how much of the funding provided to (a) sportscotland and (b) sport governing bodies in each year since 1999 was awarded (i) without competition and (ii) following a competitive bid process.

Occasionally I get multiple possible IDs from resolvemembernames. What do I do then?

And finally, I can't get lxml.etree to write extra lines between elements as in the current XML files. Is this required? If so I can just manually create the XML as it's a very simple format.

nrhorner commented 10 years ago

I found in pyscraper/sp/parse-official-reports-new.py how to deal with multiple IDs

mhl commented 10 years ago

Hi @nrhorner - thanks for the update - I've tried to address your questions below.

Multiple questions with one answer

It looks as though the Scottish Parliament website no longer lumps together questions, but attaches the same answer to each one. For example, here was an old example of two questions with one answer:

Now that appears as two questions in the search:

So I think you probably don't need to worry about that.

Minor heading subject text

It seems as if there's no longer a useful title for the minor-heading text, so I'd just make it "Question S4W-01234".

Ambiguous results from resolvemembernames

It might be worth mentioning here some of the name and date pairings that you get multiple possible IDs for - it seems that they should really be unambiguous.

Vertical whitespace in XML output

Finally, don't worry about the extra lines between elements - lxml.etree's pretty-printed output is fine.

nrhorner commented 10 years ago

@mhl: Issues above resolved, thanks. I've issued a pull request for the script.